Removing <br> tags so the output aligns correctly when web scraping with Selenium and Python

qq_花开花谢_0 2023-04-25 16:52:44
I want to remove the <br> HTML tags when scraping a page, but replace doesn't seem to work. I'm not sure whether there is another or better way to do this with Selenium and Python. Thanks in advance.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("drivers/chromedriver")
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Hampshire")

driver.find_element_by_id("city").send_keys("Moultonborough")
driver.find_element_by_id("name").send_keys("Moultonborough Academy")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

courses_subheading = driver.find_elements_by_tag_name("th.header")
print(courses_subheading[0].text, "     ", courses_subheading[1].text, "     ", courses_subheading[2].text, "     ", courses_subheading[3].text, "     ", courses_subheading[4].text)

I tried this:

for i in courses_subheading:
    courses_subheading.replace("<br>", " ")

but got an error:

AttributeError: 'list' object has no attribute 'replace'

Currently, the output looks like this:

CourseWeight     Title     Notes     MaxCredits       OKThrough       DisabilityCourse

But I want it to look like this:

Course Weight     Title     Notes     Max Credits     OK     Through     Disability Course

2 answers

肥皂起泡泡

Contributed 1829 experience points · received more than 6 likes

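There is no need to remove the <br> tags; you can simply avoid them. To print the table headers, e.g. Title, Notes, etc., you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies: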

Using css_selector:


driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")

driver.find_element_by_css_selector("input#city").send_keys("Moultonborough")

driver.find_element_by_css_selector("input#name").send_keys("Moultonborough Academy")

driver.find_element_by_css_selector("input[value='Search']").click()

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='hsCode']"))).click()

print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#approvedCourseTable_1 th.header")))])

Using xpath:


driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")

driver.find_element_by_xpath("//input[@id='city']").send_keys("Moultonborough")

driver.find_element_by_xpath("//input[@id='name']").send_keys("Moultonborough Academy")

driver.find_element_by_xpath("//input[@value='Search']").click()

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='hsCode']"))).click()

print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='approvedCourseTable_1']//th[@class='header']")))])

Console output:


['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']
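
Each \n in the strings above marks where a <br> sat in the page, because an element's .text holds the rendered text rather than the markup. Below is a minimal sketch of how those line breaks could be collapsed into spaces and the headers padded into aligned columns, using only built-in string methods. It assumes the list printed above has been stored in a variable; the names raw_headers, clean and header_row are chosen here for illustration and do not come from the original answer.

```python
# Header texts copied from the console output above; in a real run they would
# come from the WebDriverWait(...) list comprehension instead.
raw_headers = ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']


def clean(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


headers = [clean(h) for h in raw_headers]

# Pad every header to the widest one plus a small gap so the columns line up.
width = max(len(h) for h in headers) + 2
header_row = "".join(h.ljust(width) for h in headers).rstrip()

print(headers)
print(header_row)
```

Padding to a common width rather than inserting a fixed run of spaces keeps the columns aligned even when the header lengths differ, which is what the fixed "     " separators in the question could not guarantee.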

Note: you have to add the following imports:


from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.common.by import By

from selenium.webdriver.support import expected_conditions as EC


Answered 2023-04-25
拉丁的传说

Contributed 1789 experience points · received more than 8 likes

To complete the picture: if you really want to remove the br tags, you can use the following (I have fixed your XPath expression):


import re

courses_subheading = driver.find_elements_by_xpath("(//tr[th[@class='header']])[1]/th")

headers = [re.sub(r'\s+', ' ', el.text) for el in courses_subheading]  # raw string, so \s is not treated as an invalid string escape

print(headers)

Output:


['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']
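
The AttributeError in the question arises because replace was called on the list itself rather than on the text of each element. In addition, as the console output in the first answer shows, .text contains a newline where the <br> was, so searching for the literal string "<br>" would not match anything. The sketch below imitates this without a browser. FakeElement is a hypothetical stand-in for Selenium's WebElement, introduced only for this example; it is not part of Selenium and models nothing but the text attribute.

```python
import re
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a WebElement; only .text is modeled."""
    text: str


courses_subheading = [
    FakeElement("Course\nWeight"),
    FakeElement("Title"),
    FakeElement("Max\nCredits"),
]

# Reproduce the mistake from the question: a list has no replace() method.
try:
    courses_subheading.replace("<br>", " ")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Work on each element's text instead, collapsing whitespace as in the answer above.
headers = [re.sub(r"\s+", " ", el.text).strip() for el in courses_subheading]

print(error_message)
print(headers)
```

The same comprehension should work unchanged on the list returned by find_elements, since it only reads each item's text attribute.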



Answered 2023-04-25