
Scraping text from a web page returns an empty set

德玛西亚99 2022-06-14 15:25:34
When using Beautiful Soup findAll, the code doesn't scrape any text because it returns an empty set. There are further problems in the code after this point, but at this stage I am trying to solve the first one. I'm new to this, so I appreciate the code structure is probably not ideal; I come from a VBA background.

import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\Users\mmanenica\Documents\chromedriver.exe")

#click the search button on Austenders to return all Awarded Contracts
import time
#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')
#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    exit;
print('Search Button found')
Search_Results.click()

#Pause code to prevent blocking by website
time.sleep(1)
i = 0
Awarded = []

#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        exit;
    i = i + 1
    time.sleep(2)

1 Answer

梦里花落0921


As already mentioned, you never actually feed the HTML source into BeautifulSoup. So the first change is from soup = BeautifulSoup(driver.current_url, features='lxml') to soup = BeautifulSoup(driver.page_source, features='lxml').
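To make the difference explicit (just a before/after sketch):

# driver.current_url is only the URL string, so BeautifulSoup has no HTML to parse
# soup = BeautifulSoup(driver.current_url, features='lxml')

# driver.page_source is the rendered HTML of the page Selenium is currently on
soup = BeautifulSoup(driver.page_source, features='lxml')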


Second issue: some of those elements have no <a> tag with class=detail, so you end up calling .get('href') on a NoneType. I added a try/except to skip those cases (though I'm not sure that produces the result you want). You could also drop the class filter and just use Details_Page = each_Contract.find('a').get('href').
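If you'd rather avoid try/except, a minimal sketch of the equivalent None check inside that for loop:

link = each_Contract.find('a', {'class': 'detail'})
if link is None:
    continue  # this div has no detail link, skip it
Details_Page = link.get('href')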


Next, that href is only the path part of the URL, so you need to prepend the site root: driver.get('https://www.tenders.gov.au' + Details_Page).
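If you prefer it over string concatenation, urljoin from the standard library does the same thing and also copes with hrefs that are already absolute (just a sketch):

from urllib.parse import urljoin
driver.get(urljoin('https://www.tenders.gov.au', Details_Page))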


I also can't see where the class=Contact-Heading you refer to actually appears on the page.


You also write 'class': 'list-desc-inner' in one place and 'class': 'list_desc_inner' in another; again, I don't see any class=list_desc_inner on the page.


Next: to append a list to a list you want Awarded.append(Combined), not Awarded.append[Combined].
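In other words:

# Awarded.append[Combined]   # wrong: square brackets try to subscript the method (TypeError)
Awarded.append(Combined)      # right: call append with parentheses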


I also added .strip() to clean up some of the whitespace in the scraped text.
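For example, with a throwaway string of the kind you get back from .text:

raw = '  Contract Value:\n'
print(raw.strip())   # 'Contract Value:' -- the leading/trailing spaces and the newline are gone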


In any case, there is still quite a lot you will need to fix and clean up, and I don't know what your expected output is supposed to look like, but hopefully this gets you started.


Also, as mentioned in the comments, you could simply click the site's download button and get the results straight away, but perhaps you are doing this as an exercise...


import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")

#click the search button on Austenders to return all Awarded Contracts
import time
#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')
#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    exit()  # call exit(); a bare "exit;" only names the function and does nothing
print('Search Button found')
Search_Results.click()

#Pause code to prevent blocking by website
time.sleep(1)
i = 0
Awarded = []

#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        break  # break out of the loop so driver.close() below still runs

    i = i + 1
    time.sleep(2)

    #Loop through all the Detail links on the current Search Results Page
    print("Checking search results page " + str(i))
    print(driver.current_url)
    soup = BeautifulSoup(driver.page_source, features='lxml')
    #Find all Contract detail links in the current search results page
    Details = soup.findAll('div', {'class': 'list-desc-inner'})

    for each_Contract in Details:
        #Loop through each Contract details link and scrape all the detailed
        #Contract information page
        try:
            Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href')
            driver.get('https://www.tenders.gov.au' + Details_Page)
            #Scrape all the data in the Awarded Contract page
            #r = requests.get(driver.current_url)
            soup = BeautifulSoup(driver.page_source, features='lxml')

            #find a list of all the Contract Info (contained in the 'Contact-Heading'
            #class of the span element)
            Contract = soup.find_all('span', {'class': 'Contact-Heading'})
            Contract_Info = [span.text.strip() for span in Contract]

            #find a list of all the Summary Contract info which is in the text of
            #the 'list-desc-inner' class
            Sub = soup.find_all('div', {'class': 'list-desc-inner'})
            Sub_Info = [div.text.strip() for div in Sub]

            #Combine the lists into a unified list and append to the Awarded table
            Combined = [Contract_Info, Sub_Info]
            Awarded.append(Combined)

            #Go back to the Search Results page (from the Detailed Contract page)
            driver.back()
        except:
            continue

    #Go to the next Search Page by clicking on the Next button at the bottom of the page
    Next_Page.click()
    time.sleep(3)

driver.close()
print(len(Awarded))  # a plain list has no .Shape attribute; len() gives the number of scraped contracts
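Since pandas is imported but never used, one possible follow-on step (just a sketch; the column names and the CSV filename are my own inventions, not from the original code) is to flatten Awarded into a DataFrame:

# Each entry in Awarded is [Contract_Info, Sub_Info]
df = pd.DataFrame([{'contract_info': c, 'summary_info': s} for c, s in Awarded])
print(df.shape)                                  # (rows, columns) -- likely what Awarded.Shape was aiming for
df.to_csv('awarded_contracts.csv', index=False)  # hypothetical output file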


