Getting table information from a page using Python and BeautifulSoup

Html5

jeck猫 2023-12-19 16:08:16

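The page I am trying to get information from is https://www.pro-football-reference.com/teams/crd/2017_roster.htm. I am trying to get all the information from the "Roster" table, but for some reason I can't get it through BeautifulSoup. I have tried soup.find("div", {'id': 'div_games_played_team'}) but it doesn't work. When I look at the page's HTML, I can see the table both inside a very large comment and in a regular div. How can I use BeautifulSoup to get the information from this table?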
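To make the symptom concrete, here is a minimal sketch of why a lookup like this can come back empty. It assumes the div only exists inside an HTML comment in the markup the parser receives; SAMPLE_HTML, the outer wrapper id, and the table id are hypothetical stand-ins, not the real page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page: the table wrapper sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_games_played_team">
  <!--
  <div id="div_games_played_team">
    <table id="games_played_team"><tr><td>Budda Baker</td></tr></table>
  </div>
  -->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The commented-out div is not part of the element tree, so find() returns None.
direct_hit = soup.find("div", {"id": "div_games_played_team"})

# The same markup is still reachable as the text of a Comment node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
in_comment = [c for c in comments if "div_games_played_team" in c]

print(direct_hit)       # None
print(len(in_comment))  # 1
```

If the real response behaves like this sample, the data is present in the download and only needs a second parsing step.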


2 Answers

30秒到达战场

Contributed 1828 experience points, received more than 6 likes

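You don't need Selenium. What you can do (and you identified it correctly) is pull out the comments and then parse the table from within them.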

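    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup, Comment

    url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Collect every HTML comment node in the page
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))

    tables = []
    for each in comments:
        # Only comments that contain table markup are worth parsing
        if 'table' in each:
            try:
                # StringIO: newer pandas versions deprecate passing literal HTML strings
                tables.append(pd.read_html(StringIO(each))[0])
            except ValueError as e:
                # read_html raises ValueError when it finds no table
                print(e)
                continue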

Output:

    print(tables[0].head().to_string())

        No.           Player   Age  Pos   G   GS     Wt    Ht  College/Univ  BirthDate   Yrs   AV                            Drafted (tm/rnd/yr)      Salary
    0  54.0  Bryson Albright  23.0  NaN   7  0.0  245.0   6-5    Miami (OH)  3/15/1994     1  0.0                                            NaN    $246,177
    1  36.0    Budda Baker*+  21.0   ss  16  7.0  195.0  5-10    Washington  1/10/1996  Rook  9.0     Arizona Cardinals / 2nd / 36th pick / 2017    $465,000
    2  64.0    Khalif Barnes  35.0  NaN   3  0.0  320.0   6-6    Washington  4/21/1982    12  0.0  Jacksonville Jaguars / 2nd / 52nd pick / 2005    $176,471
    3  41.0   Antoine Bethea  33.0   db  15  6.0  206.0  5-11        Howard  7/27/1984    11  4.0   Indianapolis Colts / 6th / 207th pick / 2006  $2,000,000
    4  28.0    Justin Bethel  27.0  rcb  16  6.0  200.0   6-0  Presbyterian  6/17/1990     5  3.0    Arizona Cardinals / 6th / 177th pick / 2012  $2,000,000
    ....
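If only one specific table is wanted and pandas is not needed, the comment text can be parsed a second time with BeautifulSoup. The following is a rough sketch of that variation, not the answer's own method: roster_rows is a hypothetical helper, and SAMPLE_HTML is a cut-down stand-in for the real page with made-up structure and only three columns.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, trimmed-down stand-in for the roster page.
SAMPLE_HTML = """
<div id="all_games_played_team">
<!--
<div id="div_games_played_team">
<table>
<thead><tr><th>No.</th><th>Player</th><th>Pos</th></tr></thead>
<tbody>
<tr><td>36</td><td>Budda Baker</td><td>ss</td></tr>
<tr><td>41</td><td>Antoine Bethea</td><td>db</td></tr>
</tbody>
</table>
</div>
-->
</div>
"""


def roster_rows(html, div_id="div_games_played_team"):
    """Return the table rows (header first) found in the commented-out div."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        # Re-parse the comment text as HTML so its tags become searchable.
        inner = BeautifulSoup(comment, "html.parser")
        div = inner.find("div", {"id": div_id})
        if div is None:
            continue
        return [
            [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in div.find_all("tr")
        ]
    return []  # no comment contained the requested div


rows = roster_rows(SAMPLE_HTML)
print(rows)
```

Searching by the div id inside each comment avoids relying on the order of tables in the page, which the tables[0] index in the pandas version depends on.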

Posted 2023-12-19

慕无忌1623718

Contributed 1744 experience points, received more than 4 likes

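The tag you are trying to scrape is generated dynamically by JavaScript. You are most likely using requests to fetch the HTML. Unfortunately, requests does not run JavaScript; it simply retrieves the HTML as raw text. BeautifulSoup cannot find the tag because it was never generated inside your scraper.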
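Whether a given tag is missing from the downloaded markup or merely hidden in a comment can be checked directly on the raw text. Below is a small sketch using only the standard library's html.parser; IdLocator and locate are hypothetical names, and the check assumes the id is written as id="..." with double quotes when it appears inside a comment.

```python
from html.parser import HTMLParser


class IdLocator(HTMLParser):
    """Record whether a target id shows up as a live element or inside a comment."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_element = False
        self.in_comment = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for a real element.
        if dict(attrs).get("id") == self.target_id:
            self.in_element = True

    def handle_comment(self, data):
        # Comment text is not parsed into tags, so use a plain substring check.
        if f'id="{self.target_id}"' in data:
            self.in_comment = True


def locate(html, target_id):
    """Classify where target_id appears: 'element', 'comment', or 'absent'."""
    parser = IdLocator(target_id)
    parser.feed(html)
    parser.close()
    if parser.in_element:
        return "element"
    if parser.in_comment:
        return "comment"
    return "absent"


print(locate('<div id="x"></div>', "x"))
print(locate('<!-- <div id="x"></div> -->', "x"))
print(locate('<div id="y"></div><script>render()</script>', "x"))
```

A result of "absent" on the text returned by requests would be consistent with content produced by JavaScript after load, while "comment" means the markup was downloaded and can be parsed without a browser.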

Posted 2023-12-19
