1 Answer

You can try this.
Code:
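import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Albert_Einstein'
res = requests.get(url)
res.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(res.text, 'lxml')  # requires the lxml package
# print(soup.prettify())

# Locate the page title; string= is the current name for the older text= argument
heading = soup.find('h1', class_='firstHeading', string='Albert Einstein')
if heading is None:
    raise ValueError('heading not found')
# Every tag that opens before the heading, in document order, minus the first one (<html>)
until_soup = heading.find_all_previous()[::-1][1:]
# until_soup is a list of bs4 Tag objects: print(type(until_soup[0]))
# print(until_soup)

# Serialize the tags; output is one str, no longer Tag objects: print(type(output))
# Note: str(tag) includes the tag's descendants, so nested markup appears more than once
output = ''.join(str(tag) for tag in until_soup)
# print(output)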
I strongly recommend using an API call instead, as shown below:
import wikipediaapi
# Recent Wikipedia-API releases require a user_agent; the value here is a placeholder
wiki_html = wikipediaapi.Wikipedia(
    user_agent='MyProject (me@example.com)',
    language='en',
    extract_format=wikipediaapi.ExtractFormat.HTML,
)
p_html = wiki_html.page('Albert Einstein')
# print(p_html.text)
# p_html.text is a str, so you can regex-match up to the heading you want
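To illustrate the regex idea, here is a minimal sketch that cuts an HTML string at the first heading whose visible text matches. The helper name html_before_heading and the sample string are made up for this example, and it assumes the headings arrive as plain <h1> to <h6> tags, which may not hold for every page.

```python
import re

# Matches any <h1>..</h1> through <h6>..</h6>, capturing the heading's inner HTML
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)

def html_before_heading(html, heading_text):
    """Return the part of html that precedes the first heading whose
    visible text equals heading_text; return html unchanged if none matches."""
    for match in HEADING_RE.finditer(html):
        # Drop nested tags such as <span> so only the visible text is compared
        visible = re.sub(r"<[^>]+>", "", match.group(2)).strip()
        if visible == heading_text:
            return html[:match.start()]
    return html

# Hypothetical sample standing in for p_html.text
sample = (
    "<p>Intro paragraph.</p>"
    "<h2>Biography</h2><p>Born in Ulm.</p>"
    "<h2>Legacy</h2><p>Awards.</p>"
)
print(html_before_heading(sample, "Legacy"))
```

With the real page you would pass p_html.text in place of sample. A regex is fine for a quick cut like this, but for anything more involved an HTML parser is the safer choice.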