
How do I scrape all the text on a web page up to a specific heading in Python?


幕布斯6054654 2022-06-22 17:31:36
I am trying to print all the text on a web page, from the start of the page up to a specific heading. I want everything before that heading and nothing after it. The code I tried to run (Python 3):

import requests
import bs4
from bs4 import BeautifulSoup

urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = (bs4.BeautifulSoup(res.text, 'lxml')).get_text()
print(soup1)

The code produces the following output:

Albert Einstein - Wikipediadocument.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Albert_Einstein","wgTitle":"Albert Einstein","wgCurRevisionId":920687884,"wgRevisionId":920687884,"wgArticleId":736,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with missing ISBNs","Webarchive template wayback links","CS1 German-language sources (de)","CS1: Julian–Gregorian uncertainty","CS1 French-language sources (fr)","CS1 errors: missing periodical","CS1: long volume value","Wikipedia indefinitely semi-protected pages","Use American English from February 2019","All Wikipedia articles written in American English","Articles with short description","Good articles","Articles containing German-language text","Biography with signature","Articles with hCards","Articles with hAudio microformats","All articles with unsourced statements",

1 Answer

喵喔喔


You can try the following.




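import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Albert_Einstein'
res = requests.get(url)

soup = BeautifulSoup(res.text, 'lxml')
# print(soup.prettify())

# Locate the page-title heading, then collect every tag that precedes it.
# find_all_previous() returns tags in reverse document order, so [::-1]
# restores document order and [1:] drops the first tag (the root <html>).
# `string=` is the current name of the older `text=` argument.
heading = soup.find('h1', class_='firstHeading', string='Albert Einstein')
until_soup = heading.find_all_previous()[::-1][1:]
# until_soup is a list of bs4 Tag objects: print(type(until_soup[0]))
# print(until_soup)

# Serialize each tag to an HTML string and concatenate them.
# Note: parents and their children are both in the list, so nested
# markup appears more than once in the result.
output = ''.join(str(tag) for tag in until_soup)
# output is now a single str, not Tag objects: print(type(output))
# print(output)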
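
The snippet above stops at the page title (the h1 heading). If the goal is plain text up to a section heading such as "Publications", one possible approach is to walk the parsed document in order and collect text nodes until that heading is reached. This is only a minimal sketch: the helper name text_until_heading and the inline sample HTML are my own illustration, not part of the original code. On a live Wikipedia page the heading text may carry extra markup (for example an edit link), so the exact-match test may need adjusting.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

HEADING_NAMES = ("h1", "h2", "h3", "h4", "h5", "h6")


def text_until_heading(html, heading_text, parser="html.parser"):
    """Return the text that precedes the first heading equal to heading_text.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, parser)

    # Remove <script> and <style> so their contents are not treated as text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Find the first h1-h6 whose stripped text equals heading_text.
    target = soup.find(
        lambda tag: tag.name in HEADING_NAMES
        and tag.get_text(strip=True) == heading_text
    )
    if target is None:
        raise ValueError(f"heading {heading_text!r} not found")

    pieces = []
    # .descendants yields nodes in document order; stop at the target heading.
    for node in soup.descendants:
        if node is target:
            break
        # Keep ordinary text nodes only (skips comments, doctype, and so on).
        if type(node) is NavigableString:
            text = node.strip()
            if text:
                pieces.append(text)
    return "\n".join(pieces)


# Small stand-in for a real page, so the example runs without network access.
sample_html = """
<html><head><title>Demo</title><script>var x = 1;</script></head>
<body>
<h1>Albert Einstein</h1>
<p>Intro paragraph.</p>
<h2>Life</h2>
<p>Born in <a href="#">Ulm</a>.</p>
<h2>Publications</h2>
<p>Should not appear.</p>
</body></html>
"""

result = text_until_heading(sample_html, "Publications")
print(result)
```

To try it on the real page, pass res.text instead of sample_html (and 'lxml' as the parser if you prefer it).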

I strongly recommend using an API call instead, as shown below:


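import wikipediaapi

# Recent Wikipedia-API releases expect a user agent as the first argument;
# the value below is a placeholder. Older releases do not accept it,
# so drop that argument there.
wiki_html = wikipediaapi.Wikipedia(
    user_agent='MyProjectName (me@example.com)',
    language='en',
    extract_format=wikipediaapi.ExtractFormat.HTML,
)
p_html = wiki_html.page('Albert Einstein')
# print(p_html.text)
# p_html.text is a str, so you can regex-match up to the heading you want.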
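
The regex matching mentioned in the last comment could look roughly like the sketch below. It assumes the returned string marks sections with plain heading tags such as <h2>Publications</h2>; check the actual output before relying on that. The helper name cut_before_heading and the sample string are hypothetical.

```python
import re


def cut_before_heading(html_text, heading):
    """Return html_text truncated just before the first <h1>-<h6> tag
    whose content is exactly `heading` (surrounding whitespace allowed).

    Hypothetical helper for illustration; returns the input unchanged
    when no such heading is found.
    """
    pattern = re.compile(
        r"<h[1-6][^>]*>\s*" + re.escape(heading) + r"\s*</h[1-6]>"
    )
    match = pattern.search(html_text)
    if match is None:
        return html_text
    return html_text[: match.start()]


# Stand-in for p_html.text so the example runs offline.
sample_text = (
    "<p>Summary.</p>\n"
    "<h2>Life</h2>\n"
    "<p>Born in Ulm.</p>\n"
    "<h2>Publications</h2>\n"
    "<p>Papers.</p>"
)

trimmed = cut_before_heading(sample_text, "Publications")
print(trimmed)
```

With the real page you would call cut_before_heading(p_html.text, 'Publications').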


Answered 2022-06-22