
How do I extract body paragraphs with BeautifulSoup?

呼如林 2022-12-20 12:33:12
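I am trying to extract text from a website with BeautifulSoup, but I am open to other options. At the moment I am trying something like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

boston_url = 'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(boston_url, headers=hdr)
webpage = urlopen(req)
htmlText = webpage.read().decode('utf-8')
pageText = BeautifulSoup(htmlText, "html.parser")
body = pageText.find_all(text=True)

The goal is to work out how to extract the text in the red box. You can see the output I get in the CMD screenshot below. It is very messy, and I am not sure how to find the body paragraphs in it. I could loop over the output and look for certain words, but I need to do this for multiple sites, and I will not know in advance what the body paragraphs contain.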
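One reason the output of `find_all(text=True)` looks messy is that it returns every text node in the document, including the contents of `<script>` and `<style>` tags, HTML comments, and whitespace between tags. The following is a minimal sketch of filtering those out. `SAMPLE_HTML`, `HIDDEN_PARENTS`, and `visible_strings` are hypothetical names introduced for illustration, and the inline markup stands in for a fetched page so the sketch runs without network access.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used so the sketch runs offline.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <style>p { color: red; }</style>
    <script>var tracking = true;</script>
  </head>
  <body>
    <!-- layout comment -->
    <p>First body paragraph.</p>
    <p>Second body paragraph.</p>
  </body>
</html>
"""

# Parents whose text nodes are not rendered as page content.
# This set is an assumption; extend it as needed for other sites.
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def visible_strings(html):
    """Return non-empty text nodes that are not scripts, styles, or comments."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # HTML comments are text nodes too
        if node.parent.name in HIDDEN_PARENTS:
            continue  # skip script/style/head content
        text = node.strip()
        if text:  # drop whitespace-only nodes between tags
            kept.append(text)
    return kept


print(visible_strings(SAMPLE_HTML))
# prints ['First body paragraph.', 'Second body paragraph.']
```

This only removes the obvious noise. Navigation menus and footers are still visible text, so a selector that targets the content container is usually needed as well.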

2 Answers

HUX布斯

Contributed 1876 experience points, received more than 6 likes

It may be simpler than what you are doing. Let's try to simplify it:


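import requests
from bs4 import BeautifulSoup as bs

boston_url = 'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
hdr = {'User-Agent': 'Mozilla/5.0'}

# Fetch the page with a browser-like User-Agent header.
req = requests.get(boston_url, headers=hdr)

# Parse with lxml (requires the lxml package to be installed).
soup = bs(req.text, 'lxml')

# First <p> that is a direct child of div.ma__rich-text inside the nested <main> elements.
soup.select('main main div.ma__rich-text>p')[0].text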

Output:


'PERAC has not reviewed the RFP notices or other related materials posted on this page for compliance with M.G.L. Chapter 32, section 23B. The publication of these notices should not be interpreted as an indication that PERAC has made a determination as to that compliance.'
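To show what the selector does, here is a small offline sketch. In `div.ma__rich-text > p`, the `>` child combinator keeps only `<p>` elements that are direct children of the div, and the leading `main main` requires the div to sit inside two nested `<main>` elements. The markup in `SAMPLE_HTML` is a hypothetical imitation of that nesting, not a copy of the real page, and `rich_text_paragraphs` and `first_paragraph` are illustrative helper names. `select_one` is used in place of `[0]` so that a page without a match yields `None` and not an `IndexError`, which matters when the same code runs against several sites.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the nesting the selector expects;
# the real mass.gov page may be structured differently.
SAMPLE_HTML = """
<main>
  <main>
    <div class="ma__rich-text">
      <p>Intro paragraph.</p>
      <ul><li><p>Nested paragraph, not a direct child.</p></li></ul>
      <p>Second paragraph.</p>
    </div>
  </main>
</main>
<div class="ma__rich-text"><p>Outside main, ignored.</p></div>
"""

DEFAULT_SELECTOR = "main main div.ma__rich-text > p"


def rich_text_paragraphs(html, selector=DEFAULT_SELECTOR):
    """Return the text of every element matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(selector)]


def first_paragraph(html, selector=DEFAULT_SELECTOR):
    """Return the first match's text, or None when nothing matches."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node is not None else None


print(rich_text_paragraphs(SAMPLE_HTML))
# prints ['Intro paragraph.', 'Second paragraph.']
print(first_paragraph("<p>No rich-text container here.</p>"))
# prints None
```

For several sites, one option is to keep a per-site selector and pass it in as the `selector` argument, since class names such as `ma__rich-text` are specific to one site.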


Answered 2022-12-20
慕姐8265434

Contributed 1813 experience points, received more than 2 likes

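You can use bs.find('p', string=re.compile('PERAC')) to extract that paragraph (older Beautiful Soup versions spell the argument text=):

from bs4 import BeautifulSoup
import requests
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/83.0.4103.61 Safari/537.36'
}

boston_url = (
    'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
)

resp = requests.get(boston_url, headers=headers)
# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning.
bs = BeautifulSoup(resp.text, 'html.parser')
# Find the first <p> whose text matches the regular expression.
bs.find('p', string=re.compile('PERAC'))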
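One caveat with matching on a tag's string: when a tag name is given, Beautiful Soup compares the pattern against the tag's `.string`, which is `None` if the paragraph contains nested tags such as a link. The sketch below contrasts that strict match with a looser one that searches the paragraph's full text. `SAMPLE_HTML`, `strict_matches`, and `paragraphs_matching` are hypothetical names used only for illustration. The `string=` argument is the current name of the older `text=` argument.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup so the sketch runs offline.
SAMPLE_HTML = """
<div>
  <p>PERAC has not reviewed the notices.</p>
  <p>Contact <a href="#">PERAC</a> for details.</p>
  <p>Unrelated paragraph.</p>
</div>
"""


def strict_matches(html, pattern):
    """Match only <p> tags whose single string child matches the pattern."""
    soup = BeautifulSoup(html, "html.parser")
    found = soup.find_all("p", string=re.compile(pattern))
    return [p.get_text(" ", strip=True) for p in found]


def paragraphs_matching(html, pattern):
    """Match <p> tags on their full text, including text inside nested tags."""
    soup = BeautifulSoup(html, "html.parser")
    regex = re.compile(pattern)
    return [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p")
        if regex.search(p.get_text())
    ]


print(strict_matches(SAMPLE_HTML, "PERAC"))
# prints ['PERAC has not reviewed the notices.']
print(paragraphs_matching(SAMPLE_HTML, "PERAC"))
# prints ['PERAC has not reviewed the notices.', 'Contact PERAC for details.']
```

Either form still requires knowing a keyword from the target paragraph, so it fits a single known page better than a set of unfamiliar sites.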


Answered 2022-12-20