
How do I get the first 3 sentences of a webpage with Python?

互换的青春 2023-10-25 10:48:03
I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy, but I'm having problems figuring out how to find the first 3 sentences.

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

text = soup.find_all(text=True)
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

# Keep only the first three '.'-delimited chunks. (The originally
# posted loop, "for i in range(tempout)", raised a TypeError and
# also removed items from the list while iterating over it.)
tempout = output.split('.')
output = '.'.join(tempout[:3]) + '.'
print(output)

3 Answers

青春有我


Picking sentences out of text is hard. Typically you look for characters that can end a sentence, such as '.' and '!'. But a period ('.') can also appear in the middle of a sentence, for example in the abbreviation of a person's name. I use a regular expression that looks for a period followed by a single space or the end of the string; it works for the first three sentences here, but not for arbitrary sentences.


import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

# Only look at paragraphs inside the article body.
paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    # A "sentence" is the shortest run of text ending in '.' or '!'
    # that is followed by a space or the end of the string.
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)

Prints:


['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
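To illustrate the caveat above, the same pattern applied to a small sample string (the sentence is made up for illustration) splits an abbreviation as if it ended a sentence:

```python
import re

# Same pattern as above; 'Dr.' is wrongly treated as a full sentence.
pattern = r'(.+?[.!])(?: |$)'
sample = 'Dr. Smith arrived. He left!'
print(re.findall(pattern, sample))
# → ['Dr.', 'Smith arrived.', 'He left!']
```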



忽然笑


Actually, with Beautiful Soup you can filter on the class "article_text post"; look at the page source:

myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)

and get the inner text of the p element.

Use this after the line soup = BeautifulSoup(html_page, 'html.parser').
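A self-contained sketch of that suggestion, using an inline HTML snippet in place of the fetched page (the class name is taken from the answer; the snippet's structure is only assumed to mirror the real article page):

```python
from bs4 import BeautifulSoup

# Stand-in for res.content fetched with requests (illustrative markup).
html_page = '''
<html><body>
<section class="article_text post">
  <p>First paragraph. With two sentences.</p>
  <p>Second paragraph.</p>
</section>
</body></html>
'''

soup = BeautifulSoup(html_page, 'html.parser')
myData = soup.find('section', class_='article_text post')
print(myData.p.text)  # inner text of the first <p> within the section
```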


侃侃无极


To grab the first three sentences, just add these lines to your code:


# Find the section tag with class "article_text post".
section = soup.find('section', class_='article_text post')

# Get the text within the first p tag inside that section.
txt = section.p.text

print(txt)

Output:


Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
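Note that this grabs only the first paragraph, which happens to hold two sentences. A naive follow-up sketch (illustrative sample text; it splits after '.', '!' or '?' followed by a space, so abbreviations still trip it up, as the first answer explains) caps the result at exactly three sentences:

```python
import re

# Illustrative stand-in for text gathered from the article's paragraphs.
txt = 'First sentence. Second sentence! Third sentence? Fourth sentence.'

# Split after sentence-ending punctuation followed by a space,
# then keep only the first three pieces.
sentences = re.split(r'(?<=[.!?]) ', txt)[:3]
print(' '.join(sentences))
# → First sentence. Second sentence! Third sentence?
```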

Hope this helps!

