
How can I use BeautifulSoup to select the context words/characters around an <a> tag?

Python

Smart猫小萌 2022-01-05 19:49:26

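I am using BeautifulSoup to process HTML from a web crawler. The HTML has been run through a filter that "simplifies" it, stripping and replacing tags so that the document contains only <html>, <body>, <div>, and <a> tags plus visible text. I currently have a function that extracts the URLs and anchor text from these pages. In addition, I would like to extract the N "context words" that precede and follow the <a> tag of each link. For example, given the following document:

<html><body><div>This is <a href="www.example.com">a test</a><div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.<div></div></body></html>

then for N=8 I would like to get the following 8 "context words" for each link:

'www.example.com' --> ('This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog')

'www.petfood.com' --> ('fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad')

The first link (www.example.com) has only two words before it until the start of the document is reached, so those two words are returned together with the 6 words that follow the <a> tag, making up N=8. Note also that the returned words cross the boundary of the <div> that contains the <a> tag.

The second link (www.petfood.com) has N/2 = 4 words before it and 4 words after it, so those are returned as context. In other words, where possible, the N words are split evenly between the words before and after the <a> tag.

I know how to do this when the text sits in the same <div> as the link, but I can't figure out how to do it across <div> boundaries. Basically, for the purpose of extracting "context words", I want to treat the document as a single block of visible text with links, ignoring the containing divs.

How can I extract the text around <a> tags like this using BeautifulSoup? For simplicity, I would even be satisfied with an answer that returns just N characters of visible text before/after the tag (I can handle the tokenizing/splitting myself).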


1 Answer

qq_笑_17


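Here is a function that takes the whole HTML document and N as input and, for every occurrence of an <a> element, creates a tuple with the link URL as the first element and the list of N context words as the second. It returns the tuples in a list.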

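from bs4 import BeautifulSoup

def getContext(html, n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    n_side = n // 2  # target number of words on each side of the link
    for i in soup.find_all("a"):
        # Visible text of the whole document, with newlines flattened to spaces
        text = soup.text.replace('\n', ' ')
        # Split once, at the first occurrence of the link text
        context_before, _, context_after = text.partition(i.text)
        # str.split() with no argument discards empty strings
        words_before = context_before.split()
        words_after = context_after.split()
        if len(words_after) >= n_side:
            # Enough words after the link: take up to n_side before, fill the rest after
            words_before = words_before[-n_side:]
            words_after = words_after[:(n - len(words_before))]
        else:
            # Too few words after the link: take them all, fill the rest before
            words_after = words_after[:n_side]
            words_before = words_before[-(n - len(words_after)):]
        output.append((i["href"], words_before + words_after))
    return output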

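The function parses the HTML with BeautifulSoup and finds all <a> elements. For each result, it retrieves only the text of the document (using soup.text) and replaces any newline characters with spaces. The whole text is then split in two, using the link text as the separator. Each side is broken into a list of words, with empty entries dropped, and sliced so that at most N context words are extracted in total.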
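The slicing step described above can be isolated into a small helper. This is only a sketch: the name balance_context and its signature are hypothetical and not part of the original answer. It assumes the words before and after the link have already been collected into two lists.

```python
def balance_context(words_before, words_after, n):
    """Return up to n context words, split as evenly as possible around a link.

    Hypothetical helper for illustration; not part of the original answer.
    """
    half = n // 2
    # Start by taking at most half of the budget from the left side.
    take_before = min(half, len(words_before))
    # Give whatever budget remains to the right side.
    take_after = min(n - take_before, len(words_after))
    # If the right side ran short, hand the unused budget back to the left side.
    take_before = min(n - take_after, len(words_before))
    before = words_before[len(words_before) - take_before:]
    after = words_after[:take_after]
    return before + after


# First link from the question: only two words precede it.
print(balance_context(["This", "is"],
                      "There was a big fluffy dog outside the".split(), 8))
# ['This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog']
```

Computing explicit counts instead of slicing with a negative index also avoids the corner case where a slice such as words[-0:] returns the whole list.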

For example:

html = '''<html><body>
<div>This is <a href="www.example.com">a test</a>
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.<div>
</div>
</body></html>'''

print(*getContext(html, 8), sep='\n')

Output:

('www.example.com', ['This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog'])

('www.petfood.com', ['fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad'])

Demo: https://repl.it/@glhr/55609756-link-context

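Edit: note that one flaw of this implementation is that it uses the link text as the separator between "before" and "after". This can be a problem if the link text also appears somewhere in the HTML document before the link itself.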
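To make the flaw concrete, here is a minimal sketch that uses plain strings only. The variable names are illustrative; the text is the visible text of a document in which the word "test" occurs both before the link and as the link text.

```python
# Visible text of: <div>This test is <a href="www.example.com">test</a></div>
text = "This test is test"
link_text = "test"

# Splitting on the link text cuts at the FIRST "test", not at the link itself.
parts = text.split(link_text)
print(parts)           # ['This ', ' is ', '']
print(repr(parts[0]))  # 'This '

# The text that actually precedes the link is 'This test is ',
# so the word "is" ends up on the wrong side and "test" is lost.
```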

A simple workaround is to add special characters to the link text to make it unique, for example:

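def getContext(html, n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.find_all("a"):
        # Wrap the link text in a marker that is unlikely to occur elsewhere.
        # Note: i.string is None when the <a> has more than one child node.
        i.string.replace_with(f"[[[[{i.text}]]]]")
        # rest of code here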

This turns <div>This test is <a href="www.example.com">test</a> into <div>This test is <a href="www.example.com">[[[[test]]]]</a>.
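A different way to sidestep the separator problem is to locate each link by its position in the document rather than by its text. The following is a rough sketch of that idea, not the original answer's method: the function name get_context_by_position is hypothetical, and it assumes the simplified HTML described in the question (no comments or scripts, no nested <a> tags).

```python
from bs4 import BeautifulSoup, NavigableString


def get_context_by_position(html, n):
    """Hypothetical alternative: find each link's words by position, not by text."""
    soup = BeautifulSoup(html, "html.parser")
    words = []   # every visible word, in document order
    owners = []  # for each word: the enclosing <a> tag, or None
    for node in soup.descendants:
        if isinstance(node, NavigableString):
            owner = node.find_parent("a")
            for word in node.split():
                words.append(word)
                owners.append(owner)

    output = []
    for link in soup.find_all("a", href=True):
        # Compare by identity: two links with identical markup compare equal with ==.
        idx = [k for k, owner in enumerate(owners) if owner is link]
        if not idx:
            continue  # link without visible text
        before, after = words[:idx[0]], words[idx[-1] + 1:]
        # Split the budget of n words as evenly as the document allows.
        take_before = min(n // 2, len(before))
        take_after = min(n - take_before, len(after))
        take_before = min(n - take_after, len(before))
        context = before[len(before) - take_before:] + after[:take_after]
        output.append((link["href"], context))
    return output


html = '<div>This test is <a href="www.example.com">test</a> again</div>'
print(get_context_by_position(html, 4))
# [('www.example.com', ['This', 'test', 'is', 'again'])]
```

Because the link is identified by the tag object that owns its words, repeated link text elsewhere in the document no longer affects where the text is cut, and no marker characters need to be injected into the document.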

2022-01-05
