
Trying to strip HTML formatting from the source column of a DataFrame

Python

德玛西亚99 2023-09-12 19:04:16

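I have a DataFrame containing the sources of tweets. Each source is formatted like this: <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>. I'm trying to find a way to strip the HTML and keep the URL. I'm not very familiar with regular expressions and haven't been able to find a solution. Any help would be great.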


3 Answers

手掌心

Contributed 1942 experience points, received 3+ upvotes

You can get the URL by first turning the tag into a BeautifulSoup object. If it is already a BeautifulSoup object, you can call this on it directly:

.find("a").get("href")

If it is not, parse the string into a BeautifulSoup object first:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

a_tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
soup = BeautifulSoup(a_tag, "html5lib")  # pip install html5lib
print(soup.find("a").get("href"))
# output -> http://twitter.com/download/iphone
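If installing html5lib or BeautifulSoup is not an option, the same href lookup can be sketched with the standard library's html.parser module. This is only an illustration, not part of the answer above: the names HrefCollector and first_href are made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def first_href(html):
    """Return the first href found in html, or None if there is none."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs[0] if parser.hrefs else None


a_tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(first_href(a_tag))
# output -> http://twitter.com/download/iphone
```

Returning None for strings without a link makes the helper safe to use on rows that hold plain text rather than an anchor tag.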

Then use this function to strip the HTML, leaving only the text:

import re

def remove_html_tags(raw_html):
    # Non-greedy match of everything between "<" and ">", i.e. one tag at a time
    cleanr = re.compile("<.*?>")
    clean_text = re.sub(cleanr, "", raw_html)
    return clean_text

output = remove_html_tags(a_tag)
print(output)
# output -> Twitter for iPhone
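Since the question is about a whole DataFrame column rather than a single string, here is a minimal sketch of applying the same two ideas (pull out the href, strip the tags) with pandas string methods. The column name "source" and the sample rows are assumptions, and the regular expressions only cover double-quoted href attributes like the one in the question.

```python
import pandas as pd

# Hypothetical frame; the column name "source" is an assumption.
df = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    ]
})

# Capture whatever sits between href=" and the next double quote.
df["source_url"] = df["source"].str.extract(r'href="([^"]*)"', expand=False)

# Drop every <...> tag, leaving only the visible text.
df["source_text"] = df["source"].str.replace(r"<[^>]*>", "", regex=True)

print(df[["source_url", "source_text"]])
```

Rows without an href come back as NaN in source_url, so they can be filtered or filled afterwards instead of raising an error.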

Answered 2023-09-12

BIG阳

Contributed 1859 experience points, received 6+ upvotes

You can use the Python urlextract module to extract URLs from any string:

from urlextract import URLExtract

text = '''
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
'''
text = text.replace(' ', '').replace('=','')
extractor = URLExtract()
print(extractor.find_urls(text))

Output:

['http://twitter.com/download/iphone']

Answered 2023-09-12

慕姐4208626

Contributed 1852 experience points, received 7+ upvotes

You can split the string on the double-quote character (") and take the second element:

.split('"')[1]

https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split
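To make the split approach concrete, and to show where it can go wrong, here is a small sketch. The helper name href_by_split and the reordered sample tag are made up for illustration. The approach relies on href being the first quoted attribute in the tag.

```python
def href_by_split(tag):
    """Return the text between the first pair of double quotes, or None."""
    parts = tag.split('"')
    # A string with at least one quoted value splits into 3 or more parts.
    return parts[1] if len(parts) >= 3 else None


a_tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(href_by_split(a_tag))
# output -> http://twitter.com/download/iphone

# Hypothetical tag with the attributes in a different order.
reordered = '<a rel="nofollow" href="http://twitter.com/download/iphone">Twitter for iPhone</a>'
print(href_by_split(reordered))
# output -> nofollow
```

If every source value follows the exact format shown in the question, the split is the shortest option. If the attribute order can vary, one of the parsing approaches in the other answers is safer.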

Answered 2023-09-12
