
Scrape the main image instead of the thumbnail

Python

一只斗牛犬 2023-06-20 16:33:30

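    import re
    import requests

    # Note: headers, all_tab_data and main_url are used below but were
    # not defined in the posted snippet.
    root_tag = ["article", {"class": "sorted-article"}]
    image_tag = ["img", {"": ""}, "src"]

    session = requests.Session()
    response = session.get("https://phys.org/earth-news/", headers=headers)
    webContent = response.content

    for div in all_tab_data:
        image_url = None
        div_img = str(div)
        # Look for any absolute image URL in the element's markup first.
        match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
        if match is not None:
            image_url = match.group(0)
        else:
            # Otherwise fall back to the src attribute of the first <img>.
            image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
        if image_url is not None:
            # Prefix root-relative paths ("/...") with the site URL,
            # but leave protocol-relative ones ("//...") alone.
            if image_url[0] == '/' and image_url[1] != '/':
                image_url = main_url + image_url

My image URL output is output_url, but the actual URL of the image is actual_url. How can I scrape the main image?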


2 Answers

吃鸡游戏

Contributed 1829 experience points, received more than 7 likes

Use BeautifulSoup to open each news article page and read the image from there:

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

    with requests.Session() as session:
        # Send the browser-like User-Agent with every request in this session.
        session.headers.update(headers)

        soup = BeautifulSoup(session.get("https://phys.org/earth-news/").text, "lxml")
        # Collect the link to each article's own page.
        news_list = [link.get("href") for link in soup.select('.news-link')]

        for url in news_list:
            soup = BeautifulSoup(session.get(url).text, "lxml")
            # Look for the article image inside the .article-img container.
            container = soup.select_one(".article-img")
            img = container.select_one('img') if container else None
            if img:
                print(url, img.get("src"))
            else:
                print(url, "This article doesn't contain an image")
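The src value printed above may be absolute, root-relative, or protocol-relative, which is the case the question handles by checking the leading '/' by hand. A minimal sketch of the same normalization, assuming the standard library's urllib.parse.urljoin is acceptable here; the helper name absolutize and the sample paths in the usage note are hypothetical, not taken from the site:

```python
from urllib.parse import urljoin

# Page the image links were scraped from (taken from the question).
BASE_URL = "https://phys.org/earth-news/"


def absolutize(src, base=BASE_URL):
    """Return an absolute URL for an <img> src value, or None if it is empty."""
    if not src:
        return None
    # urljoin leaves absolute URLs untouched and resolves both
    # root-relative ("/img/a.jpg") and protocol-relative
    # ("//host/a.jpg") values against the base page URL.
    return urljoin(base, src.strip())
```

For example, absolutize("/img/a.jpg") resolves against the site root, while an already absolute URL is returned unchanged.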

Answered 2023-06-20

慕神8447489

Contributed 1780 experience points, received more than 1 like

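Use BeautifulSoup to extract the image links directly from the listing page. The thumbnails there carry their URL in the data-src attribute with a /175u/ size segment; replacing that segment with /800/ gives the larger image: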

    import requests
    from bs4 import BeautifulSoup

    url = 'https://phys.org/earth-news/'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

    # Select every <img> inside a .sorted-article element that has a data-src attribute.
    for img in soup.select('.sorted-article img[data-src]'):
        # Swap the thumbnail size segment for the larger one.
        print(img['data-src'].replace('/175u/', '/800/'))

Prints:

    https://scx1.b-cdn.net/csz/news/800/2020/biofuels.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/waterscarcity.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/soilerosion.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/hydropowerdam.jpg
    https://scx1.b-cdn.net/csz/news/800/2019/flood.jpg
    https://scx1.b-cdn.net/csz/news/800/2018/1-emissions.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/globalforest.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/fleeingthecl.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/watersecurity.jpg
    https://scx1.b-cdn.net/csz/news/800/2019/2-water.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/japaneseexpe.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/6-scientistsco.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/housescollap.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/soil.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/32-researcherst.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/2-nasatracking.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/thelargersec.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/4-nasasterrasa.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/howtorecycle.jpg
    https://scx1.b-cdn.net/csz/news/800/2020/newtoolstrac.jpg
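The size swap in this answer can be pulled into a small helper so that URLs without the thumbnail segment pass through unchanged. This is only a sketch: the function name full_size_url is hypothetical, and the /175u/ and /800/ segments are simply the ones used in the answer, so other sizes on the site are an untested assumption.

```python
def full_size_url(thumb_url, thumb_segment="/175u/", full_segment="/800/"):
    """Swap the thumbnail size segment of an image URL for the larger one."""
    # Leave the URL alone if it does not contain the thumbnail segment,
    # for example when it already points at the larger image.
    if thumb_segment not in thumb_url:
        return thumb_url
    # Replace only the first occurrence so the rest of the path is untouched.
    return thumb_url.replace(thumb_segment, full_segment, 1)
```

Inside the loop from the answer this would be called as full_size_url(img['data-src']).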

Answered 2023-06-20
