
Scraping the main image instead of the thumbnail


一只斗牛犬 2023-06-20 16:33:30
import re
import requests

root_tag = ["article", {"class": "sorted-article"}]
image_tag = ["img", {"": ""}, "src"]

# headers, all_tab_data and main_url are defined elsewhere in the script (not shown)
session = requests.Session()
response = session.get("https://phys.org/earth-news/", headers=headers)
webContent = response.content

for div in all_tab_data:
    image_url = None
    div_img = str(div)
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    if match is not None:
        image_url = match.group(0)
    else:
        image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    if image_url is not None:
        if image_url[0] == '/' and image_url[1] != '/':
            image_url = main_url + image_url

My image URL output is output_url, but the actual URL of the image is actual_url. How can I scrape the main image?

2 Answers

吃鸡游戏

Contributed 1829 experience points, received more than 7 likes

Use BeautifulSoup to open each news article page and read the image from there:


import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

with requests.Session() as session:
    session.headers.update(headers)
    # Collect the article links from the listing page
    soup = BeautifulSoup(session.get("https://phys.org/earth-news/").text, "lxml")
    news_list = [news_div.get("href") for news_div in soup.select('.news-link')]
    for url in news_list:
        # Open each article page and look for its main image container
        soup = BeautifulSoup(session.get(url).text, "lxml")
        img = soup.select_one(".article-img")
        if img:
            print(url, img.select_one('img').get("src"))
        else:
            print(url, "This article has no image")

Answered 2023-06-20
慕神8447489

Contributed 1780 experience points, received more than 1 like

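Use BeautifulSoup to extract the image links. The code reads each thumbnail's data-src attribute on the listing page and replaces the /175u/ path segment with /800/ to get the larger image: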


import requests
from bs4 import BeautifulSoup

url = 'https://phys.org/earth-news/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for img in soup.select('.sorted-article img[data-src]'):
    print(img['data-src'].replace('/175u/', '/800/'))

Prints:


https://scx1.b-cdn.net/csz/news/800/2020/biofuels.jpg
https://scx1.b-cdn.net/csz/news/800/2020/waterscarcity.jpg
https://scx1.b-cdn.net/csz/news/800/2020/soilerosion.jpg
https://scx1.b-cdn.net/csz/news/800/2020/hydropowerdam.jpg
https://scx1.b-cdn.net/csz/news/800/2019/flood.jpg
https://scx1.b-cdn.net/csz/news/800/2018/1-emissions.jpg
https://scx1.b-cdn.net/csz/news/800/2020/globalforest.jpg
https://scx1.b-cdn.net/csz/news/800/2020/fleeingthecl.jpg
https://scx1.b-cdn.net/csz/news/800/2020/watersecurity.jpg
https://scx1.b-cdn.net/csz/news/800/2019/2-water.jpg
https://scx1.b-cdn.net/csz/news/800/2020/japaneseexpe.jpg
https://scx1.b-cdn.net/csz/news/800/2020/6-scientistsco.jpg
https://scx1.b-cdn.net/csz/news/800/2020/housescollap.jpg
https://scx1.b-cdn.net/csz/news/800/2020/soil.jpg
https://scx1.b-cdn.net/csz/news/800/2020/32-researcherst.jpg
https://scx1.b-cdn.net/csz/news/800/2020/2-nasatracking.jpg
https://scx1.b-cdn.net/csz/news/800/2020/thelargersec.jpg
https://scx1.b-cdn.net/csz/news/800/2020/4-nasasterrasa.jpg
https://scx1.b-cdn.net/csz/news/800/2020/howtorecycle.jpg
https://scx1.b-cdn.net/csz/news/800/2020/newtoolstrac.jpg
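One caveat with a bare str.replace is that it returns the input unchanged when '/175u/' is absent, so a thumbnail URL could slip through unnoticed. A minimal sketch of a more defensive version follows. The helper name to_large_image is hypothetical, and the thumbnail URL in the example is inferred from the answer's replace call rather than copied from the site.

```python
def to_large_image(thumb_url, thumb_segment="/175u/", large_segment="/800/"):
    """Swap the thumbnail size segment for the larger one.

    Hypothetical helper: returns None when the thumbnail segment is not
    present, so the caller can tell that no rewrite took place.
    """
    if thumb_segment not in thumb_url:
        return None
    # Replace only the first occurrence to avoid touching the file name
    return thumb_url.replace(thumb_segment, large_segment, 1)


# Thumbnail URL assumed from the '/175u/' -> '/800/' rewrite in the answer
print(to_large_image("https://scx1.b-cdn.net/csz/news/175u/2020/soil.jpg"))
```

Whether every image on the site is also served under /800/ is not confirmed by the thread, so checking the response status before downloading would be a sensible extra step.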


Answered 2023-06-20