Unable to get a div tag with Beautiful Soup in Python

翻阅古今 2023-12-29 17:02:24
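I am trying to download all the Pokémon images available on the official website, because I want high-quality images. Here is the code I wrote:

from bs4 import BeautifulSoup as bs4
import requests

request = requests.get('https://www.pokemon.com/us/pokedex/')
soup = bs4(request.text, 'html')
print(soup.findAll('div',{'class':'container       pokedex'}))

The output is:

[]

Am I doing something wrong? Also, is it legal to scrape the official website? Is there a tag or anything else that indicates this? Thanks.

PS: I am new to BS and HTML.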

2 Answers

噜噜哒


The images are loaded dynamically, so you need to use selenium to scrape them. Here is the complete code to do that:


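from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import requests

driver = webdriver.Chrome()
driver.get('https://www.pokemon.com/us/pokedex/')

# Give the page's JavaScript time to render the Pokémon cards
time.sleep(4)

# Selenium 4.3+ removed the find_element(s)_by_* helpers; use find_element(s)(By...) instead
li_tags = driver.find_elements(By.CLASS_NAME, 'animating')[:-3]

# li_num is the 1-based position of each <li> in the absolute XPath below
for li_num, li in enumerate(li_tags, start=1):
    img_link = li.find_element(By.XPATH, './/img').get_attribute('src')
    name = li.find_element(By.XPATH, f'/html/body/div[4]/section[5]/ul/li[{li_num}]/div/h5').text

    r = requests.get(img_link)

    with open(f"D:\\{name}.png", "wb") as f:
        f.write(r.content)

driver.quit()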

Output:

12 Pokémon images. Here are the first two:

Image 1:

https://img1.sycdn.imooc.com/658e8c5a0001006c02140216.jpg

Image 2:

https://img1.sycdn.imooc.com/658e8c630001f21702170208.jpg
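
The fixed `time.sleep(4)` above is a guess at how long the page needs to render. A polling wait is usually more robust: it returns as soon as the elements exist and fails loudly if they never appear. Selenium offers this as `WebDriverWait(driver, timeout).until(...)`; the sketch below shows the same idea in plain Python so it can run without a browser. `wait_until` and its parameters are hypothetical names for illustration, not part of Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.5, clock=time.monotonic, sleep=time.sleep):
    """Call predicate() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value; raises TimeoutError if it never appears.
    clock and sleep are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll)
```

With a Selenium 4 driver this could be called as `li_tags = wait_until(lambda: driver.find_elements(By.CLASS_NAME, 'animating'))`, since `find_elements` returns an empty (falsy) list until the cards are rendered.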

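Also, I noticed there is a "Load more" button at the bottom of the page. Clicking it loads more images, and after clicking it we have to keep scrolling down for the rest to load. If I remember correctly, the site has 893 images in total. To scrape all 893 images, you can use the following code: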


from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import requests

driver = webdriver.Chrome()
driver.get('https://www.pokemon.com/us/pokedex/')
time.sleep(3)

# Click the "Load more" button through JavaScript
load_more = driver.find_element(By.XPATH, '//*[@id="loadMore"]')
driver.execute_script("arguments[0].click();", load_more)

# Scroll to the bottom repeatedly until the page height stops growing
scroll_js = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"
len_of_page = driver.execute_script(scroll_js)
while True:
    last_count = len_of_page
    time.sleep(1.5)
    len_of_page = driver.execute_script(scroll_js)
    if last_count == len_of_page:
        break

# Selenium 4.3+ removed the find_element(s)_by_* helpers; use find_element(s)(By...) instead
li_tags = driver.find_elements(By.CLASS_NAME, 'animating')[:-3]

# li_num is the 1-based position of each <li> in the absolute XPath below
for li_num, li in enumerate(li_tags, start=1):
    img_link = li.find_element(By.XPATH, './/img').get_attribute('src')
    name = li.find_element(By.XPATH, f'/html/body/div[4]/section[5]/ul/li[{li_num}]/div/h5').text

    r = requests.get(img_link)

    with open(f"D:\\{name}.png", "wb") as f:
        f.write(r.content)

driver.quit()
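
The scroll loop above stops when two consecutive height readings match, but it has no upper bound, so a page that keeps growing would loop forever. Below is a minimal sketch of the same "scroll until the height is stable" idea with a safety limit. `scroll_until_stable`, `scroll_and_measure`, and `max_rounds` are hypothetical names chosen for this example; the browser call is passed in as a function so the logic can be run without Selenium.

```python
import time


def scroll_until_stable(scroll_and_measure, pause=1.5, max_rounds=500, sleep=time.sleep):
    """Call scroll_and_measure() until it returns the same height twice in a row.

    scroll_and_measure should scroll to the bottom and return the page height.
    Returns the final height, or raises RuntimeError after max_rounds attempts
    so an endlessly growing page cannot loop forever.
    """
    last_height = scroll_and_measure()
    for _ in range(max_rounds):
        sleep(pause)  # give newly requested items time to load
        height = scroll_and_measure()
        if height == last_height:
            return height
        last_height = height
    raise RuntimeError(f"page height still changing after {max_rounds} rounds")
```

With a live driver, the callable could be `lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")`, which is the same script the loop above runs.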


2023-12-29

元芳怎么了


This may be easier to do if you check the browser's Network tab first:


import time
import requests

endpoint = "https://www.pokemon.com/us/api/pokedex/kalos"

# The endpoint returns the metadata for every entry as JSON
data = requests.get(endpoint).json()

# Keep only the fields needed to save each picture
items = [{"name": item["name"], "link": item["ThumbnailImage"]} for item in data]

# Remove duplicates: dicts are unhashable, so compare them as tuples of their items
unique_items = [dict(t) for t in {tuple(item.items()) for item in items}]
assert len(unique_items) == 893  # the count reported in this thread

for pokemon in unique_items:
    response = requests.get(pokemon["link"])
    time.sleep(1)  # pause between requests to go easy on the server
    with open(f"{pokemon['name']}.png", "wb") as f:
        f.write(response.content)
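
Two details in the download loop above may be worth hardening. Deduplicating through a set loses the original order, and a name containing characters such as `:` or `/` is not a valid file name on every system. The sketch below is one way to handle both with the standard library. `dedupe_items` and `safe_filename` are hypothetical helper names, and the sample records use made-up `example.com` links rather than data from the API.

```python
import re


def dedupe_items(items):
    """Drop repeated dicts while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        key = tuple(sorted(item.items()))  # hashable, order-independent fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def safe_filename(name, replacement="_"):
    """Replace characters that are not allowed in file names on Windows or POSIX."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', replacement, name).strip(" .")
    return cleaned or "unnamed"


# Made-up sample records, shaped like the items list above
sample = [
    {"name": "Bulbasaur", "link": "https://example.com/001.png"},
    {"name": "Type: Null", "link": "https://example.com/772.png"},
    {"name": "Bulbasaur", "link": "https://example.com/001.png"},
]
for item in dedupe_items(sample):
    print(safe_filename(item["name"]) + ".png")
```

In the loop above, the file could then be opened with `safe_filename(pokemon['name'])` instead of the raw name.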


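
On the question of whether the site permits scraping: a common place to look is the `robots.txt` file at the site root, which states the site's crawling preferences (not legal permission), alongside the site's terms of use. Python's standard library can evaluate such rules with `urllib.robotparser`. The sketch below parses a made-up rule set. The rules, the `example.com` URLs, and the `my-image-bot` user agent are assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-image-bot"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/us/pokedex/"))
print(is_allowed("https://example.com/private/data"))
```

For a live site, calling `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()` would load the real file instead of the sample string.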
2023-12-29