Extracting character roles from Tom Holland's IMDB page with BeautifulSoup

慕丝7291255 2023-04-18 17:09:12
I extracted the following data from Tom Holland's IMDB page and stored it as "movie_contents":

[<div class="filmo-row odd" id="actor-tt10872600"> <span class="year_column">  2021 </span> <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b> (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>) <br/> Peter Parker / Spider-Man </div>,
<div class="filmo-row even" id="actor-tt1464335"> <span class="year_column">  2021 </span> <b><a href="/title/tt1464335/">Uncharted</a></b> (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>) <br/> Nathan Drake </div>,
<div class="filmo-row odd" id="actor-tt2076822"> <span class="year_column">  2021 </span> <b><a href="/title/tt2076822/">Chaos Walking</a></b> (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>) <br/> Todd Hewitt </div>,
<div class="filmo-row even" id="actor-tt9130508"> <span class="year_column">  2020/I </span> <b><a href="/title/tt9130508/">Cherry</a></b> (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>) <br/> Nico Walker </div>,
<div class="filmo-row odd" id="actor-tt7395114"> <span class="year_column">  2020 </span> <b><a href="/title/tt7395114/">The Devil All the Time</a></b> (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>) <br/> Arvin Russell </div>,
<div class="filmo-row even" id="actor-tt7146812"> <span class="year_column">  2020/I </span> <b><a href="/title/tt7146812/">Onward</a></b> <br/> Ian Lightfoot (voice) </div>,
<div class="filmo-row odd" id="actor-tt6673612"> <span class="year_column">  2020 </span> <b><a href="/title/tt6673612/">Dolittle</a></b> <br/> Jip (voice) </div>]

My question: how do I extract all of the character names, such as "Peter Parker / Spider-Man", "Nathan Drake", "Todd Hewitt", and so on?

2 Answers

白板的微信

Contributed 1883 experience points, received over 3 likes

This script prints all of the actor's roles:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/name/nm4043618/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each role is the text node that follows the <br/> inside a filmography row.
seen = set()
for br in soup.select('#filmo-head-actor + div .filmo-row > br'):
    role = br.find_next(string=True).strip()
    # Skip roles that were already printed.
    if role not in seen:
        seen.add(role)
        print(role)
Prints:

Peter Parker / Spider-Man
Nathan Drake
Todd Hewitt
Nico Walker
Arvin Russell
Ian Lightfoot (voice)
Jip (voice)
Walter (voice)
Samuel Insull
Brother Diarmuid - The Novice
Jack Fawcett
Bradley Baker
Thomas Nickerson
Tom
Gregory Cromwell
Former Billy (Encore) (uncredited)
Isaac
Eddie (voice)
Boy
Lucas
Shô (UK version, voice)
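
The script above depends on the live page, whose markup may have changed since this answer was written. As a minimal offline sketch, the same `br`-based lookup can be tried against two rows copied from the question. The names `SAMPLE_HTML` and `extract_roles` are hypothetical and introduced here only for illustration; the shorter selector `.filmo-row > br` is an assumption that fits the sample, not the full page.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question's HTML (assumed representative of the page).
SAMPLE_HTML = """
<div class="filmo-row odd" id="actor-tt1464335">
<span class="year_column"> 2021 </span>
<b><a href="/title/tt1464335/">Uncharted</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
<br/> Nathan Drake </div>
<div class="filmo-row even" id="actor-tt7146812">
<span class="year_column"> 2020/I </span>
<b><a href="/title/tt7146812/">Onward</a></b>
<br/> Ian Lightfoot (voice) </div>
"""


def extract_roles(html):
    """Return de-duplicated role names in page order (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    roles = []
    for br in soup.select(".filmo-row > br"):
        # The first text node after <br/> holds the role name.
        role = br.find_next(string=True).strip()
        if role and role not in roles:
            roles.append(role)
    return roles


print(extract_roles(SAMPLE_HTML))
```

Because the role is read relative to the `<br/>` tag, rows with and without the production-status link are handled the same way.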

Edit: To collect the roles into a DataFrame, you can do this:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.imdb.com/name/nm4043618/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Collect each distinct role once, in the order it appears on the page.
seen = set()
all_data = []
for br in soup.select("#filmo-head-actor + div .filmo-row > br"):
    role = br.find_next(string=True).strip()
    if role not in seen:
        seen.add(role)
        all_data.append(role)

df = pd.DataFrame(all_data, columns=["Role"])
print(df)

Prints:

                                  Role
0            Peter Parker / Spider-Man
1                         Nathan Drake
2                          Todd Hewitt
3                          Nico Walker
4                        Arvin Russell
5                Ian Lightfoot (voice)
6                          Jip (voice)
7                       Walter (voice)
8                        Samuel Insull
9        Brother Diarmuid - The Novice
10                        Jack Fawcett
11                       Bradley Baker
12                    Thomas Nickerson
13                                 Tom
14                    Gregory Cromwell
15  Former Billy (Encore) (uncredited)
16                               Isaac
17                       Eddie (voice)
18                                 Boy
19                               Lucas
20             Shô (UK version, voice)


2023-04-18
HUX布斯

Contributed 1876 experience points, received over 6 likes

Try:

from bs4 import BeautifulSoup

html = '''<html>
 <div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
 </div>, <div class="filmo-row even" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
 </div>, <div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
 </div>, <div class="filmo-row even" id="actor-tt9130508">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt9130508/">Cherry</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Nico Walker
 </div>, <div class="filmo-row odd" id="actor-tt7395114">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt7395114/">The Devil All the Time</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
 <br/>
 Arvin Russell
 </div>, <div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
 </div>, <div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
 </div>
 '''

soup = BeautifulSoup(html, 'html.parser')

# Select every filmography row; 'div.filmo-row.odd' would skip the even rows.
for div in soup.select('div.filmo-row'):
    # Direct text nodes only: whitespace, the "(" and ")" around the status link, and the role.
    text = div.find_all(string=True, recursive=False)
    # Keep only nodes longer than one character after stripping, which leaves the role.
    print(*[t.strip() for t in text if len(t.strip()) > 1])

Prints:

Peter Parker / Spider-Man
Nathan Drake
Todd Hewitt
Nico Walker
Arvin Russell
Ian Lightfoot (voice)
Jip (voice)
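
If the year and title are wanted alongside each role, the same row markup can be read field by field. The following is only a sketch, assuming every row contains a `span.year_column`, a `b > a` title link, and a `<br/>` followed by the role text, as in the question's sample; `ROW_HTML` and `parse_row` are hypothetical names used for illustration.

```python
from bs4 import BeautifulSoup

# One row copied from the question's HTML.
ROW_HTML = """
<div class="filmo-row even" id="actor-tt9130508">
<span class="year_column"> 2020/I </span>
<b><a href="/title/tt9130508/">Cherry</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
<br/> Nico Walker </div>
"""


def parse_row(div):
    """Return year, title, and role for one filmography row (hypothetical helper)."""
    year = div.select_one("span.year_column").get_text(strip=True)
    title = div.select_one("b > a").get_text(strip=True)
    # The role is the first text node after the row's <br/>.
    role = div.br.find_next(string=True).strip()
    return {"year": year, "title": title, "role": role}


soup = BeautifulSoup(ROW_HTML, "html.parser")
rows = [parse_row(div) for div in soup.select("div.filmo-row")]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` if a table is preferred.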


2023-04-18