
Web scraping and splitting the retrieved data into separate lines

Python

扬帆大鱼 2021-12-09 15:22:03

I am trying to collect event dates, times, and venues. They come out successfully, but the result is not reader-friendly. How can I get the date, time, and venue to display separately, like this:

- event
  Date:
  Time:
  Venue:
- event
  Date:
  Time:
  Venue:

I thought about splitting, but I ended up with a lot of [], which made it look even uglier. I also thought about stripping with my regular expression, but it doesn't seem to do anything. Any suggestions?

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urllib.request.urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()
soup = BeautifulSoup(responseData, 'lxml')

events_absFirst = soup.find_all("div",{"class": "ntu_event_summary_title_first"})

date_absAll = tr.find_all("div",{"class": "ntu_event_summary_date"})
events_absAll = tr.find_all("div",{"class": "ntu_event_summary_title"})

for first in events_absFirst:
    print('-',first.text.strip())
    print (' ',date)

for tr in soup.find_all("div",{"class":"ntu_event_detail"}):
    date_absAll = tr.find_all("div",{"class": "ntu_event_summary_date"})
    events_absAll = tr.find_all("div",{"class": "ntu_event_summary_title"})

    for events in events_absAll:
        events = events.text.strip()
    for date in date_absAll:
        date = date.text.strip('^Time.*')
    print ('-',events)
    print (' ',date)
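A note on the `strip('^Time.*')` call in the question: `str.strip` does not take a regular expression. Its argument is treated as a set of individual characters to trim from both ends, which is likely why it appeared to do nothing useful. The sketch below contrasts it with `re.sub` on a made-up sample string (the `raw` value is an assumption for illustration, not text taken from the page):

```python
import re

# Hypothetical sample of the text inside one date block.
raw = "Date : 14 Jan 2019 to 11 Apr 2019 Time : 9:00am to 8:00pm"

# str.strip trims leading/trailing characters that belong to the given set.
# '^Time.*' is read as the characters ^ T i m e . * and not as a pattern,
# so only the trailing "m" is removed here.
stripped = raw.strip('^Time.*')

# re.sub applies an actual regular expression: drop everything from "Time" on.
date_only = re.sub(r'\s*Time.*$', '', raw)

print(repr(stripped))
print(repr(date_only))
```

If the goal is to cut the text at a label, `re.sub`, `re.split`, or `str.partition('Time')` are the usual tools rather than `strip`.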


2 Answers

Qyouu

1786 experience points contributed, more than 11 upvotes received

You can iterate over the divs that contain the event information, store the results, and then print each one:

import re
import requests
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://www.ntu.edu.sg/events/Pages/default.aspx').text, 'html.parser')

# For each event block, take the title and the detail text ('N/A' when a div is missing).
results = [[getattr(i.find('div', {'class':re.compile('ntu_event_summary_title_first|ntu_event_summary_title')}), 'text', 'N/A'), getattr(i.find('div', {'class':'ntu_event_summary_detail'}), 'text', 'N/A')] for i in d.find_all('div', {'class':'ntu_event_articles'})]

# Pull the "Date : ...", "Time : ..." and "Venue: ..." pieces out of the detail text.
# A raw string keeps \s and \w from being read as string escapes.
new_results = [[a, re.findall(r'Date : .*?(?=\sTime)|Time : .*?(?=Venue)|Time : .*?(?=$)|Venue: [\w\W]+', b)] for a, b in results]

# Print each event title followed by one labelled line per field that was found.
print('\n\n'.join('-{}\n{}'.format(a, '\n'.join(f' {h}:{i}' for h, i in zip(['Date', 'Time', 'Venue'], b))) for a, b in new_results))

Output:

-7th ASEF Rectors' Conference and Students' Forum (ARC7)
 Date:Date : 29 Nov 2018 to 14 May 2019
 Time:Time : 9:00am to 5:00pm

-Be a Youth Corps Leader
 Date:Date : 1 Dec 2018 to 31 Mar 2019
 Time:Time : 9:00am to 5:00pm

-NIE Visiting Artist Programme January 2019
 Date:Date : 14 Jan 2019 to 11 Apr 2019
 Time:Time : 9:00am to 8:00pm
 Venue:Venue: NIE Art gallery

-Exercise Classes for You: Healthy Campus@NTU
 Date:Date : 21 Jan 2019 to 18 Apr 2019
 Time:Time : 6:00pm to 7:00pm
 Venue:Venue: The Wave @ Sports & Recreation Centre

-[eLearning Course] Information & Media Literacy (From January 2019)
 Date:Date : 23 Jan 2019 to 31 May 2019
 Time:Time : 9:00am to 5:00pm
 Venue:Venue: NTULearn

...
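The labels appear twice in the output above ("Date:Date :") because each regex match still contains its own label. As a possible variation, not part of the original answer, the sketch below uses one pattern with named groups so that only the values are captured and a missing field falls back to 'N/A'. The helper name `split_details`, the pattern `FIELD_RE`, and the sample strings are assumptions for illustration; the label spelling follows the output shown above.

```python
import re

# Hypothetical pattern: Date is required, Time and Venue are optional.
# Lazy groups stop as soon as the next label (or the end of the text) is reached.
FIELD_RE = re.compile(
    r'Date\s*:\s*(?P<date>.*?)'
    r'(?:\s*Time\s*:\s*(?P<time>.*?))?'
    r'(?:\s*Venue\s*:\s*(?P<venue>.*?))?\s*$',
    re.S,
)

def split_details(text):
    """Split 'Date : ... Time : ... Venue: ...' text into a dict of values."""
    m = FIELD_RE.search(text)
    if m is None:
        return {'Date': 'N/A', 'Time': 'N/A', 'Venue': 'N/A'}
    return {
        'Date': m.group('date') or 'N/A',
        'Time': m.group('time') or 'N/A',
        'Venue': m.group('venue') or 'N/A',
    }

# Sample strings modelled on the output above (not fetched from the site).
full = "Date : 14 Jan 2019 to 11 Apr 2019 Time : 9:00am to 8:00pm Venue: NIE Art gallery"
no_venue = "Date : 29 Nov 2018 to 14 May 2019 Time : 9:00am to 5:00pm"

for title, text in [("NIE Visiting Artist Programme January 2019", full),
                    ("Be a Youth Corps Leader", no_venue)]:
    fields = split_details(text)
    print('- ' + title)
    for label, value in fields.items():
        print('  {}: {}'.format(label, value))
```

With real page text, `split_details` would be applied to the second element of each pair in `results`; whether the live markup still uses these exact labels is not something this sketch can confirm.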

Answered 2021-12-09

米脂

1836 experience points contributed, more than 3 upvotes received

You can use requests and test the length of stripped_strings:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = requests.get(url_toscrape)
soup = BeautifulSoup(response.content, 'lxml')

# Titles: any element whose class attribute starts with 'ntu_event_summary_title'.
events = [item.text for item in soup.select("[class^='ntu_event_summary_title']")]
data = soup.select('.ntu_event_summary_date')

dates = []
times = []
venues = []

for item in data:
    # stripped_strings yields each text fragment with surrounding whitespace removed.
    strings = list(item.stripped_strings)
    if len(strings) == 3:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append(strings[2])
    elif len(strings) == 2:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append('N/A')
    elif len(strings) == 1:
        dates.append(strings[0])
        times.append('N/A')
        venues.append('N/A')

results = list(zip(events, dates, times, venues))
df = pd.DataFrame(results)
print(df)
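One thing to watch in the loop above: a block with zero or more than three strings appends nothing, so `dates`, `times`, and `venues` could fall out of step with `events` before the `zip`. A possible way around that, sketched here under the assumption that the fragments always arrive in date, time, venue order, is to pad or truncate every list to exactly three items. The helper name `pad_fields` and the sample data are hypothetical.

```python
def pad_fields(strings, size=3, filler='N/A'):
    """Return exactly `size` items: extras are dropped, gaps are filled."""
    return (list(strings) + [filler] * size)[:size]

# Made-up stand-ins for list(item.stripped_strings) on three event blocks.
samples = [
    ("Event A", ["Date : 14 Jan 2019", "Time : 9:00am", "Venue: NIE Art gallery"]),
    ("Event B", ["Date : 1 Dec 2018", "Time : 9:00am"]),
    ("Event C", []),
]

rows = []
for title, strings in samples:
    # One row per event, so every row always has four columns.
    date, time, venue = pad_fields(strings)
    rows.append((title, date, time, venue))

for row in rows:
    print(row)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=['Event', 'Date', 'Time', 'Venue'])` in place of the zipped lists.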

Answered 2021-12-09
