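I think your scraping logic is correct, but your loop performs a GET + POST on every iteration. Instead, you should perform a GET only on the first iteration and then issue a POST for each following iteration (if you want 1 iteration = 1 page).
An example: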
import requests
from bs4 import BeautifulSoup

res_url = 'https://www.brcdirectory.com/InternalSite//Siteresults.aspx?'
params = {
    'CountryId': '0',
    'CategoryId': '49bd499b-bc70-4cac-9a29-0bd1f5422f6f',
    'StandardId': '972f3b26-5fbd-4f2c-9159-9a50a15a9dde'
}
max_page = 20


def extract(page, soup):
    # Print the link of every result listed on the current page.
    for item_link in soup.select("h4 a.colorBlue"):
        print("for page {} - {}".format(page, item_link.get("href")))


def build_payload(page, soup):
    # Copy every named form input from the previous response;
    # inputs without a value are sent as an empty string.
    payload = {}
    for input_item in soup.select("input[name]"):
        payload[input_item["name"]] = input_item.get("value", "")
    # Ask the results grid to switch to the requested page.
    payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gv_Results"
    payload["__EVENTARGUMENT"] = "Page${}".format(page)
    payload["ctl00$ContentPlaceHolder1$ddl_SortValue"] = "SiteName"
    return payload


with requests.Session() as s:
    # range() excludes its upper bound, so add 1 to include max_page.
    for page in range(1, max_page + 1):
        if page > 1:
            # Later pages: POST the form state taken from the previous page.
            req = s.post(res_url, params=params, data=build_payload(page, soup))
        else:
            # First page: a plain GET.
            req = s.get(res_url, params=params)
        soup = BeautifulSoup(req.text, "lxml")
        extract(page, soup)
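
To see what build_payload actually sends, here is a minimal offline sketch of the same idea. It is not the code above: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without network access or third-party packages. SAMPLE_HTML and the field names inside it are hypothetical stand-ins for the real results page, which I have not inspected.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for the real results page.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="ctl00$Search">
  <input type="submit" value="Go">
</form>
"""


class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        if name:  # skip unnamed inputs such as the bare submit button
            self.fields[name] = attr_map.get("value") or ""


def build_payload(page, html):
    # Start from the form state found in the previous response.
    collector = InputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    # Then add the postback target and the requested page number.
    payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gv_Results"
    payload["__EVENTARGUMENT"] = "Page${}".format(page)
    return payload


payload = build_payload(2, SAMPLE_HTML)
print(payload)
```

The point is that each POST reuses the form fields from the previous response, which is why the loop has to keep the last soup around and why only the first request can be a plain GET.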
