Developing a Simple Crawler in Python: Study Notes


没文化_很可怕

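Starting from ASCII, China extended the encoding to cover Chinese and produced GB2312, which can represent more than 6,000 commonly used Chinese characters. There are far more characters than that, including traditional forms and assorted symbols, so GBK followed: it contains everything in GB2312 and adds many more characters. China is also a multi-ethnic country whose ethnic groups mostly have their own writing systems, and to represent those characters GBK was extended again into GB18030.

Other countries encoded their own languages in the same way, so all kinds of encodings appeared, and a computer without support for a given encoding could not interpret text written in it. Eventually the Unicode Consortium, working in step with ISO (whose ISO/IEC 10646 standard is kept synchronized with it), created Unicode, a character set large enough to hold every script and symbol in the world. As long as a computer supports Unicode, text in any language can be saved in a Unicode encoding and read correctly on other machines.

For storage and network transmission, Unicode has several encoding forms, of which UTF-8 and UTF-16 are the most common. They use 8-bit and 16-bit code units respectively, and a single character may take more than one unit.

This raises a question: if UTF-8 can represent so many characters and symbols, why do so many people in China still use GBK and similar encodings? Because UTF-8 takes more space for Chinese text: a common Chinese character occupies three bytes in UTF-8 but two in GBK. If nearly all of the intended users are Chinese, GBK is a workable choice. With today's hardware, however, disks are cheap and machines are fast enough to ignore this overhead, so the recommendation is for all web pages to use one encoding: UTF-8.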
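The size difference described above can be checked directly. The following is a minimal sketch, assuming Python 3 and a made-up sample string; it only compares how many bytes the same text occupies in GBK, UTF-8 and UTF-16.

```python
# Compare the encoded size of the same text in three encodings.
# The sample string is an assumption chosen for illustration.
text = "中文编码"  # four common Chinese characters

sizes = {name: len(text.encode(name)) for name in ("gbk", "utf-8", "utf-16-le")}
for name, size in sizes.items():
    print(name, size)

# ASCII text costs one byte per character in UTF-8.
ascii_size = len("abc".encode("utf-8"))
print("ascii in utf-8", ascii_size)
```

For Chinese-only text GBK is the more compact of the two, while for ASCII-heavy content such as HTML markup UTF-8 costs nothing extra.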

Source: Running the crawler and viewing the crawl results
2016-03-13
慕尼黑1193012

from baike_spider import url_manager, html_downloader, html_parser, \
    html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():  # while there are URLs waiting to be crawled
            new_url = self.urls.get_new_url()  # take one
            print('craw %d:%s' % (count, new_url))  # this form works in Python 2 and 3
            html_cont = self.downloader.download(new_url)
            new_urls, new_data = self.parser.parse(new_url, html_cont)
            self.urls.add_new_urls(new_urls)
            self.outputer.collect_data(new_data)
            if count == 1000:  # stop once 1000 pages have been crawled
                break
            count = count + 1
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "http://baike.baidu.com/view/21087.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)

Source: The scheduler
2018-03-22
Cuqi

Full code: http://www.imooc.com/opus/resource?opus_id=1932&tree=imooc%2Fbaike_spider

Source: The scheduler
2018-03-22

卷毛77

Example test code

import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup=BeautifulSoup(html_doc,'html.parser')

print('Get all links:')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get a specific link (the Lacie link):')
# link_node = soup.find('a', id="link2")
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Fuzzy-match the wanted content with a regular expression:')
# The r prefix makes this a raw string, so a backslash in the pattern
# is written once instead of being doubled.
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of a p paragraph (selected by class):')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())

Source: BeautifulSoup example test

2019-02-03

觉非夜

import re

from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the course


class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        # Collect every entry link on the page as an absolute URL.
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link['href']
            new_full_url = urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        # Extract the entry title and summary from the page.
        res_data = {}
        res_data['url'] = page_url
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        # The parser name is 'html.parser' (the original note had 'html.parse').
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

Source: The HTML parser (html_parser)
2018-03-22
喵喵喵233

Now that you have finished the videos, why not try writing one yourself?

Source: Course summary
2015-12-18
紫嫣yan 05:06

Python 3 version, completed with the help of classmates' notes:

# coding: utf-8
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

print("Method 1")
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print("Method 2")
request = urllib.request.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print("Method 3")
# Create a cookie container
cj = http.cookiejar.CookieJar()
# Create an opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# Install the opener for urllib
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(cj)
# print(response3.read())

Source: Python crawler urllib2 example code demo
2018-03-22
慕粉181004628

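Question: Why does output.html show escaped bytes instead of Chinese? The output is not Chinese text, for example: \xc2\xa0\n\xef\xbc\x88\xe8\x8b\xb1\xe5\x9b\xbd\xe5\x8f\x91\xe9\x9f\xb3\xef\xbc\x9a/\xcb\x88pa\xc9\xaa\xce\xb8\xc9\x99n/ \xe7\xbe\x8e\xe5\x9b\xbd\xe5\x8f\x91\xe9\x9f\xb3\xef\xbc\x9a/\xcb\x88pa\xc9\xaa\xce\xb8\xc9\x91\xcb\x90n/\xef\xbc\x89. How can this be fixed?

Answer: Judging by when you asked, you are probably using Python 3. After the output is written to the HTML page, any Chinese in the title and summary columns shows up as a byte string like b'anfdsfsfds'. The fix is to change two places in the outputer:
1. Specify the encoding when opening the file: fout = open('output.html', 'w', encoding='utf-8')
2. Write the content without encoding it: fout.write('<td>%s</td>' % data['title']) and fout.write('<td>%s</td>' % data['summary'])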
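The cause of the b'...' output can be reproduced without running the crawler. The following is a minimal sketch, assuming Python 3; the sample title and the temporary file path are made up for illustration and are not part of the course code.

```python
import os
import tempfile

title = "Python（计算机程序设计语言）"  # sample data, an assumption

# Python 3: formatting a bytes object with %s embeds its repr, b'...'
wrong = "<td>%s</td>" % title.encode("utf-8")

# Formatting the str itself keeps the readable text
right = "<td>%s</td>" % title

# Let open() do the encoding by passing encoding='utf-8'
path = os.path.join(tempfile.mkdtemp(), "output.html")
with open(path, "w", encoding="utf-8") as fout:
    fout.write(right)

with open(path, encoding="utf-8") as fin:
    round_trip = fin.read()

print(wrong)
print(round_trip)
```

Encoding belongs at the file boundary: the strings stay as str throughout, and only open() converts them to bytes.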

Source: Running the crawler and viewing the crawl results
2018-03-22
MOOCCCC

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Move one URL from the waiting set to the crawled set and return it.
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

Source: The URL manager
2016-03-27
GcsSloop

If you call response3.read() again after len(response3.read()), it produces no output, because the first read() has already consumed the response body.
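This behaviour comes from the response being a file-like stream. The following is a minimal sketch that uses io.BytesIO as a stand-in for the HTTP response (an assumption made to avoid a network call); both are read from a position that moves forward.

```python
import io

# io.BytesIO stands in for the object returned by urlopen() (an assumption).
response = io.BytesIO(b"<html>hello</html>")

first = response.read()   # consumes the whole body
second = response.read()  # nothing is left to read

body_length = len(first)
print(body_length, second)
```

To use the body more than once, store the result of the first read() in a variable and work with that variable.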

Source: Python crawler urllib2 example code demo
2016-01-05
凌雪舞罄

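For the complete code, see https://github.com/DaddySheng/Python_craw_test1/blob/master/Python3_craw_code.py
If the output looks garbled in the browser, right-click, change the encoding, and set it to auto-detect. The garbling happens because the default encoding is GBK.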
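The garbling can be illustrated by decoding UTF-8 bytes as GBK. The following is a minimal sketch, assuming Python 3 and a made-up page; the meta charset tag in the sample is one common way to tell a browser which encoding to use and is not taken from the course code.

```python
# A sample page (an assumption) that declares its own encoding.
html = '<meta charset="utf-8"><p>百度百科</p>'
raw = html.encode("utf-8")  # the bytes as they would be written to output.html

# What a reader that assumes GBK would see
as_gbk = raw.decode("gbk", errors="replace")
# The intended text
as_utf8 = raw.decode("utf-8")

print(as_gbk)
print(as_utf8)
```

The ASCII markup survives either way, so a charset declaration placed near the top of the file stays readable even before the encoding is known.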

Source: Running the crawler and viewing the crawl results
2018-06-11
花宇

I am using Python 3.5, so importing urllib2 failed: 3.5 no longer has urllib2 and uses urllib directly. The code needs to be changed to:
import urllib.request
response = urllib.request.urlopen(url)

Source: The HTML downloader (html_downloader)
2018-01-11
至繁归于至简_

The URL has now changed to http://baike.baidu.com/item/Python. To crawl the new address, change the link-matching line to: links = soup.find_all('a', href=re.compile(r"/item/(.*)"))
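The effect of the new pattern can be checked on a few hrefs without downloading a page. The following is a minimal sketch, assuming Python 3; the href list is hypothetical, and re.search is used because BeautifulSoup applies a compiled pattern to attribute values in the same way.

```python
import re
from urllib.parse import urljoin

page_url = "http://baike.baidu.com/item/Python"

# Hypothetical href values, as they might appear in <a> tags on the page
hrefs = [
    "/item/Guido",
    "/view/21087.htm",
    "http://example.com/",
    "/item/%E7%BC%96%E7%A8%8B",
]

item_link = re.compile(r"/item/(.*)")

# Keep only /item/ links and turn them into absolute URLs
new_urls = sorted(urljoin(page_url, h) for h in hrefs if item_link.search(h))
for url in new_urls:
    print(url)
```

Only the two /item/ links are kept; the old-style /view/ link and the external link are skipped.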

Source: The scheduler
2018-03-22
流浪在海洋 01:06

In Python 3.x, the urllib and urllib2 libraries were merged into a single urllib package. urllib2.urlopen() became urllib.request.urlopen(), and urllib2.Request() became urllib.request.Request().
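Code that has to run on both versions can hide the rename behind one import. The following is a minimal sketch; the alias name urlrequest is hypothetical, and no network request is made because urlopen() is never called.

```python
try:
    import urllib.request as urlrequest  # Python 3
except ImportError:
    import urllib2 as urlrequest  # Python 2

# Request() and add_header() have the same shape in both modules.
request = urlrequest.Request("http://www.baidu.com")
request.add_header("User-Agent", "Mozilla/5.0")

full_url = request.get_full_url()
agent = request.get_header("User-agent")  # header names are stored capitalized
print(full_url, agent)

# To actually download the page: urlrequest.urlopen(request)
```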

Source: Introduction to the web page downloader for Python crawlers
2016-07-31


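This course has been taken down.

Prerequisites: this is an advanced course on Python development. You should know: 1. Python syntax; 2. the basics of HTML; 3. the basics of regular expressions.

What the instructor says you will learn: 1. What crawler technology is and why it is valuable. 2. The architecture of a crawler. 3. The key modules that make up a crawler: the URL manager, the HTML downloader and the HTML parser. 4. A hands-on crawl of 1,000 Baidu Baike entry pages, covering the crawl strategy, writing the code, and running the crawler. 5. A minimal, extensible set of crawler code that you can modify to crawl any web page on the internet.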
