Putting Learning into Practice: Scraping Liao Xuefeng's Python Tutorial with Python and Turning It into a PDF

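After working through Liao Xuefeng's Python tutorial, I felt I should build something with what I had learned. I also wanted to be able to consult the tutorial at any time, which is how the idea of turning it into a PDF came about.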

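Turning the tutorial into a PDF takes three steps:

1. Create an empty HTML document, scrape each tutorial page, and put each page into a newly created div. This produces a single HTML file that contains every page (BeautifulSoup).
2. Convert that HTML file to PDF (wkhtmltopdf).
3. Liao Xuefeng writes tutorials himself, so the site's anti-scraping measures are fairly good, and the scraping also needs proxy IPs (free or paid).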
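As a rough sketch of step 1, the snippet below merges several pages into one HTML document with BeautifulSoup. The two page strings are made-up stand-ins for downloaded pages, and the `page-N` ids are an illustrative assumption; only the `h4` title and the `x-wiki-content` div mirror the selectors used in the full script at the end of this article.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for pages that would normally be downloaded.
pages = [
    "<h4>Intro</h4><div class='x-wiki-content'><p>Hello</p></div>",
    "<h4>Install</h4><div class='x-wiki-content'><p>World</p></div>",
]

# Start from an empty HTML document.
template = BeautifulSoup(
    "<!DOCTYPE html><html><head><meta charset='UTF-8'></head><body></body></html>",
    "html.parser",
)

for i, page in enumerate(pages):
    soup = BeautifulSoup(page, "html.parser")
    # One wrapper div per tutorial page (the id scheme is an assumption).
    wrapper = template.new_tag("div", id="page-{}".format(i))
    wrapper.append(soup.find("h4"))
    wrapper.append(soup.find("div", class_="x-wiki-content"))
    template.body.append(wrapper)

# A single HTML string that a converter such as wkhtmltopdf could consume.
merged_html = str(template)
print(merged_html)
```

Appending a tag that belongs to another soup moves it into the new tree, so no manual copying is needed here.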

BeautifulSoup

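Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.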

Installation

pip3 install BeautifulSoup4

Getting started

Pass a document to the BeautifulSoup constructor to get a document object. You can pass in either a string or an open file handle, as shown here:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("index.html"))
    soup = BeautifulSoup("<html>data</html>")

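First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
Then Beautiful Soup parses the document with the most suitable parser available; if you specify a parser explicitly, it uses that one instead.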
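To make both points concrete, here is a minimal sketch that names the parser explicitly and shows an HTML entity coming back as a Unicode character. The sample markup is made up for illustration; `html.parser` is the parser that ships with Python's standard library.

```python
from bs4 import BeautifulSoup

# Made-up sample markup containing an HTML entity.
html = "<html><body><p>caf&eacute;</p></body></html>"

# Name the parser explicitly instead of letting Beautiful Soup pick one.
soup = BeautifulSoup(html, "html.parser")

# The entity has been converted to the corresponding Unicode character.
text = soup.p.string
print(text)
```

Naming the parser keeps results consistent across machines, since the automatically chosen parser depends on which parser libraries happen to be installed.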

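Kinds of objects

Beautiful Soup turns a complex HTML document into a complex tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag: put simply, one of the tags in the HTML, such as div or p.
NavigableString: the text inside a tag, as returned by soup.p.string.
BeautifulSoup: represents the document as a whole.
Comment: a special type of NavigableString; its output does not include the comment delimiters.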
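The four kinds can be seen side by side in a small sketch. The markup below is invented for this example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Invented markup: one tag with plain text, one tag holding only a comment.
html = "<div><p>plain text</p><b><!-- a hidden note --></b></div>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p             # a Tag
text = soup.p.string     # a NavigableString
comment = soup.b.string  # a Comment, which is a NavigableString subclass

kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)
print(repr(comment))     # the comment text without the <!-- --> delimiters
```

Because a Comment is also a NavigableString, code that should skip comments needs an explicit `isinstance(node, Comment)` check.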

Tag

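A Tag is simply a tag in the HTML, and BeautifulSoup gives direct access to it. The form is soup.name, where name is the name of a tag in the document. For example:

print(soup.title) prints the title tag, including the tag itself. The output is: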
```
<title>The Dormouse's story</title>
```

print(soup.head) prints the head tag and its contents:

<head><title>The Dormouse's story</title></head>

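If more than one tag matches, accessing a Tag this way returns only the first matching tag in the document.

Tag attributes

Every Tag has two important attributes, name and attrs.

name: for a Tag, name is the tag's own name; soup.p.name is p.
attrs: a dictionary mapping attribute names to values. print(soup.p.attrs) outputs {'class': ['title'], 'name': 'dromouse'}. You can also fetch a single value: print(soup.p.attrs['class']) outputs ['title'], which is a list because an attribute such as class may hold several values. The get method does the same thing, as in print(soup.p.get('class')), and so does indexing the tag directly, as in print(soup.p['class']).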
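The following sketch runs those attribute lookups against a small sample. The markup is an assumption modeled on the output quoted above, with a second `<p>` added to show that only the first match is returned.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the attribute output quoted in the text.
html = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p               # only the first <p> is returned
print(first_p.name)            # the tag's own name
print(first_p.attrs)           # every attribute as a dict
print(first_p["class"])        # class is multi-valued, so this is a list
print(first_p.get("class"))    # same value through get()
print(first_p.get("id"))       # get() returns None for a missing attribute
```

Indexing with `first_p["id"]` would raise a KeyError here, which is why `get()` is the safer choice when an attribute may be absent.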

get

The get method returns the value of an attribute on a tag. It is an important method that comes up in many situations. For example, to get the image URL from an <img src="#"> tag, you can use soup.img.get('src'). A concrete example:

    # Get the class attribute of the first p tag
    print(soup.p.get("class"))

string

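Returns the text content of a tag. It works only when the tag has no child tags, or has exactly one child tag; otherwise it returns None. For example:

    # In the sample document the p tag has no child tags, so its text is returned
    print(soup.p.string)
    # This is None, because the html tag contains many child tags
    print(soup.html.string)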

`get_text()`

Returns all the text inside a tag, including the text of its descendants. This is the most commonly used method.
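The difference between string and get_text() shows up as soon as a tag mixes text with child tags. A minimal sketch, using markup invented for this example:

```python
from bs4 import BeautifulSoup

# Invented markup: <p> contains text, a child tag, and more text.
html = "<div><p>Hello, <b>world</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string      # <b> has a single text child
p_string = soup.p.string      # <p> has several children, so this is None
p_text = soup.p.get_text()    # all text under <p>, descendants included

print(b_string)
print(p_string)
print(p_text)
```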

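Searching the tree

BeautifulSoup is mostly used to walk child nodes and read their attributes. Accessing a tag as an attribute, such as soup.p, only returns the first matching tag in the document. To get all the <p> tags, or anything more than a single tag by name, you need find_all().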

find_all(name, attrs, recursive, text, **kwargs )

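find_all searches a node's descendants and returns every one that matches the filter.

The name parameter is a tag name, such as p, div, or title:

    import re

    # 1. Tag name
    print(soup.find_all('p'))
    # 2. Regular expression
    print(soup.find_all(re.compile('^p')))
    # 3. List
    print(soup.find_all(['p', 'a']))

The attrs parameter can also be used as a filter, and the limit parameter caps the number of results returned.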
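A short sketch of the attrs and limit filters, plus filtering by a keyword argument. The three links are made-up sample data.

```python
from bs4 import BeautifulSoup

# Made-up sample data: three links with different classes and ids.
html = """
<a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a>
<a class="brother" id="link3" href="/tillie">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by attribute values with attrs.
sisters = soup.find_all("a", attrs={"class": "sister"})
# Stop after the first two matches with limit.
first_two = soup.find_all("a", limit=2)
# Keyword arguments are treated as attribute filters as well.
by_id = soup.find_all(id="link3")

print([a["id"] for a in sisters])
print(len(first_two))
print(by_id[0].get_text())
```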

CSS selectors

Tags can also be matched with CSS selector syntax. This again goes through a single function, select(), which returns a list. Its usage is as follows:

    # 1. Find by tag name
    print(soup.select('head'))
    # 2. Find by id
    print(soup.select('#link1'))
    # 3. Find by class
    print(soup.select('.sister'))
    # 4. Find by attribute
    print(soup.select('p[name=dromouse]'))
    # 5. Combined (descendant) selector
    print(soup.select("body p"))

wkhtmltopdf

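wkhtmltopdf converts HTML to PDF.
pdfkit is a Python wrapper around wkhtmltopdf. It supports converting URLs, local files, and text content to PDF, and in the end it still invokes the wkhtmltopdf command.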

Installation

Install wkhtmltopdf first, then pdfkit.

wkhtmltopdf: https://wkhtmltopdf.org/downloads.html
pdfkit: pip3 install pdfkit

Converting a URL, file, or string

    import pdfkit

    pdfkit.from_url('http://google.com', 'out.pdf')
    pdfkit.from_file('index.html', 'out.pdf')
    pdfkit.from_string('Hello!', 'out.pdf')

Converting a list of URLs or file names

    pdfkit.from_url(['google.com', 'baidu.com'], 'out.pdf')
    pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')

Converting an open file

    with open('file.html') as f:
        pdfkit.from_file(f, 'out.pdf')

Custom options

    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'no-outline': None,
        'outline-depth': 10,
    }
    pdfkit.from_url('http://google.com', 'out.pdf', options=options)

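Using a proxy IP

After a dozen or so tutorial pages had been scraped, the scraping triggered an error.

Liao Xuefeng's anti-scraping protection evidently works well, so I had to use proxy IPs. I tried the free Xici proxy list first and ended up choosing the paid Abuyun service, whose response speed and stability seemed OK.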
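One way to structure proxy use is sketched below: build a requests-style proxies dict, then retry a failing request a bounded number of times. Both helper names (`build_proxies`, `fetch_with_retry`) and the example host are hypothetical and are not part of requests or of any proxy provider's API. The fetch step is passed in as a function so the sketch runs without network access.

```python
import time


def build_proxies(host, port, user, password):
    """Return a requests-style proxies dict for an authenticated HTTP proxy."""
    proxy_url = "http://{}:{}@{}:{}".format(user, password, host, port)
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_retry(fetch, url, max_attempts=5, delay=0.4):
    """Call fetch(url) until it succeeds or max_attempts is used up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: proxies fail in many ways
            last_error = exc
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(
        "giving up on {} after {} attempts".format(url, max_attempts)
    ) from last_error


# Hypothetical credentials and host, for illustration only.
proxies = build_proxies("proxy.example.com", "9020", "user", "secret")
print(proxies["http"])
```

With requests installed, the fetch argument could be something like `lambda u: requests.get(u, proxies=proxies, headers=headers, timeout=10)`. Capping the number of attempts avoids looping forever when the proxy account itself is the problem.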

Results

[Screenshot of the script running]

[Screenshot of the generated PDF]

The code is as follows:

    import time

    import pdfkit
    import requests
    from bs4 import BeautifulSoup


    # Uses the Abuyun proxy service.
    # You can drop the proxy or switch to a different one.
    def get_soup(target_url):
        proxy_host = "http-dyn.abuyun.com"
        proxy_port = "9020"
        proxy_user = "your username"
        proxy_pass = "your password"
        proxy_meta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
            "host": proxy_host,
            "port": proxy_port,
            "user": proxy_user,
            "pass": proxy_pass,
        }
        proxies = {
            "http": proxy_meta,
            "https": proxy_meta,
        }
        headers = {'User-Agent':
                       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
        # Keep retrying until the request succeeds
        flag = True
        while flag:
            try:
                resp = requests.get(target_url, proxies=proxies, headers=headers)
                flag = False
            except Exception as e:
                print(e)
                time.sleep(0.4)
        soup = BeautifulSoup(resp.text, 'html.parser')
        return soup


    def get_toc(url):
        soup = get_soup(url)
        toc = soup.select("#x-wiki-index a")
        print(toc[0]['href'])
        return toc


    # Download one tutorial page
    def download_html(url, depth):
        soup = get_soup(url)
        # Map the table-of-contents depth to a heading level (h1 or h2)
        if int(depth) <= 1:
            depth = '1'
        elif int(depth) >= 2:
            depth = '2'
        title = soup.select(".x-content h4")[0]
        new_title = BeautifulSoup('<h' + depth + '>' + title.string + '</h' + depth + '>', 'html.parser')
        print(new_title)
        # Load lazy-loaded images by copying data-src into src
        images = soup.find_all('img')
        for x in images:
            # Skip images that have no data-src attribute
            if x.get('data-src'):
                x['src'] = x['data-src']
        div_content = soup.find('div', class_='x-wiki-content')
        return new_title, div_content


    def convert_pdf(template):
        html_file = "python-tutorial-pdf.html"
        with open(html_file, mode="w", encoding="utf8") as code:
            code.write(str(template))
        pdfkit.from_file(html_file, 'python-tutorial-pdf.pdf')


    if __name__ == '__main__':
        # HTML template
        template = BeautifulSoup(
            '<!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <link rel="stylesheet" '
            'href="https://cdn.liaoxuefeng.com/cdn/static/themes/default/css/all.css?v=bc43d83"> '
            '<script src="https://cdn.liaoxuefeng.com/cdn/static/themes/default/js/all.js?v=bc43d83"></script> '
            '</head> <body> </body> </html>',
            'html.parser')
        # Table of contents
        toc = get_toc('https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000')
        for i, x in enumerate(toc):
            url = 'https://www.liaoxuefeng.com' + x['href']
            # Tutorial page HTML
            content = download_html(url, x.parent['depth'])
            # Add the new page to the template
            new_div = template.new_tag('div', id=i)
            template.body.insert(3 + i, new_div)
            new_div.insert(3, content[0])
            new_div.insert(3, content[1])
            time.sleep(0.4)
        convert_pdf(template)

Original source: https://www.cnblogs.com/morethink/p/10252532.html

Author: morethink
