
Implementing an IP Proxy Pool for a Python Crawler

Tags: Python

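Often, when crawling pages with multiple threads, or simply to get around anti-crawler measures, we need to send our requests through proxy IPs. Below is a basic way to do that.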

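Plenty of websites offer proxy IP lists. Reliability is basically proportional to the money you pay. The free IPs offered by providers in China are mostly unusable, so a reliable proxy there means paying; providers abroad are somewhat better, and some of their free IPs are fairly dependable.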

A quick search turned up a site. I had planned to scrape the IPs from it by hand, but it turns out a ready-made txt file can be downloaded directly:
http://www.thebigproxylist.com/

After downloading it, let's try fetching the Baidu home page through each of the proxies.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import urllib.request

url = "http://www.baidu.com"

# Each line of the downloaded file holds one proxy as "ip:port".
with open("c:\\temp\\thebigproxylist-17-12-20.txt", "r") as fp:
    lines = fp.readlines()

for line in lines:
    ip = line.strip()  # drop the trailing newline left by readlines()
    if not ip:
        continue
    try:
        print("Current proxy IP " + ip)
        # Route plain HTTP requests through this proxy
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        print("Passed")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The results:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP 137.74.168.174:80
Passed
-----------------------------
Current proxy IP 103.28.161.68:8080
Passed
-----------------------------
Current proxy IP 91.151.106.127:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 177.136.252.7:3128
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
-----------------------------
Current proxy IP 47.89.22.200:80
Passed
-----------------------------
Current proxy IP 118.69.61.57:8888
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 192.241.190.167:8080
Passed
-----------------------------
Current proxy IP 185.124.112.130:80
Passed
-----------------------------
Current proxy IP 83.65.246.181:3128
Passed
-----------------------------
Current proxy IP 79.137.42.124:3128
Passed
-----------------------------
Current proxy IP 95.0.217.32:8080
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
-----------------------------
Current proxy IP 104.131.94.221:8080
Passed
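Lines read from a text file carry trailing newlines, and a downloaded list can contain blank or malformed entries, so it may help to clean the list before testing any proxy. Here is a minimal sketch of that step. The function name `parse_proxy_lines` and the validation rules (IPv4 only, port between 1 and 65535, extra comma-separated fields ignored) are my own assumptions, not something the proxy site specifies.

```python
import re

# "ip:port" with a dotted-quad IPv4 address; hostnames are deliberately rejected.
_PROXY_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})$")


def parse_proxy_lines(lines):
    """Return the well-formed, de-duplicated "ip:port" entries from raw lines."""
    seen = set()
    proxies = []
    for raw in lines:
        # Keep only the first comma-separated field and drop whitespace/newlines.
        candidate = raw.split(",")[0].strip()
        match = _PROXY_RE.match(candidate)
        if not match:
            continue
        octets = [int(g) for g in match.groups()[:4]]
        port = int(match.group(5))
        # Reject impossible addresses such as 999.1.1.1 or port 70000.
        if any(o > 255 for o in octets) or not 0 < port <= 65535:
            continue
        if candidate not in seen:
            seen.add(candidate)
            proxies.append(candidate)
    return proxies


if __name__ == "__main__":
    sample = ["137.74.168.174:80\n", "\n", "not a proxy\n", "137.74.168.174:80\n"]
    print(parse_proxy_lines(sample))
```

The cleaned list can then be fed to the test loop in place of the raw lines.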

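However, this approach only suits fairly stable IP sources. If the IPs are unstable, the downloaded text file may go stale quickly, so it is better to fetch the latest IP addresses dynamically. Many sites provide an API that can be queried in real time.
Using the same site as before, this time we call its API. The request has to be disguised as coming from a browser for the fetch to work.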

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install globally so that urlopen() sends the browser User-Agent
urllib.request.install_opener(opener)

api = "http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf"
data = urllib.request.urlopen(api).read().decode("utf8")
ippool = data.split("\n")

url = "http://www.baidu.com"
for entry in ippool:
    # Each entry is comma-separated; the first field is "ip:port"
    ip = entry.split(",")[0].strip()
    if not ip:
        continue
    try:
        print("Current proxy IP " + ip)
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        print("Passed")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The results:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP 213.233.57.134:80
HTTP Error 403: Forbidden
-----------------------------
Current proxy IP 144.76.81.79:3128
Passed
-----------------------------
Current proxy IP 45.55.132.29:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 180.254.133.124:8080
Passed
-----------------------------
Current proxy IP 5.196.215.231:3128
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 177.99.175.195:53281
HTTP Error 503: Service Unavailable
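The scripts so far only print whether each proxy passed. One possible next step toward an actual pool is to keep the proxies that passed, hand out a random one per request, and evict any that keep failing. The sketch below is only an illustration: the class name `ProxyPool`, its methods, and the `max_failures` threshold are all hypothetical choices of mine.

```python
import random


class ProxyPool:
    """A small in-memory pool of proxies that have passed a check."""

    def __init__(self, proxies=(), max_failures=2):
        # Map each proxy to its count of consecutive failures.
        self._failures = {p: 0 for p in proxies}
        self._max_failures = max_failures

    def __len__(self):
        return len(self._failures)

    def add(self, proxy):
        """Add a proxy; an existing entry keeps its failure count."""
        self._failures.setdefault(proxy, 0)

    def get(self):
        """Return a random proxy, or raise LookupError if the pool is empty."""
        if not self._failures:
            raise LookupError("proxy pool is empty")
        return random.choice(list(self._failures))

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self._failures:
            self._failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure and evict the proxy once it fails too often."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]


if __name__ == "__main__":
    # The two proxies that passed in the run above.
    pool = ProxyPool(["144.76.81.79:3128", "180.254.133.124:8080"])
    proxy = pool.get()
    pool.report_failure(proxy)
    pool.report_failure(proxy)  # the second consecutive failure evicts it
    print(len(pool), "proxy left in the pool")
```

A crawler would call `get()` before each request and report the outcome afterwards, so that dead proxies drop out on their own.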

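Working through the list one entry at a time in a plain for loop is really too slow, so I tried switching to multiple threads, which is much faster.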

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import queue
import threading
import urllib.request

# Number of threads
n_thread = 10
# Create the queue (named so that it does not shadow the queue module)
host_queue = queue.Queue()


class ThreadClass(threading.Thread):
    def __init__(self, queue):
        super(ThreadClass, self).__init__()
        # Queue this thread takes its jobs from
        self.queue = queue

    def run(self):
        while True:
            # Get a job from the queue
            host = self.queue.get().strip()
            print(self.name + ":" + host)
            try:
                proxy = urllib.request.ProxyHandler({"http": host})
                opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
                # Use this thread's own opener: install_opener() is global,
                # so the threads would overwrite each other's proxy
                url = "http://www.baidu.com"
                data = opener.open(url).read().decode("utf-8", "ignore")
                print("Passed")
                print("-----------------------------")
            except Exception as err:
                print(err)
                print("-----------------------------")
            finally:
                # Signal to the queue that the job is done
                self.queue.task_done()


# Start the worker threads
for i in range(n_thread):
    t = ThreadClass(host_queue)
    t.daemon = True
    t.start()

# Read the file line by line and put each line on the queue
with open("c:\\temp\\thebigproxylist-17-12-20.txt", "r") as hostfile:
    for line in hostfile:
        host_queue.put(line)

# Wait on the queue until everything has been processed
host_queue.join()
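As an alternative to a hand-written Thread subclass and queue, the same concurrent check can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`. This is a different technique from the script above, not a rewrite of it. The function names `check_proxy` and `filter_working`, the 5-second timeout, and the injectable `check` parameter are my own assumptions; the timeout in particular is there so that a dead proxy cannot stall a worker for long.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_proxy(proxy, url="http://www.baidu.com", timeout=5):
    """Return True if `url` can be fetched through `proxy` ("ip:port")."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    # A private opener per call: nothing global is installed, so threads
    # cannot overwrite each other's proxy settings.
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return True
    except Exception:
        return False


def filter_working(proxies, check=check_proxy, workers=10):
    """Check all proxies concurrently and return those that passed, in order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # map() yields results in the same order as the input list.
        results = list(executor.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]


if __name__ == "__main__":
    # Offline demonstration with a stand-in checker; leave `check` at its
    # default (check_proxy) to test real proxies over the network.
    def fake_check(proxy):
        return proxy.endswith(":80")

    print(filter_working(["137.74.168.174:80", "177.136.252.7:3128"], check=fake_check))
```

Because `check` is a parameter, the concurrency logic can be tried without network access, and the returned list can be used directly as the contents of a pool.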
