Developing a Simple Web Crawler in Python: Study Notes


UFO2015 04:47

# coding:utf8
__author__ = 'xray'
import urllib2
import cookielib

url = "https://rollbar.com/docs/"

# Method 1: pass the URL straight to urlopen
print 'Method 1'
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

# Method 2: build a Request so that a header can be added
print 'Method 2'
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print response2.read()

# Method 3: install an opener that stores cookies in a CookieJar
print 'Method 3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()

Source: Python crawler, urllib2 example code demo

2020-08-06

慕后端7165360

import urllib2
# the URL needs an explicit scheme, otherwise urlopen raises ValueError
url = "http://www.baidu.com"
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())
print "Method 2"
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

Source: Python crawler, urllib2 example code demo
2020-04-11
One2469170 03:35

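HTTPCookieProcessor
ProxyHandler
HTTPSHandler
HTTPRedirectHandler
These handlers cover special scenarios such as simulated login (cookies), proxies, HTTPS, and redirects; request headers and similar parameters are set on the request itself.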
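
As a rough illustration of how these handlers are combined, here is a minimal Python 3 sketch (`urllib.request` is the Python 3 home of `urllib2`). The proxy address is a hypothetical placeholder, and nothing is sent over the network; the script only builds the opener and lists the handlers it ended up with.

```python
import urllib.request
from http import cookiejar

# Hypothetical proxy address, used only for illustration.
PROXY = {"http": "http://127.0.0.1:8080"}

cj = cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj),  # keeps cookies between requests (login sessions)
    urllib.request.ProxyHandler(PROXY),      # routes http:// requests through the proxy
)

# Headers are not set by a handler; addheaders is sent with every request
# made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# build_opener also adds default handlers, including HTTPRedirectHandler.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Calling `opener.open(url)` would then use all of these handlers at once; `urllib.request.install_opener(opener)` makes plain `urlopen` use them too.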

Source: Python crawler, three ways to download a web page with urllib2
2020-01-31
慕前端0201033

URL manager
Page downloader (urllib2)
Page parser (BeautifulSoup)

Source: Developing a Simple Web Crawler in Python, course introduction
2020-01-18
Miller_Xu
import re  # module needed for regular expressions

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # (document string, parser)

print('All links:')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())  # (tag name, URL, text)

print('A specific link (the Lacie link):')
# link_node = soup.find('a', id="link2") gives the same result
link_node = soup.find('a', href='http://example.com/lacie')  # note: find returns one node, find_all returns all
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex fuzzy match:')
# the r prefix makes a raw string, so a backslash in the regex is written once instead of twice
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Paragraph text (selected by class):')
p_node = soup.find('p', class_="story")
print(p_node.name, p_node.get_text())

Output:
```
All links:
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
A specific link (the Lacie link):
a http://example.com/lacie Lacie
Regex fuzzy match:
a http://example.com/tillie Tillie
Paragraph text (selected by class):
p Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
```
Source: BeautifulSoup example test
2019-09-17
我要吃肉123_ 03:06

Python 3:
# coding:utf-8
import urllib.request  # "import urllib" alone does not reliably load the request submodule
from http import cookiejar

url = "http://www.baidu.com"

print("Method 1")
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print("Method 2")
request = urllib.request.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
# pass the Request object, not the bare URL, so the header is actually sent
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print("Method 3")
cj = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(cj)
print(len(response3.read()))

Source: Python crawler, urllib2 example code demo
2019-08-31
qq_慕码人328674

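Analyzing the target: 1. URL format (the entry point of the pages)
2. Data format (the format of the content to scrape, mainly its tags and classes)
3. Page encoding (e.g. UTF-8)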

Source: Python crawler example, analyzing the target
2019-07-05
要努力的L

Using Python 3
1.
from urllib import request as urllib2
import http.cookiejar
2.
Every print statement needs parentheses: print(...)
3.
cj = http.cookiejar.CookieJar()
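
The renaming above can also be wrapped in a fallback import, so that the course's `urllib2` / `cookielib` names keep working on both interpreters. This is only a sketch of one common pattern, not something shown in the course; it builds an opener but makes no network request.

```python
try:
    # Python 2 module names, as used in the course videos
    import urllib2
    import cookielib
except ImportError:
    # Python 3: the same functionality lives under new names
    from urllib import request as urllib2
    import http.cookiejar as cookielib

# From here on the course code can be used unchanged (apart from print()).
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
print(len(cj))  # number of cookies stored so far
```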

Source: Python crawler, urllib2 example code demo
2019-06-20
要努力的L

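Three ways to download a web page with urllib2:
1. Pass the URL to urllib2.urlopen(url) to download the page. Code:
import urllib2
# request the page directly
response = urllib2.urlopen('http://www.baidu.com')
# get the status code; 200 means the request succeeded
print response.getcode()
# read the content
cont = response.read()
2. Add data (user input to submit to the server) and HTTP headers (header information to submit to the server); urllib2.urlopen then takes a Request object as its argument. Code:
import urllib
import urllib2
url = 'http://www.baidu.com'
# create the Request object
request = urllib2.Request(url)
# add data, e.g. a=1 (add_data takes a single, already URL-encoded string)
request.add_data(urllib.urlencode({'a': '1'}))
# add an HTTP header, e.g. disguise the crawler as a Mozilla browser
request.add_header('User-Agent', 'Mozilla/5.0')
# send the request and get the result
response = urllib2.urlopen(request)
3. Some scenarios need special handlers: pages that require a login need cookie handling, via HTTPCookieProcessor; pages reachable only through a proxy use ProxyHandler; pages served over encrypted HTTPS use HTTPSHandler; pages whose URLs redirect to one another use HTTPRedirectHandler. These handlers are passed to urllib2.build_opener, which gives urllib2 the ability to handle these scenarios. After that, urllib2.urlopen is still used to request a URL or a Request object and download the page.

Code (for example, adding cookie handling):
# import the urllib2 and cookielib modules
import urllib2, cookielib
# create the cookie container
cj = cookielib.CookieJar()
# create a handler from the CookieJar and pass it to urllib2.build_opener to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# install the opener into urllib2
urllib2.install_opener(opener)
# fetch the page with the cookie-enabled urllib2
response = urllib2.urlopen("http://www.baidu.com")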
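
For readers on Python 3, the second method (data plus header) might look roughly like the sketch below. It assumes the same example URL and the `a=1` form field from the notes, and it only builds and inspects the request instead of sending it, so it runs without network access.

```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com"  # same example URL as in the notes

# In Python 3, form data must be URL-encoded and then converted to bytes.
data = urllib.parse.urlencode({"a": "1"}).encode("utf-8")

request = urllib.request.Request(url, data=data)
request.add_header("User-Agent", "Mozilla/5.0")

# Inspect the request without sending it.
print(request.get_method())              # a request with data defaults to POST
print(request.data)
print(request.get_header("User-agent"))  # header names are stored capitalized
```

Passing `request` to `urllib.request.urlopen` would then send it, as in the Python 2 version above.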

Source: Python crawler, three ways to download a web page with urllib2
2019-06-20
要努力的L

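What the URL manager is for: preventing repeated crawling and crawl loops.
URL manager functions:
1. Check whether a URL about to be added is already in the container
2. Add new URLs to the to-crawl set
3. Check whether there are any URLs left to crawl
4. Get a URL to crawl
5. Move the URL from the to-crawl set to the crawled set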
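
The five functions above can be sketched with two in-memory sets. The class and method names here (`UrlManager`, `add_new_url`, and so on) are illustrative choices, and keeping everything in memory is an assumption; a larger crawler might keep these sets in a database or cache instead.

```python
class UrlManager:
    """Tracks which URLs still need crawling and which are already done."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Functions 1 and 2: add only if the URL is in neither set.
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        # Function 3: is anything left to crawl?
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Functions 4 and 5: hand out one URL and mark it as crawled.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/a", "http://example.com/a"])  # duplicate is ignored
first = manager.get_new_url()
manager.add_new_url(first)  # already crawled, so it is not queued again
print(first, manager.has_new_url())
```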

Source: Python crawler, URL management
2019-06-20

weixin_慕神607169

import urllib.request
from http import cookiejar

url = 'http://www.baidu.com'
print('Method 1')
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print('Method 2')
request = urllib.request.Request(url)
request.add_header('user-agent', 'Mozilla/5.0')
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print('Method 3')
cj = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(cj)
print(response3.read())

Source: Python crawler, urllib2 example code demo

2019-06-14

霜花似雪

Five modules:
Main crawler scheduler: spider_main
URL manager: url_manager
Page downloader: html_downloader
Page parser: html_parser
Outputer: html_outputer
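
How the scheduler ties the other four modules together can be sketched as a single loop. This is a simplified stand-in, not the course's actual `spider_main`: the function name `crawl`, the in-memory `FAKE_SITE`, and the fake downloader and parser are all hypothetical, introduced so the loop can run without network access.

```python
def crawl(root_url, download, parse, max_pages=10):
    """Scheduler loop.

    download(url)     -> page content, or None on failure   (downloader role)
    parse(url, page)  -> (set of new URLs, extracted data)  (parser role)
    """
    new_urls, old_urls = {root_url}, set()  # URL manager role: to-crawl and crawled sets
    collected = []                          # outputer role: gathered data
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        page = download(url)
        if page is None:
            continue  # skip pages that failed to download
        found_urls, data = parse(url, page)
        new_urls |= found_urls - old_urls   # never re-queue crawled URLs
        collected.append(data)
    return collected


# Hypothetical in-memory "site": each page is just its list of outgoing links.
FAKE_SITE = {
    "page/a": ["page/b", "page/c"],
    "page/b": ["page/a"],
    "page/c": [],
}

def fake_download(url):
    return FAKE_SITE.get(url)

def fake_parse(url, links):
    return set(links), {"url": url, "link_count": len(links)}


results = crawl("page/a", fake_download, fake_parse)
print(sorted(item["url"] for item in results))
```

In a real crawler, `download` would wrap `urlopen` and `parse` would wrap BeautifulSoup, while the loop itself stays the same.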

Source: Scheduler program
2019-05-25
霜花似雪 05:05

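Determine the target ---> analyze the target (URL format, data format, page encoding) ---> write the code
Analyzing the target: deciding on a strategy for scraping the site's data.
URL format: limits the range of pages to crawl; without a limit, the crawler fetches many unrelated pages and wastes resources.
Data format: analyze the tags that hold the data on each entry page, such as the title.
Page encoding: specify the page's encoding in the parser so that the page is parsed correctly.

The entry-page URLs in the links are not complete URLs, so the code has to complete them;
the title data sits in the <h1> tag.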
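
Limiting the URL format and completing relative links can be sketched with the standard library alone. The page URL, the sample `hrefs`, and the `/view/<number>.htm` pattern below are hypothetical values chosen for illustration; the real pattern has to be read off the target site.

```python
import re
from urllib.parse import urljoin

# Hypothetical current page and the href values found on it.
page_url = "http://example.com/view/21087.htm"
hrefs = [
    "/view/10812319.htm",
    "/view/20965.htm",
    "/help/about.html",            # not an entry page
    "http://other.example.org/x",  # different site
]

# Assumed URL format of entry pages: /view/<digits>.htm
entry_pattern = re.compile(r"^/view/\d+\.htm$")

# Keep only links that match the format, then complete them against the page URL.
full_urls = [urljoin(page_url, href) for href in hrefs if entry_pattern.match(href)]
print(full_urls)
```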

Source: Python crawler example, analyzing the target
2019-05-25

慕娘1129934

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen("http://src.51elab.com")
html = response.read()
data = html.decode('utf-8')
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')
# print(soup.find_all('span'))


for item in soup.find_all("a"):
    # skip links whose tag has no single text string
    if item.string is None:
        continue
    else:
        # print(type(item.string))
        print(item.string, ":", item.get("href"))

Fetches a page under Python 3 and prints the text and URL of each link.

Source: BeautifulSoup example test

2019-04-07

Writebug
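1. Today's progress: wrote the main program, the URL manager, the page parser, the downloader, and the outputer.
2. I used Python 3.6; the imports needed were:
```
import re
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
```
3. output.html came out garbled when using encode("utf-8"); removing the encode call and specifying the page encoding instead fixed the garbled text.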
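
One plausible reading of the encoding problem, stated here as an assumption: in Python 3 a file opened in text mode already expects `str`, so calling `.encode("utf-8")` on each value puts a bytes literal into the page. The sketch below writes plain strings and gives the encoding to `open()` and to the HTML `meta` tag instead. The `rows` data and the temporary output path are hypothetical.

```python
import os
import tempfile

# Hypothetical crawl results; the title contains non-ASCII text on purpose.
rows = [{"url": "http://example.com/view/1.htm", "title": "网络爬虫"}]

out_path = os.path.join(tempfile.gettempdir(), "output.html")

# Give the encoding to open() and write str values as-is (no .encode()).
with open(out_path, "w", encoding="utf-8") as fout:
    fout.write("<html><head><meta charset='utf-8'></head><body><table>\n")
    for row in rows:
        fout.write("<tr><td>%s</td><td>%s</td></tr>\n" % (row["url"], row["title"]))
    fout.write("</table></body></html>\n")

# Read the file back to confirm the text survived the round trip.
with open(out_path, encoding="utf-8") as fin:
    content = fin.read()
print("网络爬虫" in content)
```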
Source: Running the crawler and showing the crawl results
2019-03-26


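Course prerequisites: this is an advanced Python development course; you should already know 1. Python syntax; 2. basic HTML; 3. basic regular expressions.

What the course covers, according to the instructor: 1. what crawler technology is and why it is valuable; 2. crawler architecture; 3. the key modules of a crawler: URL manager, HTML downloader, and HTML parser; 4. a hands-on project that scrapes 1000 Baidu Baike entry pages, covering the crawl strategy, the code, and a run of the crawler; 5. a minimal, extensible crawler codebase that can be modified to scrape other web pages.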
