Search results for "http" (wiki)

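The request has been fulfilled and a new resource has been created as a result; the URI of that resource is returned in the Location header. Typical scenario: an API request creates a resource object, and the response carries the address of the new resource. In practice, most projects today return the ID of the newly created resource instead and then query its details by that ID. Many HTTP status codes are defined in fine detail, and real-world practice does not always follow the theory.

Client:

    POST /add-article HTTP/1.1
    Content-Type: application/json

    { "article": "http" }

Server:

    HTTP/1.1 201 Created
    Location: /article/01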
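As a rough illustration of the exchange above, the following Python sketch models a handler that stores a new article and answers with 201 Created plus a Location header. The names create_article and ARTICLES and the two-digit ID format are assumptions made for this example only; they do not come from any framework.

```python
import itertools

# Hypothetical in-memory store and ID generator (assumptions for illustration).
ARTICLES = {}
_next_id = itertools.count(1)


def create_article(payload):
    """Store the payload and return (status, headers, body) for the response."""
    article_id = "%02d" % next(_next_id)
    ARTICLES[article_id] = payload
    # 201 Created: the Location header carries the URI of the new resource.
    headers = {"Location": "/article/%s" % article_id}
    return 201, headers, ""


status, headers, body = create_article({"article": "http"})
print(status, headers["Location"])
```

A client that receives this response can follow the Location value directly instead of building the detail URL from an ID.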

5.1 Defining RESTful-style API documentation from the requirements

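Since we are building a product browsing page, we may as well implement create, read, update, and delete for products. Designing RESTful-style interfaces is not difficult; normally the project team agrees on them together. Here we define them as follows:

Verb     Meaning                        URL
GET      Query product (id=1)           http://127.0.0.1:8080/goods/1
GET      Query the product list         http://127.0.0.1:8080/goods
POST     Add a product                  http://127.0.0.1:8080/goods
PUT      Update product (id=1)          http://127.0.0.1:8080/goods/1
DELETE   Delete product (id=1)          http://127.0.0.1:8080/goods/1

Tips: The RESTful style distinguishes operation types by HTTP verb (GET / POST / PUT / DELETE) and keeps the URL format fairly fixed. That is all there is to it.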
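The idea that the verb selects the operation while the URL stays fixed can be sketched as a small lookup in Python. This is only an illustration of the interface list above; the action labels (get_one, get_list, and so on) and the resolve function are hypothetical names chosen for this sketch.

```python
import re

# Route table mirroring the interface list: (HTTP verb, URL pattern, action).
# The action names are hypothetical labels chosen for this sketch.
ROUTES = [
    ("GET", re.compile(r"^/goods/(?P<id>\d+)$"), "get_one"),
    ("GET", re.compile(r"^/goods$"), "get_list"),
    ("POST", re.compile(r"^/goods$"), "add"),
    ("PUT", re.compile(r"^/goods/(?P<id>\d+)$"), "update"),
    ("DELETE", re.compile(r"^/goods/(?P<id>\d+)$"), "delete"),
]


def resolve(method, path):
    """Return (action, path_params), or (None, {}) when nothing matches."""
    for verb, pattern, action in ROUTES:
        match = pattern.match(path)
        if verb == method.upper() and match:
            return action, match.groupdict()
    return None, {}


print(resolve("GET", "/goods/1"))
print(resolve("DELETE", "/goods/1"))
```

The same path /goods/1 resolves to three different actions depending only on the verb, which is the core of the convention.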

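Next lesson preview

Besides HTML/CSS/JS, learning web development also requires some understanding of the HTTP protocol, which is likewise essential foundational knowledge. In the next lesson we will study HTTP and the role it plays in web development. We will also introduce common web development concepts to give you a clearer picture of the field.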

ES6+ Object.is()

2. Parsing URL parameters in Flask

After the server receives the data sent by the client, it wraps the data in a request object. In Flask the request object is the module-level variable flask.request, which exposes many attributes. Assuming the URL is http://localhost/query?userId=123, the URL-related attributes are:

Attribute   Value for the example URL
url         http://localhost/query?userId=123
base_url    http://localhost/query
host        localhost
host_url    http://localhost/
path        /query
full_path   /query?userId=123

The following Flask program, request.py, prints the URL-related attributes of request:

    #!/usr/bin/python3
    from flask import Flask
    from flask import request
    app = Flask(__name__)

    def echo(key, value):
        print('%-10s = %s' % (key, value))

    @app.route('/query')
    def query():
        echo('url', request.url)
        echo('base_url', request.base_url)
        echo('host', request.host)
        echo('host_url', request.host_url)
        echo('path', request.path)
        echo('full_path', request.full_path)

        print()
        print(request.args)
        print('userId = %s' % request.args['userId'])
        return 'hello'

    if __name__ == '__main__':
        app.run(port = 80)

Line 10 defines query(), the handler for the path /query. Lines 11 to 16 print the URL-related attributes of the request object. The query parameters of the URL are stored in request.args, and line 20 prints the value of the query parameter userId. Enter http://localhost/query?userId=123 in the browser, and the Flask program prints the following in the terminal:

    url        = http://localhost/query?userId=123
    base_url   = http://localhost/query
    host       = localhost
    host_url   = http://localhost/
    path       = /query
    full_path  = /query?userId=123

    ImmutableMultiDict([('userId', '123')])
    userId = 123
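To see how these attributes relate to the parts of a URL without starting a server, the same decomposition can be approximated with Python's standard urllib.parse module. This is a sketch that mirrors the attribute names of the table above; it is not how Flask computes them internally.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://localhost/query?userId=123"
parts = urlsplit(url)

# Rebuild the values listed in the table from the URL components.
base_url = "%s://%s%s" % (parts.scheme, parts.netloc, parts.path)
host_url = "%s://%s/" % (parts.scheme, parts.netloc)
full_path = "%s?%s" % (parts.path, parts.query)
# parse_qs maps each key to a list, because a key may appear more than once.
args = parse_qs(parts.query)

print("base_url  =", base_url)
print("host      =", parts.netloc)
print("host_url  =", host_url)
print("path      =", parts.path)
print("full_path =", full_path)
print("userId    =", args["userId"][0])
```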

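1.3 Server-side support

The server must respond appropriately to the HTTP request initiated by the client. The key step is setting the Content-Type field of the HTTP response header to text/event-stream. The tutorial illustrates this with a PHP example.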
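The server side of an event stream can be sketched in Python as follows. The function format_sse and the constant SSE_HEADERS are names invented for this sketch, and the Cache-Control header is a common companion setting assumed here rather than something the text requires. The message layout (optional event: field, one data: field per payload line, a blank line as terminator) follows the Server-Sent Events format.

```python
def format_sse(data, event=None):
    """Serialize one Server-Sent Events message as text."""
    lines = []
    if event is not None:
        lines.append("event: %s" % event)
    # Multi-line payloads need one "data:" field per line.
    for line in str(data).splitlines() or [""]:
        lines.append("data: %s" % line)
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


# Response headers for the stream; Content-Type is the essential one.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",  # assumption: commonly set, not mandated by the text
}

print(format_sse("hello"), end="")
print(format_sse("line one\nline two", event="tick"), end="")
```

A server would send SSE_HEADERS once, keep the connection open, and write one formatted message per event.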

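1.4 The server directive

    Syntax:  server { ... }
    Default: —
    Context: http

The context of server is http, which means a server block may only appear inside an http block; anywhere else it is an error. A server block is itself a collection of directives, such as listen, which sets the port on which HTTP requests are accepted, along with server_name, root, index, and others.

    ...
    http {
        server {
            # listening port
            listen 8089;
            server_name localhost;
            # root path for static resources
            root /data/yum_source;
            # enable directory listing
            autoindex on;
            # default page of the site: look for index.html or index.htm
            index index.html index.htm;
        }
        ...
    }
    ...

Next we take a first look at Nginx configuration for a few scenarios, using only simple directives.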

4.9 Developing the product controller class

We keep following the RESTful style used earlier and define the backend interfaces as follows:

Verb     Meaning                        URL
GET      Query product (id=1)           http://127.0.0.1:8080/goods/1
GET      Query the product list         http://127.0.0.1:8080/goods
POST     Add a product                  http://127.0.0.1:8080/goods
PUT      Update product (id=1)          http://127.0.0.1:8080/goods/1
DELETE   Delete product (id=1)          http://127.0.0.1:8080/goods/1

Based on this interface list, the controller class is implemented as follows:

Example:

    import java.util.List;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.*;

    /**
     * Product controller class
     */
    @RestController
    public class GoodsController {
        @Autowired
        private GoodsService goodsService;

        /**
         * Get a product by id
         */
        @GetMapping("/goods/{id}")
        public GoodsDo getOne(@PathVariable("id") long id) {
            return goodsService.getById(id);
        }

        /**
         * Get the product list
         */
        @GetMapping("/goods")
        public List<GoodsDo> getList() {
            return goodsService.getList();
        }

        /**
         * Add a product
         */
        @PostMapping("/goods")
        public void add(@RequestBody GoodsDo goods) {
            goodsService.add(goods);
        }

        /**
         * Edit a product
         */
        @PutMapping("/goods/{id}")
        public void update(@PathVariable("id") long id, @RequestBody GoodsDo goods) {
            // Update the product with the given id
            goods.setId(id);
            goodsService.edit(goods);
        }

        /**
         * Remove a product
         */
        @DeleteMapping("/goods/{id}")
        public void delete(@PathVariable("id") long id) {
            goodsService.remove(id);
        }
    }

2.1 Creating the Spring Boot web server application

The project directory structure is as follows:

    ▾ OAuth2ResourceServer/
      ▾ src/
        ▾ main/
          ▾ java/imooc/springsecurity/oauth2/server/
            ▾ config/
                OAuth2ResourceServerController.java                # controller that plays the role of the resource
                OAuth2ResourceServerSecurityConfiguration.java     # all resource server configuration lives here
              OAuth2ResourceServerApplication.java                 # startup class
          ▾ resources/
              application.yml                                      # address of the OAuth2.0 authorization server, etc.
        ▸ test/
        pom.xml

Add the dependencies to pom.xml. Compared with the "username and password authentication example", note the additional OAuth2 auto-configuration dependencies: spring-security-oauth2-autoconfigure and spring-security-oauth2-resource-server. The complete pom.xml is:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <parent>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-parent</artifactId>
            <version>2.3.1.RELEASE</version>
            <relativePath/>
        </parent>
        <groupId>org.example</groupId>
        <artifactId>OAuth2ResourceServer</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <properties>
            <java.version>1.8</java.version>
        </properties>
        <dependencies>
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-web</artifactId>
            </dependency>
            <dependency>
                <groupId>org.springframework.security</groupId>
                <artifactId>spring-security-oauth2-resource-server</artifactId>
                <version>5.3.2.RELEASE</version>
            </dependency>
            <dependency>
                <groupId>org.springframework.security.oauth.boot</groupId>
                <artifactId>spring-security-oauth2-autoconfigure</artifactId>
                <version>2.2.5.RELEASE</version>
            </dependency>
        </dependencies>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-maven-plugin</artifactId>
                </plugin>
            </plugins>
        </build>
    </project>

Create the Spring Security OAuth2 resource server configuration class, src/main/java/imooc/springsecurity/oauth2/server/config/OAuth2ResourceServerSecurityConfiguration.java. Make it extend org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter and add the @EnableResourceServer annotation to declare this class as the configuration of the OAuth2 resource server. Configure access to the resources in the configure(HttpSecurity http) method; in this example every resource requires an authenticated user. The complete code is:

    package imooc.springsecurity.oauth2.server.config;

    import org.springframework.context.annotation.Configuration;
    import org.springframework.security.config.annotation.web.builders.HttpSecurity;
    import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
    import org.springframework.security.oauth2.config.annotation.web.configuration.EnableResourceServer;

    @Configuration
    @EnableResourceServer
    public class OAuth2ResourceServerSecurityConfiguration extends WebSecurityConfigurerAdapter {

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests(authorizeRequests ->
                    authorizeRequests.anyRequest().authenticated()
                )
                .csrf().disable();
        }
    }

The information about the OAuth2.0 authorization server must be configured in application.yml:

    server:
      port: 8081

    security:
      oauth2:
        client:
          client-id: reader # client identifier, same as configured on the authorization server
          client-secret: secret # client secret, same as configured on the authorization server
          user-authorization-uri: http://localhost:8080/oauth/authorize # client authorization endpoint
          access-token-uri: http://localhost:8080/oauth/token # endpoint where the client obtains a token
        resource:
          id: reader # resource server identifier, fill in according to your business
          token-info-uri: http://localhost:8080/oauth/check_token # endpoint for validating tokens

With this, the core configuration of the resource server is complete.

206 Partial Content

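The client requested only part of a resource on the server, and the server processed the request normally. The response carries the portion of the entity identified by Content-Range.

Client:

    GET /imooc/video.mp4 HTTP/1.1
    Range: bytes=1048576-2097152

Server:

    HTTP/1.1 206 Partial Content
    Content-Range: bytes 1048576-2097152/3145728
    Content-Type: video/mp4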
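The relationship between the Range request header and the Content-Range response header can be sketched in Python. The function names parse_byte_range and partial_content_headers are invented for this sketch, and only the single-range form bytes=start-end from the example is handled; real servers support more forms (open-ended and multiple ranges).

```python
import re


def parse_byte_range(header, total_size):
    """Parse 'bytes=start-end' and return (start, end), both inclusive.

    Only the single-range 'bytes=start-end' form is handled in this sketch.
    """
    match = re.fullmatch(r"bytes=(\d+)-(\d+)", header.strip())
    if not match:
        raise ValueError("unsupported Range header: %r" % header)
    start, end = int(match.group(1)), int(match.group(2))
    if start > end or start >= total_size:
        raise ValueError("range not satisfiable")
    # Clamp the end to the last byte of the resource.
    return start, min(end, total_size - 1)


def partial_content_headers(start, end, total_size, content_type):
    """Build the headers of a 206 Partial Content response."""
    return {
        "Content-Range": "bytes %d-%d/%d" % (start, end, total_size),
        "Content-Length": str(end - start + 1),  # both ends are inclusive
        "Content-Type": content_type,
    }


start, end = parse_byte_range("bytes=1048576-2097152", 3145728)
range_headers = partial_content_headers(start, end, 3145728, "video/mp4")
print(range_headers["Content-Range"])
```

Because both ends of a byte range are inclusive, the body in this example is 1048577 bytes long, one more than the difference between the two offsets.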

4. Demonstration

We add the following logging configuration to nginx.conf:

    ...
    http {
        log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';

        map $status $loggable {
            ~^[34] 0;
            default 1;
        }

        access_log logs/access.log main if=$loggable;

        server {
            listen 8000;
            return 200 '8000, server\n';
        }

        server {
            listen 8001;
            return 300 '8001, server\n';
        }

        server {
            listen 8002;
            return 401 '8002, server\n';
        }
        ...
    }
    ...

This combines what we covered earlier; here we only test the if parameter of the logging configuration. With this setup, requests whose response code is 3xx or 4xx are not logged. Next, start or hot-reload Nginx, send an HTTP request to each of the three ports, and watch access.log:

    [shen@shen ~]$ curl http://180.76.152.113:8000 -I
    HTTP/1.1 200 OK
    Server: nginx/1.17.6
    Date: Tue, 04 Feb 2020 13:31:03 GMT
    Content-Type: application/octet-stream
    Content-Length: 13
    Connection: keep-alive

    [shen@shen ~]$ curl http://180.76.152.113:8001 -I
    HTTP/1.1 300
    Server: nginx/1.17.6
    Date: Tue, 04 Feb 2020 13:31:06 GMT
    Content-Type: application/octet-stream
    Content-Length: 13
    Connection: keep-alive

    [shen@shen ~]$ curl http://180.76.152.113:8002 -I
    HTTP/1.1 401 Unauthorized
    Server: nginx/1.17.6
    Date: Tue, 04 Feb 2020 13:31:08 GMT
    Content-Type: application/octet-stream
    Content-Length: 13
    Connection: keep-alive

    # On the Nginx host, access.log shows that only the request with response code 200 was logged
    [root@server nginx]# tail -f logs/access.log
    171.82.186.225 - - [04/Feb/2020:21:33:24 +0800] "HEAD / HTTP/1.1" 200 0 "-" "curl/7.29.0" "-"
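The decision made by the map block can be mimicked in a few lines of Python, which may help when reasoning about which status codes end up in the log. The function name loggable is borrowed from the variable in the configuration; the code is only an approximation of the mapping, not of Nginx itself.

```python
import re


def loggable(status):
    """Mimic `map $status $loggable`: 0 for 3xx/4xx, 1 for everything else."""
    return 0 if re.search(r"^[34]", str(status)) else 1


# The three test servers from the configuration return 200, 300 and 401.
for status in (200, 300, 401):
    print(status, "logged" if loggable(status) else "skipped")
```

Note that 5xx responses still map to 1 under this rule, so server errors continue to be logged.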

HTTP protocol security

4.1 Maven document configuration

This piece of configuration is a fixed format; it declares that the current document is a Maven configuration document.

Example:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    </project>

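1.3 Returning data

Once the business logic has produced the entity data to return, the server builds the response message in the format defined by the HTTP protocol. The browser in turn parses the received data according to the HTTP protocol before rendering it.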

4.2 Writing the layout

Menus themselves require no layout code. We only need two Views, one bound to the Context Menu and one to the Popup Menu:

    <?xml version="1.0" encoding="utf-8"?>
    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
        xmlns:app="http://schemas.android.com/apk/res-auto"
        xmlns:tools="http://schemas.android.com/tools"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical" >

        <TextView
            android:id="@+id/tv_context"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:paddingBottom="30dp"
            android:text="I have a Context Menu"
            android:textSize="20sp" />

        <Button
            android:id="@+id/bt_popup"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:onClick="pop"
            android:text="I have a Popup Menu" />
    </LinearLayout>

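1.5 The method attribute

Submitting a form simply sends an HTTP request. HTTP defines several request methods, and the method attribute sets which one the form uses. The common values are post and get; when the attribute is not set, get is the default. Beyond these two, HTTP defines further methods whose availability depends on the configuration of the server or HTTP gateway. Browsers themselves submit forms only with get or post, so the following methods are normally issued by scripts or other HTTP clients:

    options   lets the client query the server's configuration
    head      fetches the message headers only, without a body
    delete    asks the server to delete the specified page
    put       asks the server to replace the content of a document
    trace     used to diagnose the server
    connect   asks a proxy server to switch the connection to a tunnel; introduced in HTTP/1.1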

websocket

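The vast majority of requests in web pages use HTTP. HTTP is a stateless application-layer protocol with the advantage of being ready to use immediately: every request is independent of the others. For infrequent requests this is a benefit, because no request context has to be set up first. For scenarios with dense traffic or strict real-time requirements (IM chat, for example), HTTP may fall short, because every new HTTP request adds connection and header overhead on the server. In such cases we can consider a comparatively higher-performance network protocol, the socket, whose web version is called WebSocket.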

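2.6 Tracing access behavior

Run the startup class and visit http://127.0.0.1:8080/login?username=imooc&password=123. The console prints the following:

[Figure: console output]

As the output shows, we have fully traced one access to the http://127.0.0.1:8080/login endpoint.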

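1. Preface

In the previous section we discussed how Spring Security defends against CSRF attacks. In this section we discuss the simplest way to improve the security of a Spring Security web project. Spring Security can improve security through HTTP security response headers, and this section covers how to implement them.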

1. Introduction to the HTTP protocol

4.2 Home page layout

HttpURLConnection needs a trigger, so we place a Button on the home page layout to fire the HTTP request:

    <?xml version="1.0" encoding="utf-8"?>
    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical">

        <Button
            android:id="@+id/start_http"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:layout_gravity="center"
            android:layout_marginTop="100dp"
            android:text="Send HTTP request" />
    </LinearLayout>

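2. About the HTTP firewall

The Servlet specification defines several properties on HttpServletRequest that are accessed through getter methods and used for matching: contextPath, servletPath, pathInfo, and queryString. Spring Security is only concerned with the path part of the application and does not care about contextPath.

The Servlet specification, however, is not precise about servletPath and pathInfo. For example, every path segment of a URL may contain parameters, yet the specification does not state clearly whether these parameters belong in the servletPath or pathInfo values, and different Servlet containers behave differently. When an application is deployed in a container that does not strip path parameters, an attacker can add path parameters to the requested URL and cause a pattern match to succeed or fail unexpectedly. In another case, the path may contain traversal sequences such as /../ or several consecutive forward slashes //, which can also defeat pattern matching. Some containers normalize the path before performing the Servlet mapping, but not all do.

By default, Spring Security's HTTP firewall automatically rejects requests that are not normalized and removes path parameters and duplicate slashes for matching purposes. To keep the application's behavior consistent across environments, we therefore need to use FilterChainProxy to manage the security filter chain. Note also that servletPath and pathInfo are decoded by the container, so the application should avoid valid paths that contain semicolons.

The default path matching strategy uses the Ant style, which is also the most commonly used pattern style. The strategy is implemented by the class AntPathRequestMatcher, which uses Spring's AntPathMatcher to perform a case-insensitive pattern match against servletPath and pathInfo; queryString is not processed. When a more complex matching strategy is needed, such as regular expressions, use RegexRequestMatcher instead.

URL matching is not suitable as the only access control strategy; method security should also be applied at the service layer. Because URLs vary so much, it is hard to cover every case, so the best approach is a whitelist that only allows access to addresses known to be valid.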
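The kind of path checks described above can be illustrated with a small Python predicate. This is a simplified sketch written for this text, not Spring Security's implementation; the function name is_normalized_path and the exact set of rules are assumptions.

```python
def is_normalized_path(path):
    """Return False for paths that a strict HTTP firewall would typically refuse."""
    if ";" in path:
        # Path parameters such as ";jsessionid=..." can confuse pattern matching.
        return False
    if "//" in path:
        # Repeated slashes may be collapsed differently by different containers.
        return False
    # "." and ".." segments indicate directory traversal.
    return not any(segment in (".", "..") for segment in path.split("/"))


for candidate in ("/admin/users", "/public/../admin", "/admin//users", "/admin;x=1/users"):
    print(candidate, is_normalized_path(candidate))
```

Rejecting such requests outright, rather than trying to repair them, keeps the path seen by the matcher identical to the path seen by the application.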

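HTTP status codes: 4XX

A 4XX status code means the request failed, most likely because of a problem on the client side. Client-side errors are numerous and sometimes complicated, so the status codes defined below can often only reflect the general situation rather than the exact cause.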

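3.1 Fetching the baidu.com, taobao.com, and qq.com home pages serially

Write the program serial.py, which fetches the home pages of baidu, taobao, and qq one after another:

    from datetime import datetime
    import requests
    import threading

    def fetch(url):
        response = requests.get(url)
        print('Get %s: %s' % (url, response))

    time0 = datetime.now()

    fetch("https://www.baidu.com/")
    fetch("https://www.taobao.com/")
    fetch("https://www.qq.com/")

    time1 = datetime.now()
    time = time1 - time0
    print(time.microseconds)

Line 5 defines the function fetch, which retrieves the page at the given url. Line 6 calls the get method of the requests module to retrieve the page. Line 9 records the start time. Lines 11 to 13 fetch the home pages of baidu, taobao, and qq serially. Lines 15 to 17 record the end time and compute the time spent; time.microseconds is the elapsed time in microseconds. Note that microseconds holds only the sub-second part of the interval, so the figure is the full duration only when the run takes less than one second. Running serial.py prints:

    Get https://www.baidu.com/: <Response [200]>
    Get https://www.taobao.com/: <Response [200]>
    Get https://www.qq.com/: <Response [200]>
    683173

In the output, <Response [200]> is the response object returned by requests; the 200 inside it is the status code sent by the server and indicates success. The home pages of baidu, taobao, and qq were fetched successfully in a total of 683173 microseconds.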
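When timing code with datetime, it helps to know how a timedelta splits its value. The short Python sketch below uses a made-up interval of 2.683173 seconds (an assumption chosen for illustration) to show the difference between the microseconds attribute and total_seconds().

```python
from datetime import timedelta

# Hypothetical elapsed time: 2 seconds plus 683173 microseconds.
elapsed = timedelta(seconds=2, microseconds=683173)

# .microseconds is only the sub-second component of the interval.
print(elapsed.microseconds)

# .total_seconds() covers the whole interval, including full seconds.
print(elapsed.total_seconds())
```

For measurements that may exceed one second, total_seconds() is the safer value to report.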

2.2 101 Switching Protocols

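The server agrees to the client's request to switch to a different protocol. The most common case is a WebSocket connection.

Client:

    GET /websocket HTTP/1.1
    Host: www.imocc.com
    Upgrade: websocket
    Connection: Upgrade
    Sec-WebSocket-Protocol: chat, superchat
    Sec-WebSocket-Version: 13

The client asks to upgrade the connection from HTTP/1.1 to the WebSocket protocol.

Server:

    HTTP/1.1 101 Switching Protocols
    Upgrade: websocket
    Connection: Upgrade

The server returns 101 to indicate that the protocol switch succeeded.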
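The exchange above is abbreviated. In a complete WebSocket handshake (RFC 6455) the client also sends a Sec-WebSocket-Key header, and the server proves it understood the upgrade by returning a Sec-WebSocket-Accept header derived from that key. The Python sketch below computes this value; the function name websocket_accept is chosen for this example, and the sample key is the one used in the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key."""
    material = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    digest = hashlib.sha1(material).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from RFC 6455.
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
```

The client performs the same calculation and rejects the connection when the returned value does not match.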

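3.3 Testing

First request http://127.0.0.1:8080/info directly. Because we are not logged in yet, the request is intercepted, and the page output is:

[Figure: access intercepted]

If we first call the login method at http://127.0.0.1:8080/login?username=imooc&password=123 and then visit http://127.0.0.1:8080/info, the page output is:

[Figure: after a successful login, the request passes through the interceptor]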

3.2 Code integration

Enable saml2Login() support:

    @EnableWebSecurity
    public class SecurityConfig extends WebSecurityConfigurerAdapter {
        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests()
                    .anyRequest().authenticated()
                    .and()
                .saml2Login() // enable SAML2 authentication support
            ;
        }
    }

Configure the authentication environment for SAML 2.0:

    @EnableWebSecurity
    public class SecurityConfig extends WebSecurityConfigurerAdapter {
        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests()
                    .anyRequest().authenticated()
                    .and()
                .saml2Login()
                    .relyingPartyRegistrationRepository(...) // configure the authentication environment
            ;
        }
    }

In SAML 2.0, the SP and the IDP are both trusted parties, and their mapping is stored in a RelyingPartyRegistration object. The RelyingPartyRegistration is supplied through the .saml2Login().relyingPartyRegistrationRepository() method on the HttpSecurity instance. With this, the most basic SAML 2.0 authentication configuration is complete.

3.1 Redirecting to HTTPS

When a client sends a request to the server over HTTP, Spring Security can automatically redirect it to HTTPS. For example, the following code forces every HTTP request to be redirected to HTTPS:

    @Configuration
    @EnableWebSecurity
    public class WebSecurityConfig extends WebSecurityConfigurerAdapter {

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http.requiresChannel(channel ->
                channel.anyRequest().requiresSecure());
        }
    }

2.2 Attributes of meta

name         describes the web page
content      helps search engines find and classify the page
http-equiv   sets HTTP header information

3. Book crawler: code implementation

Based on the analysis above, we now implement the code. First we fetch all the categories under "computers" together with their URLs:

    import requests
    from lxml import etree

    # `headers` (a dict of request headers) is defined elsewhere in the module

    def get_all_computer_book_urls(page_url):
        """
        Get the URLs of all computer book categories
        :return: (list of category names, list of category URLs)
        """
        response = requests.get(url=page_url, headers=headers)
        if response.status_code != 200:
            return [], []
        response.encoding = 'gbk'
        tree = etree.fromstring(response.text, etree.HTMLParser())
        # Extract the list of category names
        c = tree.xpath("//div[@id='wrap']/ul[1]/li[@class='li']/a/text()")
        # Extract the list of category URLs
        u = tree.xpath("//div[@id='wrap']/ul[1]/li[@class='li']/a/@href")
        return c, u

A quick test of this function:

    [store@server2 chap06]$ python3
    Python 3.6.8 (default, Apr 2 2020, 13:34:55)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from china_pub_crawler import get_all_computer_book_urls
    >>> get_all_computer_book_urls('http://www.china-pub.com/Browse/')
    (['IT图书网络出版 [59-00]', '计算机科学理论与基础知识 [59-01]', '计算机组织与体系结构 [59-02]', '计算机网络 [59-03]', '安全 [59-04]', '软件与程序设计 [59-05]', '软件工程及软件方法学 [59-06]', '操作系统 [59-07]', '数据库 [59-08]', '硬件与维护 [59-09]', '图形图像、多媒体、网页制作 [59-10]', '中文信息处理 [59-11]', '计算机辅助设计与工程计算 [59-12]', '办公软件 [59-13]', '专用软件 [59-14]', '人工智能 [59-15]', '考试认证 [59-16]', '工具书 [59-17]', '计算机控制与仿真 [59-18]', '信息系统 [59-19]', '电子商务与计算机文化 [59-20]', '电子工程 [59-21]', '期刊 [59-22]', '游戏 [59-26]', 'IT服务管理 [59-27]', '计算机文化用品 [59-80]'], ['http://product.china-pub.com/cache/browse2/59/1_1_59-00_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-01_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-02_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-03_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-04_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-05_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-06_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-07_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-08_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-09_0.html', 
    'http://product.china-pub.com/cache/browse2/59/1_1_59-10_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-11_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-12_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-13_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-14_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-15_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-16_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-17_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-18_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-19_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-20_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-21_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-22_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-26_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-27_0.html', 'http://product.china-pub.com/cache/browse2/59/1_1_59-80_0.html'])

As you can see, the function produces the result we want. Next we need a function that collects all the books of a given category. Before that, we first write the method that parses a single book list page:

    def parse_books_page(html_data):
        books = []
        tree = etree.fromstring(html_data, etree.HTMLParser())
        result_tree = tree.xpath("//div[@class='search_result']/table/tr/td[2]/ul")
        for result in result_tree:
            try:
                book_info = {}
                book_info['title'] = result.xpath("./li[@class='result_name']/a/text()")[0]
                book_info['book_url'] = result.xpath("./li[@class='result_name']/a/@href")[0]
                # The second <li> holds "author | publisher | ISBN | publish date"
                info = result.xpath("./li[2]/text()")[0]
                fields = [part.strip() for part in info.split('|')]
                book_info['author'] = fields[0]
                book_info['publisher'] = fields[1]
                book_info['isbn'] = fields[2]
                book_info['publish_date'] = fields[3]
                book_info['vip_price'] = result.xpath("./li[@class='result_book']/ul/li[@class='book_dis']/text()")[0]
                book_info['price'] = result.xpath("./li[@class='result_book']/ul/li[@class='book_price']/text()")[0]
                # print(f'Parsed book info: {book_info}')
                books.append(book_info)
            except Exception as e:
                # A missing field raises an exception; skip this record and continue
                print("Failed to parse a record, skipping it!")
        return books

This function parses one page of the book list. It also locates elements with xpath and then extracts the data we want. Because some of the information is packed into one string, we have to split it after extraction to obtain the individual fields. We can copy the HTML straight from the web page and use it to test the function:

[Figure: extracting the page data of the book list]

Save the page as test.html in the same directory as the code, then work from the command line:

    >>> from china_pub_crawler import parse_books_page
    >>> f = open('test.html', 'r+')
    >>> html_content = f.read()
    >>> parse_books_page(html_content)
    [{'title': '(特价书)零基础学ASP.NET 3.5', 'book_url': 'http://product.china-pub.com/216269', 'author': '王向军;王欣惠（著）', 'publisher': '机械工业出版社', 'isbn': '9787111261414', 'publish_date': '2009-02-01出版', 'vip_price': 'VIP会员价：', 'price': '￥58.00'}, {'title': 'Objective-C 2.0 Mac和iOS开发实践指南(原书第2版)', 'book_url': 'http://product.china-pub.com/3770704', 'author': '(美)Robert Clair （著）', 'publisher': '机械工业出版社', 'isbn': '9787111484561', 'publish_date': '2015-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥79.00'}, {'title': '(特价书)ASP.NET 3.5实例精通', 'book_url': 'http://product.china-pub.com/216272', 'author': '王院峰（著）', 'publisher': '机械工业出版社', 'isbn': '9787111259794', 'publish_date': '2009-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥55.00'}, {'title': '(特价书)CSS+HTML语法与范例详解词典', 'book_url': 'http://product.china-pub.com/216275', 'author': '符旭凌（著）', 'publisher': '机械工业出版社', 'isbn': '9787111263647', 'publish_date': '2009-02-01出版', 'vip_price': 'VIP会员价：', 'price': '￥39.00'}, {'title': '(特价书)Java ME 游戏编程(原书第2版)', 'book_url': 'http://product.china-pub.com/216296', 'author': '(美)Martin J. Wells; John P. 
Flynt （著）', 'publisher': '机械工业出版社', 'isbn': '9787111264941', 'publish_date': '2009-03-01出版', 'vip_price': 'VIP会员价：', 'price': '￥49.00'}, {'title': '(特价书)Visual Basic实例精通', 'book_url': 'http://product.china-pub.com/216304', 'author': '柴相花（著）', 'publisher': '机械工业出版社', 'isbn': '9787111263296', 'publish_date': '2009-04-01出版', 'vip_price': 'VIP会员价：', 'price': '￥59.80'}, {'title': '高性能电子商务平台构建：架构、设计与开发[按需印刷]', 'book_url': 'http://product.china-pub.com/3770743', 'author': 'ShopNC产品部（著）', 'publisher': '机械工业出版社', 'isbn': '9787111485643', 'publish_date': '2015-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥79.00'}, {'title': '[套装书]Java核心技术卷Ⅰ 基础知识（原书第10版）+Java核心技术卷Ⅱ高级特性（原书第10版）', 'book_url': 'http://product.china-pub.com/7008447', 'author': '（美）凯S.霍斯特曼（Cay S. Horstmann）????（美）凯S. 霍斯特曼（Cay S. Horstmann）（著）', 'publisher': '机械工业出版社', 'isbn': '9787007008447', 'publish_date': '2017-08-01出版', 'vip_price': 'VIP会员价：', 'price': '￥258.00'}, {'title': '(特价书)Dojo构建Ajax应用程序', 'book_url': 'http://product.china-pub.com/216315', 'author': '(美)James E.Harmon （著）', 'publisher': '机械工业出版社', 'isbn': '9787111266648', 'publish_date': '2009-05-01出版', 'vip_price': 'VIP会员价：', 'price': '￥45.00'}, {'title': '(特价书)编译原理第2版.本科教学版', 'book_url': 'http://product.china-pub.com/216336', 'author': '(美)Alfred V. Aho;Monica S. Lam;Ravi Sethi;Jeffrey D. 
Ullman （著）', 'publisher': '机械工业出版社', 'isbn': '9787111269298', 'publish_date': '2009-05-01出版', 'vip_price': 'VIP会员价：', 'price': '￥55.00'}, {'title': '(特价书)用Alice学编程(原书第2版)', 'book_url': 'http://product.china-pub.com/216354', 'author': '(美)Wanda P.Dann;Stephen Cooper;Randy Pausch （著）', 'publisher': '机械工业出版社', 'isbn': '9787111274629', 'publish_date': '2009-07-01出版', 'vip_price': 'VIP会员价：', 'price': '￥39.00'}, {'title': 'Java语言程序设计(第2版)', 'book_url': 'http://product.china-pub.com/50051', 'author': '赵国玲;王宏;柴大鹏（著）', 'publisher': '机械工业出版社*', 'isbn': '9787111297376', 'publish_date': '2010-03-01出版', 'vip_price': 'VIP会员价：', 'price': '￥32.00'}, {'title': '从零开始学Python程序设计', 'book_url': 'http://product.china-pub.com/7017939', 'author': '吴惠茹（著）', 'publisher': '机械工业出版社', 'isbn': '9787111583813', 'publish_date': '2018-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥79.00'}, {'title': '(特价书)汇编语言', 'book_url': 'http://product.china-pub.com/216385', 'author': '郑晓薇（著）', 'publisher': '机械工业出版社', 'isbn': '9787111269076', 'publish_date': '2009-09-01出版', 'vip_price': 'VIP会员价：', 'price': '￥29.00'}, {'title': '(特价书)Visual Basic.NET案例教程', 'book_url': 'http://product.china-pub.com/216388', 'author': '马玉春;刘杰民;王鑫（著）', 'publisher': '机械工业出版社', 'isbn': '9787111272571', 'publish_date': '2009-09-01出版', 'vip_price': 'VIP会员价：', 'price': '￥30.00'}, {'title': '小程序从0到1：微信全栈工程师一本通', 'book_url': 'http://product.china-pub.com/7017943', 'author': '石桥码农（著）', 'publisher': '机械工业出版社', 'isbn': '9787111584049', 'publish_date': '2018-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥59.00'}, {'title': '深入分布式缓存：从原理到实践', 'book_url': 'http://product.china-pub.com/7017945', 'author': '于君泽（著）', 'publisher': '机械工业出版社', 'isbn': '9787111585190', 'publish_date': '2018-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥99.00'}, {'title': '(特价书)ASP.NET AJAX服务器控件高级编程(.NET 3.5版)', 'book_url': 'http://product.china-pub.com/216397', 'author': '(美)Adam Calderon;Joel Rumerman （著）', 'publisher': '机械工业出版社', 'isbn': '9787111270966', 'publish_date': 
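'2009-09-01出版', 'vip_price': 'VIP会员价：', 'price': '￥65.00'}, {'title': 'PaaS程序设计', 'book_url': 'http://product.china-pub.com/3770830', 'author': '(美)Lucas Carlson （著）', 'publisher': '机械工业出版社', 'isbn': '9787111482451', 'publish_date': '2015-01-01出版', 'vip_price': 'VIP会员价：', 'price': '￥39.00'}, {'title': 'Visual C++数字图像处理[按需印刷]', 'book_url': 'http://product.china-pub.com/2437', 'author': '何斌马天予王运坚朱红莲（著）', 'publisher': '人民邮电出版社', 'isbn': '711509263X', 'publish_date': '2001-04-01出版', 'vip_price': 'VIP会员价：', 'price': '￥72.00'}]

The book list information is extracted correctly, which confirms that the function works. Parsing can still hit exceptions, for example when a field is missing, so we catch the exception and skip that record, letting the program continue instead of stopping. With this in place, we construct the URL for each page number to collect the data of every page and thus read all the books of the category. The complete code is:

    import re
    import time

    def get_category_books(category, url):
        """
        Get the books of a category. The category is paginated, so we keep
        requesting pages until a page request no longer returns 200
        (for example 404), then stop.
        :return: list of book dicts
        """
        books = []
        page = 1
        regex = r"(http://.*/)([0-9]+)_(.*)\.html"
        pattern = re.compile(regex)
        m = pattern.match(url)
        if not m:
            return []
        prefix_path = m.group(1)
        # group(2) is a string; convert it before comparing with a number
        current_page = int(m.group(2))
        if current_page != 1:
            print("Extraction does not start from the first page; there may be a problem")
        suffix_path = m.group(3)
        current_page = page
        while True:
            # Build the URL of the page request
            book_url = f"{prefix_path}{current_page}_{suffix_path}.html"
            response = requests.get(url=book_url, headers=headers)
            print(f"Extracting page {current_page} of category [{category}]")
            if response.status_code != 200:
                print(f"[{category}] all book data of this category has been extracted!")
                break
            response.encoding = 'gbk'
            # Add the data of this page to the list
            books.extend(parse_books_page(response.text))
            current_page += 1
            # Always pause a little to avoid putting too much load on the remote service
            time.sleep(0.5)
        return books

Finally we save the data to MongoDB. This step is simple: we have already inserted documents into MongoDB before and can reuse that code directly:

    client = pymongo.MongoClient(host='MongoDB server address', port=27017)
    # Note: Database.authenticate() was removed in PyMongo 4.0; with newer
    # versions, pass username and password to MongoClient instead
    client.admin.authenticate("admin", "shencong1992")
    db = client.scrapy_manual
    collection = db.china_pub

    # ...

    def save_to_mongodb(data):
        try:
            collection.insert_many(data)
        except Exception as e:
            print("Bulk insert failed: {}".format(str(e)))

Because the earlier steps already produce the data as a list of JSON-like records, the collection's insert_many() method can insert everything collected into MongoDB in one batch. At the end of the code we add a main section:

    # ...

    if __name__ == '__main__':
        page_url = "http://www.china-pub.com/Browse/"
        categories, urls = get_all_computer_book_urls(page_url)
        # print(categories)
        for category, url in zip(categories, urls):
            books_category_data = get_category_books(category, url)
            print(f"Saving the books of [{category}] to mongodb")
            save_to_mongodb(books_category_data)
        print("Finished crawling the computer category of china-pub")

That completes a simple crawler. What are you waiting for? Start it up!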
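The page URL construction used by the crawler can be checked in isolation, without any network access. The helper build_page_url below is a name introduced for this sketch; it applies the same regular expression idea as get_category_books to one of the category URLs shown in the earlier output.

```python
import re

# Same idea as the crawler's pattern: prefix, page number, "_", suffix, ".html".
PAGE_URL = re.compile(r"(http://.*/)([0-9]+)_(.*)\.html")


def build_page_url(first_page_url, page):
    """Return the URL of the given page number within the same category."""
    match = PAGE_URL.match(first_page_url)
    if not match:
        raise ValueError("unexpected category URL: %r" % first_page_url)
    prefix, _current_page, suffix = match.groups()
    return "%s%d_%s.html" % (prefix, page, suffix)


first = "http://product.china-pub.com/cache/browse2/59/1_1_59-00_0.html"
print(build_page_url(first, 2))
```

Only the number directly after the last slash changes between pages; the rest of the file name identifies the category.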
