
Web Scraping with Ghost.py, a WebKit-Based Client [Scraper Example]

Tags:

Python

For those of us who have to scrape web pages every now and then, the job can be painful.

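Modern web development leans heavily on AJAX, JavaScript, and CSS, so important data on some sites is generated dynamically by Ajax or JavaScript and cannot be obtained simply by parsing the HTML the server returns (for example with urllib2, mechanize, lxml, or Beautiful Soup). To scrape such pages, a crawler has to support JavaScript execution, the DOM, and HTML parsing.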
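To make the limitation concrete, here is a minimal sketch that uses only Python's standard library, with html.parser standing in for lxml or Beautiful Soup. The HTML snippet, the element id `price`, and the `TextById` class are all invented for illustration. A static parser only sees the markup the server sent, so the element that JavaScript would fill in is still empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the <span> stays empty until the script runs in a browser.
RAW_HTML = """
<html><body>
  <span id="price"></span>
  <script>document.getElementById('price').textContent = '42';</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collects the text found directly inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Good enough for this flat snippet: any closing tag ends the capture.
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = TextById("price")
parser.feed(RAW_HTML)
static_text = parser.text.strip()
print(repr(static_text))  # prints '' because the script was never executed
```

A browser engine would run the script and expose '42' in the DOM, which is the gap a WebKit-based client fills.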

For example, the data on a monitoring dashboard cannot be retrieved with plain curl or urllib.

[Image: 153204615.jpg]

There is also this AJAX-rendered page, which urllib2 cannot parse directly.

http://rfyiamcool.blog.51cto.com/blog/1030776/1287810

[Image: 153356238.jpg]

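Common ways to scrape data:

urllib2+urlparse+re

The most basic approach: urllib2 is Python's HTTP library, urlparse handles URLs, and re is the regular-expression library. It is tedious to write, but it is also the most down-to-earth.

urllib2+beautifulsoup

The workhorse here is BeautifulSoup, which parses HTML pages very effectively and spares you from writing tedious regular expressions with re.

Mechanize+BeautifulSoup

Mechanize replaces part of urllib2's functionality. It can open URLs of any scheme, not just http, and it is more dynamically configurable.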
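As a rough sketch of the first approach in the list above: the post targets Python 2, where the modules are urllib2 and urlparse; in Python 3 they live on as urllib.request and urllib.parse, which is what this sketch uses. The sample markup and the example.com URLs are placeholders, and `fetch` is defined but not called so the snippet runs offline.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as text (not called below)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Pull href values out with a regex and make them absolute."""
    hrefs = re.findall(r'href=["\']([^"\']+)["\']', html)
    return [urljoin(base_url, href) for href in hrefs]


# Placeholder markup standing in for a downloaded page.
SAMPLE_HTML = (
    '<a href="/about.html">About</a> '
    '<a href="http://example.com/blog">Blog</a>'
)
links = extract_links(SAMPLE_HTML, "http://example.com/index.html")
print(links)
```

The regex is exactly the "tedious but down-to-earth" part: it works on simple markup and becomes fragile quickly, which is why the later approaches swap in a real HTML parser.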


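For pages like the one above, if you don't mind the hassle, you can dig through the page for the underlying data endpoints. What you download from them is mostly XML, which you then have to parse laboriously. It really is too much trouble.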
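If you do go the find-the-endpoint route, the parsing step might look like the sketch below. The XML payload, the element name `point`, and the attribute names are made up for illustration; a real endpoint will have its own structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body from a data endpoint found in the page's scripts.
XML_PAYLOAD = """<?xml version="1.0"?>
<metrics host="web01">
  <point time="1378201200" value="0.42"/>
  <point time="1378201260" value="0.57"/>
</metrics>
"""


def parse_points(xml_text):
    """Return a list of (timestamp, value) tuples from the payload."""
    root = ET.fromstring(xml_text)
    return [
        (int(point.get("time")), float(point.get("value")))
        for point in root.iter("point")
    ]


points = parse_points(XML_PAYLOAD)
print(points)
```

This works, but every site needs its own endpoint hunt and its own parser, which is the overhead the paragraph above complains about.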

This is where a web client built on the WebKit engine helps: it processes the page the way a real browser does.

WebKit: Safari, Google Chrome, Maxthon 3, 360 Browser, and others were all built on the WebKit engine (Chrome has since moved to Blink, a fork of WebKit).

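We usually pull the values from a terminal rather than a GUI, and there are quite a few ready-made tools for that:

PyV8, PythonWebKit, Selenium, PhantomJS, Ghost.py, and so on.

I recommend Ghost.py here because it is direct and practical.

There is little Chinese-language material on WebKit and even less on Ghost.py, so what follows is a rough translation based on the official documentation.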


A small example to get a feel for Ghost:

from ghost import Ghost

ghost = Ghost()
# open() returns the main page resource plus the extra resources it loaded
page, extra_resources = ghost.open("http://xiaorui.cc")
assert page.http_status == 200 and 'xiaorui' in ghost.content

[Image: 173049454.png]

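Installing Ghost.py and its dependencies

To use WebKit we need PyQt or PySide.

Install one of them first, then continue.

If you are lucky, this just works: pip install Ghost.py

If you are not:

Expect to run into plenty of annoying problems along the way; a web search usually helps.

If you still cannot solve them, leave a comment.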

# Build and install SIP (needed to build PyQt4)
wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.6/sip-4.14.6.tar.gz
tar zxvf sip-4.14.6.tar.gz
cd sip-4.14.6
python configure.py
make
sudo make install
cd ..

# Build and install PyQt4 (Mac source package)
wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.10.1/PyQt-mac-gpl-4.10.1.tar.gz
tar zxvf PyQt-mac-gpl-4.10.1.tar.gz
cd PyQt-mac-gpl-4.10.1
python configure.py
make
sudo make install
cd ..

# Or install PySide from the prebuilt Mac package
wget http://pyside.markus-ullmann.de/pyside-1.1.1-qt48-py27apple.pkg
open pyside-1.1.1-qt48-py27apple.pkg

# Flask is used by the GhostTestCase example further down
git clone https://github.com/mitsuhiko/flask.git
cd flask
sudo python setup.py install
cd ..

# Finally, Ghost.py itself
git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
sudo python setup.py install

Create an instance:

from ghost import Ghost

ghost = Ghost()

Open a page:

page, resources = ghost.open('http://my.web.page')

Run JavaScript inside the page:

result, resources = ghost.evaluate(
    "document.getElementById('my-input').getAttribute('value');")

Simulate a click event:

page, resources = ghost.evaluate(
    "document.getElementById('link').click();", expect_loading=True)

Set the value of a form field with set_field_value(selector, value, blur=True, expect_loading=False):

result, resources = ghost.set_field_value("input[name=username]", "jeanphix")

If you set the optional parameter `blur` to False, the focus is left on the field (useful for autocomplete tests).

To fill a file input field, simply pass the file path as `value`.

You can fill in a whole form with Ghost.fill(selector, values, expect_loading=False):

result, resources = ghost.fill("form", {
    "username": "jeanphix",
    "password": "mypassword"
})

Submit the form:

page, resources = ghost.fire_on("form", "submit", expect_loading=True)

Here are the definitions of the more advanced methods:

[Image: 173358913.png]

Several of them are very handy:

wait_for_page_loaded()

Waits until a new page is loaded.

page, resources = ghost.wait_for_page_loaded()

This waits for the page to finish loading, much like jQuery's

$(document).ready(function() { ... })

wait_for_selector(selector)

Waits until an element matches the given selector.

result, resources = ghost.wait_for_selector("ul.results")

In other words, it waits for the DOM element you specify to appear.

wait_for_text(text)

Waits until the given text exists inside the frame.

result, resources = ghost.wait_for_text("My result")

In other words, it waits for the text we want to appear.
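Conceptually, all three helpers do the same thing: they keep checking a condition until it holds or a timeout expires. The sketch below shows that polling pattern in plain Python. It is not Ghost.py's implementation; `wait_for`, `WaitTimeout`, and the simulated page state are names invented here for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became true (hypothetical name)."""


def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.2fs" % timeout)
        time.sleep(interval)


# Simulated page whose content "arrives" on the third poll.
polls = {"count": 0}


def results_loaded():
    polls["count"] += 1
    return "ul.results" if polls["count"] >= 3 else None


found = wait_for(results_loaded, timeout=1.0, interval=0.01)
print(found, polls["count"])
```

In a real headless browser the loop also has to let the engine process network and rendering events between checks; the sketch leaves that part out.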

The official site includes a Flask example:

Ghost.py can be combined with unittest to test an application:

import unittest

from flask import Flask
from ghost import GhostTestCase

app = Flask(__name__)


@app.route('/')
def home():
    return 'hello world'


class MyTest(GhostTestCase):
    port = 5000

    @classmethod
    def create_app(cls):
        return app

    def test_open_home(self):
        self.ghost.open("http://localhost:%s/" % self.port)
        self.assertEqual(self.ghost.content, 'hello world')


if __name__ == '__main__':
    unittest.main()

~~~A complete little demo~~~

from ghost import Ghost

ghost = Ghost()

# Opens the web page
ghost.open('http://www.openstreetmap.org/')
# Waits for form search field
ghost.wait_for_selector('input[name=query]')
# Fills the form
ghost.fill("#search_form", {'query': 'France'})
# Submits the form
ghost.fire_on("#search_form", "submit")
# Waits for results (an XHR has been called here)
ghost.wait_for_selector(
    '#search_osm_nominatim .search_results_entry a')
# Clicks first result link
ghost.click(
    '#search_osm_nominatim .search_results_entry:first-child a')
# Checks if map has moved to expected latitude
lat, resources = ghost.evaluate("map.center.lat")
assert float(lat.toString()) == 5860090.806537
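The final assertion in the demo compares floats with ==, which fails as soon as the map reports a slightly different value. A tolerance-based check is more forgiving. The sketch below uses math.isclose; the helper name `lat_matches` and the tolerance are assumptions, and the string literals stand in for whatever ghost.evaluate would return.

```python
import math

EXPECTED_LAT = 5860090.806537  # the latitude the demo expects


def lat_matches(actual, expected=EXPECTED_LAT, abs_tol=1e-3):
    """True when the two coordinates agree to within abs_tol."""
    return math.isclose(float(actual), expected, rel_tol=0.0, abs_tol=abs_tol)


print(lat_matches("5860090.8065371"))  # True: differs only in the 7th decimal
print(lat_matches("5860100.0"))        # False: off by about nine units
```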

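Now for a real example.

Let's simulate a browser going to Baidu and searching for xiaorui.cc, then look at the page content and the HTTP headers:

[Image: 182057256.png]

In the terminal:

[Image: 182759722.png]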

The resulting URL is:

http://www.baidu.com/s?wd=xiaorui.cc&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8
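The query string of that URL can be picked apart with the standard library. In Python 2 this is the urlparse module mentioned earlier; the sketch uses its Python 3 home, urllib.parse. The variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_URL = ("http://www.baidu.com/s?wd=xiaorui.cc&rsv_bp=0&ch=&tn=baidu"
              "&bar=&rsv_spt=3&ie=utf-8")

parts = urlsplit(SEARCH_URL)
# keep_blank_values=True keeps empty parameters such as ch= and bar=.
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.netloc, parts.path)
print(params["wd"], params["ie"], params["ch"])
```

The `wd` parameter carries the search term, so the form submission in the browser ended up as an ordinary GET request.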

Let's open it and look at its HTTP headers:

In [10]: print page.headers

{u'BDQID': u'0xf594a31a03344b4f', u'Content-Encoding': u'gzip', u'Set-Cookie': u'BDSVRTM=381; path=/\nH_PS_PSSID=2976_2981_3091; path=/; domain=.baidu.com', u'BDUSERID': u'0', u'Server': u'BWS/1.0', u'Connection': u'Keep-Alive', u'Cache-Control': u'private', u'Date': u'Tue, 03 Sep 2013 09:53:56 GMT', u'Content-Type': u'text/html;charset=utf-8', u'BDPAGETYPE': u'3'}
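Note that the Set-Cookie value in that output holds two cookies joined by a newline. A minimal sketch for splitting them apart with http.cookies (the Cookie module in Python 2) follows; the header string is copied from the output above, and `parse_set_cookie` is a helper name invented here.

```python
from http.cookies import SimpleCookie

SET_COOKIE = ("BDSVRTM=381; path=/\n"
              "H_PS_PSSID=2976_2981_3091; path=/; domain=.baidu.com")


def parse_set_cookie(header_value):
    """Return {name: (value, domain)} for newline-joined Set-Cookie values."""
    cookies = {}
    for line in header_value.split("\n"):
        jar = SimpleCookie()
        jar.load(line)
        for name, morsel in jar.items():
            # morsel["domain"] is an empty string when no domain was sent
            cookies[name] = (morsel.value, morsel["domain"])
    return cookies


cookies = parse_set_cookie(SET_COOKIE)
print(cookies)
```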

Its content:

[Image: 183150831.png]

That's it for now. For the more detailed features, see the official site.


Author: holdtom
