
How to run Scrapy from a Python script

智慧大石 2019-12-26 10:47:00
I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two resources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the sample code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the
# scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted,
# in which you cannot restart an already running reactor or in this case a
# scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a)
# is the mailing-list discussion for this snippet.

#!/usr/bin/python

import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') # Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():
    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)
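(Editor's note for readers on a current Scrapy release: the snippet above targets Scrapy v0.x APIs such as scrapy.conf, scrapy.xlib.pydispatch and project.crawler, which no longer exist. Its core idea, running each crawl in a throwaway child process so that Twisted's non-restartable reactor never has to restart, can be sketched against today's CrawlerProcess API roughly as follows. This is only an illustration: run_spider, _crawl and the settings values are made-up names for this sketch, not Scrapy API.)

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def _crawl(spider_cls):
    # Child-process entry point: build a fresh CrawlerProcess per crawl.
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})  # illustrative settings
    process.crawl(spider_cls)
    process.start()  # blocks this child process until the crawl finishes

def run_spider(spider_cls):
    # Hypothetical helper: each call gets its own process, so the parent
    # script can trigger as many crawls as it likes, one after another.
    p = Process(target=_crawl, args=(spider_cls,))
    p.start()
    p.join()

Note that the spider class passed in must be defined at module level so that multiprocessing can pickle a reference to it on spawn-based platforms.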

3 Answers

LEATH


All the other answers reference Scrapy v0.x. Per the updated docs, Scrapy 1.0 requires:


import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
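To make this concrete for the original question (where the spider definition goes, and how to drive it from main), here is a minimal self-contained sketch. The spider name, start URL and CSS selectors are illustrative placeholders, not part of the answer above:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider; name, start_urls and selectors are placeholders.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)  # pass the spider class, not an instance
    process.start()              # blocks until the crawl is finished

Saved as a single file, this runs with plain python script.py, with no scrapy CLI and no project scaffolding needed.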


Answered 2019-12-26