
How to use APScheduler with Scrapy

Stack Overflow user
Asked on 2015-04-21 14:57:45
1 answer · 2.1K views · 0 followers · Score: 4

I took the code for running a Scrapy crawler from a script (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script), but it doesn't work.

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings

def run():
    spider = EgovSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()


from apscheduler.schedulers.twisted import TwistedScheduler
sched = TwistedScheduler()
sched.add_job(run, 'interval', seconds=10)
sched.start()

My spider:

import datetime

import scrapy

class EgovSpider(scrapy.Spider):
    name = 'egov'
    start_urls = ['http://egov-buryatia.ru/index.php?id=1493']

    def parse(self, response):
        data = response.xpath("//div[@id='main_wrapper_content_news']//tr//text()").extract()
        print data
        print response.url
        f = open("vac.txt", "a")
        for d in data:
            f.write(d.encode("utf-8") + "\n")
        # 'now' was undefined in the original; write the current timestamp
        f.write(str(datetime.datetime.now()))
        f.close()

If I move the "reactor.run()" line outside the function (as below), the crawler runs only once, after 10 seconds:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings

def run():
    spider = EgovSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()

from apscheduler.schedulers.twisted import TwistedScheduler
sched = TwistedScheduler()
sched.add_job(run, 'interval', seconds=10)
sched.start()
reactor.run()

I have little experience with Python and English :) Please help me.


1 Answer

Stack Overflow user

Answered on 2017-02-15 16:32:11

I ran into the same problem today. Here is some information.

A Twisted reactor cannot be restarted once it has run and been stopped. Instead, you should start one long-running reactor and periodically add crawler tasks to it.
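This "one long-lived event loop with jobs re-queued into it" idea can be sketched with the standard library alone (no Twisted or APScheduler required); crawl_once below is a hypothetical stand-in for kicking off a crawl, not a Scrapy API:

import sched
import time

runs = []

def crawl_once(scheduler, remaining):
    # Hypothetical stand-in for starting one crawl.
    runs.append(time.time())
    if remaining > 1:
        # Re-arm the job inside the still-running loop,
        # instead of stopping and restarting the loop itself.
        scheduler.enter(0.01, 1, crawl_once, (scheduler, remaining - 1))

s = sched.scheduler(time.time, time.sleep)
s.enter(0.01, 1, crawl_once, (s, 3))
s.run()  # one long-running loop; it is never restarted
print(len(runs))

This is exactly the shape of the fix below: the reactor (here, the sched loop) runs once for the program's lifetime, and the scheduler keeps feeding crawl jobs into it.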

To simplify the code further, you can use CrawlerProcess.start(), which includes the reactor.run() call.

from scrapy.crawler import CrawlerProcess
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

process = CrawlerProcess(get_project_settings())
sched = TwistedScheduler()
sched.add_job(process.crawl, 'interval', args=[EgovSpider], seconds=10)
sched.start()
process.start(False)    # Do not stop reactor after spider closes
Votes: 2
The original content of this page was provided by Stack Overflow; translation supported by Tencent Cloud.
Original link:

https://stackoverflow.com/questions/29765039
