
How to maintain state when the spider dies suddenly?

Stack Overflow user
Asked on 2016-08-30 07:50:15
1 answer · viewed 431 times · 0 followers · 1 vote

This question is about a Scrapy spider not storing its state (persistent state).

I followed the link below to keep the spider's state: http://doc.scrapy.org/en/latest/topics/jobs.html

Now, this works very well when the spider ends properly, i.e. when it is interrupted gracefully with a single Ctrl+C.
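For context, the persistence setup from that page boils down to launching the spider with a JOBDIR setting; anything the spider puts into its state dict is then saved between runs. A minimal sketch, with the spider name and URL made up for illustration:

# Enable persistence (per the Scrapy jobs docs):
#   scrapy crawl somespider -s JOBDIR=crawls/somespider-1
# A single Ctrl+C triggers a graceful shutdown that saves the pending
# request queue to JOBDIR; running the same command again resumes it.

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # self.state is a dict that Scrapy persists across runs
        # whenever JOBDIR is set (via the SpiderState extension).
        self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1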

I noticed, however, that the state is not kept when:

  1. You hit Ctrl+C more than once.
  2. The server's capacity is hit.
  3. The spider ends abruptly for any other reason.

When the spider is run again, it shuts itself down at the very first URL it crawls.

How can I achieve a persistent state for the spider when any of the above happens? Otherwise it crawls the whole pile of URLs all over again.

Log from when the spider is run again:

2016-08-30 08:14:11 [scrapy] INFO: Scrapy 1.1.2 started (bot: maxverstappen)
2016-08-30 08:14:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'maxverstappen.spiders', 'SPIDER_MODULES': ['maxverstappen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'maxverstappen'}
2016-08-30 08:14:11 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.spiderstate.SpiderState']
2016-08-30 08:14:11 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-30 08:14:11 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-30 08:14:12 [scrapy] INFO: Enabled item pipelines:
['maxverstappen.pipelines.MaxverstappenPipeline']
2016-08-30 08:14:12 [scrapy] INFO: Spider opened
2016-08-30 08:14:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-30 08:14:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/robots.txt> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/robots.txt> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.inautonews.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.newsnow.co.uk': <GET http://www.newsnow.co.uk/h/Life+&+Style/Motoring>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.americanmuscle.com': <GET http://www.americanmuscle.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.extremeterrain.com': <GET http://www.extremeterrain.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.autoanything.com': <GET http://www.autoanything.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.bmwcoop.com': <GET http://www.bmwcoop.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.automotorblog.com': <GET http://www.automotorblog.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/inautonews>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/inautonews>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET https://plus.google.com/+Inautonewsplus>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.histats.com': <GET http://www.histats.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.hamiltonf1site.com': <GET http://www.hamiltonf1site.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.joshwellsracing.com': <GET http://www.joshwellsracing.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jensonbuttonfan.net': <GET http://www.jensonbuttonfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.fernandoalonsofan.net': <GET http://www.fernandoalonsofan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.markwebberfan.net': <GET http://www.markwebberfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.felipemassafan.net': <GET http://www.felipemassafan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nicorosbergfan.net': <GET http://www.nicorosbergfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nickheidfeldfan.net': <GET http://www.nickheidfeldfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.lewishamiltonblog.net': <GET http://www.lewishamiltonblog.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.timoglockfan.net': <GET http://www.timoglockfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jarnotrullifan.net': <GET http://www.jarnotrullifan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.brunosennafan.net': <GET http://www.brunosennafan.net/>
2016-08-30 08:14:12 [scrapy] INFO: Closing spider (finished)
2016-08-30 08:14:12 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 896,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 35353,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 4,
 'dupefilter/filtered': 149,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 724932),
 'log_count/DEBUG': 28,
 'log_count/INFO': 7,
 'offsite/domains': 22,
 'offsite/filtered': 23,
 'request_depth_max': 1,
 'response_received_count': 4,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/disk': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/disk': 2,
 'start_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 13456)}
2016-08-30 08:14:12 [scrapy] INFO: Spider closed (finished)

1 Answer

Stack Overflow user

Posted on 2016-08-30 08:07:18

One way to do this is to separate the discovery and consumer logic into two spiders: one discovers the product URLs, the other consumes those URLs and returns a result for each. If the consumer dies mid-run for whatever reason, it can easily resume the crawl, because the discovery queue is unaffected by the crash.

There is already a great tool from the Scrapy folks that does exactly this. It is called Frontera.

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritizes the links extracted by the crawler to decide which pages to visit next, and is capable of doing this in a distributed manner.
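For reference, wiring Frontera into a Scrapy project was, per the Frontera documentation of that era, mostly a matter of settings; the module paths below are taken from those docs but may differ between versions, so treat this as a sketch rather than a drop-in config:

# settings.py -- Frontera/Scrapy integration sketch; verify the
# paths against your installed Frontera version.

# Let Frontera's frontier drive scheduling instead of Scrapy's scheduler.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module holding Frontera-specific settings (backend choice, etc.);
# 'myproject.frontera_settings' is a hypothetical name.
FRONTERA_SETTINGS = 'myproject.frontera_settings'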

This sounds complex, but it is actually quite straightforward. That said, if you are running something small-scale and self-contained, you may prefer to handle it manually: run the discovery spider and output its results as JSON, then have the consumer spider parse that JSON in a persistent way (i.e. pop values out of it), as in the sketch below.
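A minimal sketch of that manual split, assuming the discovery spider yields items of the form {'url': ...} and its output was dumped to a JSON-lines file first (e.g. scrapy crawl discovery -o product_urls.jl; all spider and file names here are made up). The consumer pops each URL out of the file before scheduling it, so on a restart it never re-crawls URLs that were already taken; at most the few URLs in flight at crash time are lost:

import json
import scrapy

class ConsumerSpider(scrapy.Spider):
    """Consumes product URLs from a JSON-lines file written by the
    discovery spider, removing each one before it is scheduled."""
    name = 'consumer'
    urls_file = 'product_urls.jl'  # hypothetical output of the discovery run

    def start_requests(self):
        while True:
            url = self._pop_url()
            if url is None:
                break
            yield scrapy.Request(url, callback=self.parse_product)

    def _pop_url(self):
        # Read the pending URLs and rewrite the file without the first
        # one. Fine at small scale; not safe with concurrent writers.
        try:
            with open(self.urls_file) as f:
                lines = [line for line in f if line.strip()]
        except IOError:
            return None
        if not lines:
            return None
        with open(self.urls_file, 'w') as f:
            f.writelines(lines[1:])
        return json.loads(lines[0])['url']

    def parse_product(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }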

0 votes
Original content provided by Stack Overflow. Original link:

https://stackoverflow.com/questions/39221755
