I'm crawling 28 million pages. My Scrapy spider starts out fast and gradually slows down. I suspected the server was blocking me, but I can start a second spider and it runs fast again, so that doesn't seem to be it. It isn't hardware either; it's running on a decent VPS with 24 GB of RAM. `allowed_domains` is limited to that one site. What could be causing the slowdown?
If I stop the job and resume it immediately, it starts out fast again.
2022-11-11 12:25:39 [scrapy.core.engine] INFO: Spider opened
2022-11-11 12:25:39 [scrapy.core.scheduler] INFO: Resuming crawl (97145 requests scheduled)
2022-11-11 12:25:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-11 12:25:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-11 12:26:39 [scrapy.extensions.logstats] INFO: Crawled 1633 pages (at 1633 pages/min), scraped 1629 items (at 1629 items/min)
2022-11-11 12:27:39 [scrapy.extensions.logstats] INFO: Crawled 3242 pages (at 1609 pages/min), scraped 3238 items (at 1609 items/min)
2022-11-11 12:28:39 [scrapy.extensions.logstats] INFO: Crawled 4736 pages (at 1494 pages/min), scraped 4733 items (at 1495 items/min)
2022-11-11 12:29:40 [scrapy.extensions.logstats] INFO: Crawled 5914 pages (at 1178 pages/min), scraped 5906 items (at 1173 items/min)
2022-11-11 12:30:39 [scrapy.extensions.logstats] INFO: Crawled 7198 pages (at 1284 pages/min), scraped 7190 items (at 1284 items/min)
2022-11-11 12:31:40 [scrapy.extensions.logstats] INFO: Crawled 8417 pages (at 1219 pages/min), scraped 8408 items (at 1218 items/min)
2022-11-11 12:32:40 [scrapy.extensions.logstats] INFO: Crawled 9557 pages (at 1140 pages/min), scraped 9553 items (at 1145 items/min)
2022-11-11 12:33:40 [scrapy.extensions.logstats] INFO: Crawled 10617 pages (at 1060 pages/min), scraped 10612 items (at 1059 items/min)
2022-11-11 12:34:40 [scrapy.extensions.logstats] INFO: Crawled 11629 pages (at 1012 pages/min), scraped 11623 items (at 1011 items/min)
2022-11-11 12:35:40 [scrapy.extensions.logstats] INFO: Crawled 12592 pages (at 963 pages/min), scraped 12587 items (at 964 items/min)
2022-11-11 12:36:39 [scrapy.extensions.logstats] INFO: Crawled 13499 pages (at 907 pages/min), scraped 13493 items (at 906 items/min)
2022-11-11 12:37:40 [scrapy.extensions.logstats] INFO: Crawled 14368 pages (at 869 pages/min), scraped 14364 items (at 871 items/min)
2022-11-11 12:38:40 [scrapy.extensions.logstats] INFO: Crawled 15161 pages (at 793 pages/min), scraped 15153 items (at 789 items/min)
2022-11-11 12:39:40 [scrapy.extensions.logstats] INFO: Crawled 15884 pages (at 723 pages/min), scraped 15881 items (at 728 items/min)
2022-11-11 12:40:40 [scrapy.extensions.logstats] INFO: Crawled 16665 pages (at 781 pages/min), scraped 16657 items (at 776 items/min)
2022-11-11 12:41:40 [scrapy.extensions.logstats] INFO: Crawled 17417 pages (at 752 pages/min), scraped 17409 items (at 752 items/min)
2022-11-11 12:42:40 [scrapy.extensions.logstats] INFO: Crawled 18140 pages (at 723 pages/min), scraped 18132 items (at 723 items/min)
2022-11-11 12:43:40 [scrapy.extensions.logstats] INFO: Crawled 18844 pages (at 704 pages/min), scraped 18836 items (at 704 items/min)
2022-11-11 12:44:40 [scrapy.extensions.logstats] INFO: Crawled 19528 pages (at 684 pages/min), scraped 19516 items (at 680 items/min)
2022-11-11 12:45:40 [scrapy.extensions.logstats] INFO: Crawled 20188 pages (at 660 pages/min), scraped 20180 items (at 664 items/min)
2022-11-11 12:46:40 [scrapy.extensions.logstats] INFO: Crawled 20836 pages (at 648 pages/min), scraped 20828 items (at 648 items/min)
2022-11-11 12:47:39 [scrapy.extensions.logstats] INFO: Crawled 21460 pages (at 624 pages/min), scraped 21452 items (at 624 items/min)
2022-11-11 12:48:40 [scrapy.extensions.logstats] INFO: Crawled 22014 pages (at 554 pages/min), scraped 22006 items (at 554 items/min)
2022-11-11 12:49:40 [scrapy.extensions.logstats] INFO: Crawled 22588 pages (at 574 pages/min), scraped 22580 items (at 574 items/min)
2022-11-11 12:50:40 [scrapy.extensions.logstats] INFO: Crawled 23159 pages (at 571 pages/min), scraped 23151 items (at 571 items/min)
2022-11-11 12:51:39 [scrapy.extensions.logstats] INFO: Crawled 23731 pages (at 572 pages/min), scraped 23723 items (at 572 items/min)
2022-11-11 12:52:40 [scrapy.extensions.logstats] INFO: Crawled 24299 pages (at 568 pages/min), scraped 24291 items (at 568 items/min)
2022-11-11 12:53:40 [scrapy.extensions.logstats] INFO: Crawled 24847 pages (at 548 pages/min), scraped 24839 items (at 548 items/min)
2022-11-11 12:54:40 [scrapy.extensions.logstats] INFO: Crawled 25385 pages (at 538 pages/min), scraped 25377 items (at 538 items/min)
2022-11-11 12:55:39 [scrapy.extensions.logstats] INFO: Crawled 25917 pages (at 532 pages/min), scraped 25909 items (at 532 items/min)
2022-11-11 12:56:40 [scrapy.extensions.logstats] INFO: Crawled 26441 pages (at 524 pages/min), scraped 26433 items (at 524 items/min)
2022-11-11 12:57:40 [scrapy.extensions.logstats] INFO: Crawled 26953 pages (at 512 pages/min), scraped 26945 items (at 512 items/min)
2022-11-11 12:58:40 [scrapy.extensions.logstats] INFO: Crawled 27442 pages (at 489 pages/min), scraped 27440 items (at 495 items/min)
2022-11-11 12:59:40 [scrapy.extensions.logstats] INFO: Crawled 27882 pages (at 440 pages/min), scraped 27874 items (at 434 items/min)
2022-11-11 13:00:40 [scrapy.extensions.logstats] INFO: Crawled 28372 pages (at 490 pages/min), scraped 28364 items (at 490 items/min)
2022-11-11 13:01:40 [scrapy.extensions.logstats] INFO: Crawled 28856 pages (at 484 pages/min), scraped 28848 items (at 484 items/min)
2022-11-11 13:02:40 [scrapy.extensions.logstats] INFO: Crawled 29332 pages (at 476 pages/min), scraped 29324 items (at 476 items/min)
2022-11-11 13:03:40 [scrapy.extensions.logstats] INFO: Crawled 29800 pages (at 468 pages/min), scraped 29792 items (at 468 items/min)
2022-11-11 13:04:39 [scrapy.extensions.logstats] INFO: Crawled 30260 pages (at 460 pages/min), scraped 30252 items (at 460 items/min)
2022-11-11 13:05:40 [scrapy.extensions.logstats] INFO: Crawled 30720 pages (at 460 pages/min), scraped 30712 items (at 460 items/min)
2022-11-11 13:06:40 [scrapy.extensions.logstats] INFO: Crawled 31166 pages (at 446 pages/min), scraped 31158 items (at 446 items/min)
I've tried AutoThrottle and another VPS, with the same result.
Posted on 2022-11-12 00:08:35
It isn't really slowing down; it only appears that way because of the number of concurrent jobs it is managing at once.
AutoThrottle can help mitigate this behavior, but it only affects one end of the scraping workflow. The spider's output/feed side is also asynchronous and often accumulates a large backlog of concurrent jobs, and iterating over all of them takes progressively longer to clear.
Also, if you have any custom middlewares that happen to be computationally expensive, they can cause a significant slowdown as well.
You can use the CONCURRENT_ITEMS setting to tune how many output items can be processed concurrently, and, similar to AutoThrottle, the CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and CONCURRENT_REQUESTS_PER_IP settings. Adjusting any of these can improve the spider's output rate.
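As a rough sketch, those settings can be tuned per spider via `custom_settings` (the setting names are real Scrapy settings; the numeric values here are illustrative starting points for a single-domain crawl, not recommendations):

```python
# Hypothetical tuning for a single-domain, high-volume crawl.
# Scrapy defaults: CONCURRENT_REQUESTS=16, CONCURRENT_REQUESTS_PER_DOMAIN=8,
# CONCURRENT_ITEMS=100. The values below are example numbers only.
custom_settings = {
    "CONCURRENT_REQUESTS": 64,             # total in-flight requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 32,  # cap for the one allowed domain
    "CONCURRENT_ITEMS": 200,               # parallel items per response in the item pipeline
    "AUTOTHROTTLE_ENABLED": False,         # AutoThrottle caps speed; leave off if the site can take it
}

print(custom_settings["CONCURRENT_ITEMS"])  # 200
```

The same keys can instead go in the project's `settings.py`; `custom_settings` just scopes them to one spider.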
You can also use Scrapy's logging and signals APIs to pinpoint exactly where the slowdown occurs. That said, I should note that Scrapy always runs faster at the start of a crawl: when a crawl is started or restarted, the scheduler is completely empty, so the first items processed move through the workflow quickly.
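A minimal sketch of the bookkeeping you could attach to Scrapy's `request_scheduled` and `response_received` signals to see how long requests sit in the queue. The `LatencyTracker` class below is illustrative, not Scrapy API; in a real spider you would wire its two `record_*` methods up with `crawler.signals.connect` in `from_crawler`:

```python
import time
from statistics import mean


class LatencyTracker:
    """Tracks how long each URL waits between scheduling and response."""

    def __init__(self):
        self._scheduled_at = {}
        self.delays = []

    def record_scheduled(self, url, now=None):
        # Call this from a signals.request_scheduled handler.
        self._scheduled_at[url] = time.monotonic() if now is None else now

    def record_response(self, url, now=None):
        # Call this from a signals.response_received handler.
        started = self._scheduled_at.pop(url, None)
        if started is not None:
            end = time.monotonic() if now is None else now
            self.delays.append(end - started)

    def average_delay(self):
        return mean(self.delays) if self.delays else 0.0


# Simulated usage with explicit timestamps (URLs are placeholders):
t = LatencyTracker()
t.record_scheduled("https://example.com/a", now=0.0)
t.record_scheduled("https://example.com/b", now=1.0)
t.record_response("https://example.com/a", now=2.5)  # waited 2.5 s
t.record_response("https://example.com/b", now=2.0)  # waited 1.0 s
print(t.average_delay())  # 1.75
```

If the average delay keeps climbing while the download time stays flat, the backlog is in the scheduler/pipeline rather than the network, which matches the behavior described above.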
https://stackoverflow.com/questions/74403037