I wrote a spider to crawl a large site. I deployed it on Scrapinghub and use the Crawlera add-on. Without Crawlera, the spider runs fine on Scrapinghub. As soon as I switch to the Crawlera middleware, the spider exits without crawling anything.
I ran the spider without Crawlera and it works both on my local machine and on Scrapinghub; the only thing I changed was enabling the middleware for Crawlera. Without it the spider runs, with it it does not. I have set the concurrent requests to the C10 plan limit:
CRAWLERA_APIKEY = <apikey>
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 10
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
DOWNLOADER_MIDDLEWARES = {
#'ytscraper.middlewares.YtscraperDownloaderMiddleware': 543,
'scrapy_crawlera.CrawleraMiddleware': 300
}
Here is the log dump:
2019-02-06 05:54:34 INFO Log opened.
1: 2019-02-06 05:54:34 INFO [scrapy.log] Scrapy 1.5.1 started
2: 2019-02-06 05:54:34 INFO [scrapy.utils.log] Scrapy 1.5.1 started (bot: ytscraper)
3: 2019-02-06 05:54:34 INFO [scrapy.utils.log] Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.15 (default, Nov 16 2018, 23:19:37) - [GCC 4.9.2], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Linux-4.4.0-141-generic-x86_64-with-debian-8.11
4: 2019-02-06 05:54:34 INFO [scrapy.crawler] Overridden settings: {'NEWSPIDER_MODULE': 'ytscraper.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 'CONCURRENT_REQUESTS': 10, 'SPIDER_MODULES': ['ytscraper.spiders'], 'AUTOTHROTTLE_ENABLED': True, 'LOG_ENABLED': False, 'DOWNLOAD_TIMEOUT': 600, 'MEMUSAGE_LIMIT_MB': 950, 'BOT_NAME': 'ytscraper', 'TELNETCONSOLE_HOST': '0.0.0.0'}
5: 2019-02-06 05:54:34 INFO [scrapy.middleware] Enabled extensions: More
6: 2019-02-06 05:54:34 INFO [scrapy.middleware] Enabled downloader middlewares: Less
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
u'scrapy_crawlera.CrawleraMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
7: 2019-02-06 05:54:34 INFO [scrapy.middleware] Enabled spider middlewares: Less
['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware',
'sh_scrapy.middlewares.HubstorageSpiderMiddleware',
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
8: 2019-02-06 05:54:34 INFO [scrapy.middleware] Enabled item pipelines: More
9: 2019-02-06 05:54:34 INFO [scrapy.core.engine] Spider opened
10: 2019-02-06 05:54:34 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
11: 2019-02-06 05:54:34 INFO [root] Using crawlera at http://proxy.crawlera.com:8010 (user: 11b143d...)
12: 2019-02-06 05:54:34 INFO [root] CrawleraMiddleware: disabling download delays on Scrapy side to optimize delays introduced by Crawlera. To avoid this behaviour you can use the CRAWLERA_PRESERVE_DELAY setting but keep in mind that this may slow down the crawl significantly
13: 2019-02-06 05:54:34 INFO TelnetConsole starting on 6023
14: 2019-02-06 05:54:40 INFO [scrapy.core.engine] Closing spider (finished)
15: 2019-02-06 05:54:40 INFO [scrapy.statscollectors] Dumping Scrapy stats: More
16: 2019-02-06 05:54:40 INFO [scrapy.core.engine] Spider closed (finished)
17: 2019-02-06 05:54:40 INFO Main loop terminated.
Below is the log of the same spider without the Crawlera middleware:
0: 2019-02-05 17:42:13 INFO Log opened.
1: 2019-02-05 17:42:13 INFO [scrapy.log] Scrapy 1.5.1 started
2: 2019-02-05 17:42:13 INFO [scrapy.utils.log] Scrapy 1.5.1 started (bot: ytscraper)
3: 2019-02-05 17:42:13 INFO [scrapy.utils.log] Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.15 (default, Nov 16 2018, 23:19:37) - [GCC 4.9.2], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Linux-4.4.0-135-generic-x86_64-with-debian-8.11
4: 2019-02-05 17:42:13 INFO [scrapy.crawler] Overridden settings: {'NEWSPIDER_MODULE': 'ytscraper.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'CONCURRENT_REQUESTS_PER_DOMAIN': 32, 'CONCURRENT_REQUESTS': 32, 'SPIDER_MODULES': ['ytscraper.spiders'], 'AUTOTHROTTLE_ENABLED': True, 'LOG_ENABLED': False, 'DOWNLOAD_TIMEOUT': 600, 'MEMUSAGE_LIMIT_MB': 950, 'BOT_NAME': 'ytscraper', 'TELNETCONSOLE_HOST': '0.0.0.0'}
5: 2019-02-05 17:42:13 INFO [scrapy.middleware] Enabled extensions: More
6: 2019-02-05 17:42:14 INFO [scrapy.middleware] Enabled downloader middlewares: More
7: 2019-02-05 17:42:14 INFO [scrapy.middleware] Enabled spider middlewares: More
8: 2019-02-05 17:42:14 INFO [scrapy.middleware] Enabled item pipelines: More
9: 2019-02-05 17:42:14 INFO [scrapy.core.engine] Spider opened
10: 2019-02-05 17:42:14 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
11: 2019-02-05 17:42:14 INFO [root] Using crawlera at http://proxy.crawlera.com:8010 (user: 11b143d...)
12: 2019-02-05 17:42:14 INFO [root] CrawleraMiddleware: disabling download delays on Scrapy side to optimize delays introduced by Crawlera. To avoid this behaviour you can use the CRAWLERA_PRESERVE_DELAY setting but keep in mind that this may slow down the crawl significantly
13: 2019-02-05 17:42:14 INFO TelnetConsole starting on 6023
14: 2019-02-05 17:43:14 INFO [scrapy.extensions.logstats] Crawled 17 pages (at 17 pages/min), scraped 16 items (at 16 items/min)
15: 2019-02-05 17:44:14 INFO [scrapy.extensions.logstats] Crawled 35 pages (at 18 pages/min), scraped 34 items (at 18 items/min)
16: 2019-02-05 17:45:14 INFO [scrapy.extensions.logstats] Crawled 41 pages (at 6 pages/min), scraped 40 items (at 6 items/min)
17: 2019-02-05 17:45:30 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
18: 2019-02-05 17:45:30 INFO [scrapy.core.engine] Closing spider (shutdown)
19: 2019-02-05 17:45:38 INFO [scrapy.statscollectors] Dumping Scrapy stats: More
20: 2019-02-05 17:45:38 INFO [scrapy.core.engine] Spider closed (shutdown)
21: 2019-02-05 17:45:38 INFO Main loop terminated.
I wrote a Python script to test my Crawlera connection:
import requests

response = requests.get(
    "https://www.youtube.com",
    proxies={
        "http": "http://<APIkey>:@proxy.crawlera.com:8010/",
    },
)
print(response.text)
This works, but no matter what I try I cannot get the spider to work with the Crawlera middleware.
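One side note on that test (my own observation, not from the original post): requests chooses a proxy entry by URL scheme, so with only an "http" key in proxies, an https:// URL is fetched directly and never goes through Crawlera. A sketch that adds an "https" entry so the request really uses the proxy could look like this:

import requests

# Point both schemes at the Crawlera endpoint; requests picks the
# entry that matches the request URL's scheme.
proxies = {
    "http": "http://<APIkey>:@proxy.crawlera.com:8010/",
    "https": "http://<APIkey>:@proxy.crawlera.com:8010/",
}

# Depending on how Crawlera handles TLS, trusting the Crawlera CA
# certificate may also be needed (an assumption; check the docs).
response = requests.get("https://www.youtube.com", proxies=proxies)
print(response.status_code)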
I want to use Crawlera because I want the same results without getting banned so quickly.
Please help.
Posted on 2019-02-06 21:57:38
Your settings are missing CRAWLERA_ENABLED = True.
See the Configuration section of the scrapy-crawlera documentation for more information.
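As a minimal sketch, the Crawlera-related part of settings.py would then look like this (the API key placeholder and the middleware priority 300 are copied from the question; CRAWLERA_ENABLED and CRAWLERA_APIKEY are the settings documented by scrapy-crawlera):

# Enable the Crawlera middleware explicitly; without this flag the
# middleware is registered but typically stays inactive.
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<apikey>'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 300,
}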
Posted on 2019-02-09 07:42:06
The data in the logs does not match the problem description. In both cases the spider used the Crawlera proxy, because both logs contain the following line:
INFO [root] Using crawlera at http://proxy.crawlera.com:8010 (user: 11b143d...)
According to the scrapy_crawlera.CrawleraMiddleware source code, this means CrawleraMiddleware was enabled in both runs. I would need additional data from the logs, at least the stats (the lines at the end of the log that dump the Scrapy stats).
For now I have the following hypothesis:
According to the first log, you did not override the cookies settings, and CookiesMiddleware is enabled.
By default, Scrapy enables cookie handling.
Websites usually use cookies to track visitor activity/sessions.
If a website receives requests carrying a single sessionId from multiple IPs (which is exactly what any spider does with Crawlera and cookies both enabled), its servers can detect the proxy usage and ban every IP involved via the unique sessionId stored in the cookies. In that case the spider stops working because of the IP ban. (Other Crawlera users will also be unable to send requests to that site for a while.)
Cookies should be disabled by setting COOKIES_ENABLED to False, as in the sketch below.
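A one-line sketch of that change in settings.py (COOKIES_ENABLED is a built-in Scrapy setting; everything else stays as in the question):

# Disable Scrapy's cookie handling so every Crawlera IP does not share
# the same session cookie with the target site.
COOKIES_ENABLED = False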
https://stackoverflow.com/questions/54551104