
Scrapy not using Crawlera

Stack Overflow user
Asked 2015-08-11 10:19:02
1 answer · 1.2K views · 0 following · 0 votes

I have been using Crawlera with Scrapy and it has been great. However, I changed my API key in the Crawlera dashboard and have not been able to get Crawlera working since. I contacted their customer support and they said the API key is working fine. I decided to try Crawlera with the example from the Scrapy documentation. No luck. Scrapy is making requests to "dmoz.org" instead of paygo.com. I have scrapy-crawlera installed as well as scrapy.

The log is below:

[scrapy] INFO: Using crawlera at http://paygo.crawlera.com:8010?noconnect (user: [my_api_key])
2015-08-10 19:16:24 [scrapy] DEBUG: Telnet console listening on [my_ip_address]
2015-08-10 19:16:26 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2015-08-10 19:16:26 [scrapy] INFO: Closing spider (finished)
2015-08-10 19:16:26 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16445,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 11, 2, 16, 26, 990760),
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'log_count/WARNING': 2,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 8, 11, 2, 16, 24, 720987)}
2015-08-10 19:16:26 [scrapy] INFO: Spider closed (finished)

Any help or ideas as to why this is happening would be greatly appreciated.

#settings file
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 600}
CRAWLERA_ENABLED = True
CRAWLERA_USER = '[my_api_key]'
CRAWLERA_PASS = ''
CRAWLERA_PRESERVE_DELAY = True
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600

# items file
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

#spider file
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
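One way to confirm whether responses are actually coming back through Crawlera is to look for the `X-Crawlera-*` response headers (e.g. `X-Crawlera-Version`) that the proxy adds; a direct fetch from dmoz.org would not carry them. The helper below is an illustrative sketch, written as a plain function so it works on any mapping of header names:

```python
def via_crawlera(headers):
    """Return True if any X-Crawlera-* header is present, which
    indicates the response was proxied through Crawlera rather
    than fetched directly."""
    return any(name.lower().startswith('x-crawlera-') for name in headers)

# Inside the spider's parse() this could be used as:
#   hdrs = {k.decode(): v for k, v in response.headers.items()}
#   self.logger.info("via crawlera: %s", via_crawlera(hdrs))
```

If the log above never shows such headers, the middleware is likely not applying the proxy to the requests at all.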

1 Answer

Stack Overflow user

Answered 2015-08-20 06:41:16

In your settings.py file, you need to configure your "DOWNLOADER_MIDDLEWARES".

For example:

DUPEFILTER = True
COOKIES_ENABLED = False
RANDOMIZE_DOWNLOAD_DELAY = True
SCHEDULER_ORDER = 'BFO'

CRAWLERA_ENABLED = True
CRAWLERA_USER = 'user'
CRAWLERA_PASS = 'password'

# Middleware dict values must be integer priorities (or None to
# disable), not booleans or proxy URLs. RefererMiddleware is a
# spider middleware, so it belongs in SPIDER_MIDDLEWARES, and
# CrawleraMiddleware sets the proxy itself, so HttpProxyMiddleware
# needs no explicit entry here.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 600,
}
Votes: -1
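For reference, when pointing Scrapy at Crawlera by hand (without the scrapy-crawlera middleware), the credentials go in a `Proxy-Authorization` header as HTTP Basic auth: base64 of `user:password`, where Crawlera uses the API key as the username and an empty password. A minimal sketch of building that header value (the key below is a placeholder, not a real credential):

```python
import base64

def proxy_auth_header(api_key, password=''):
    """Build a Proxy-Authorization value for Basic auth:
    'Basic ' + base64('user:password'). For Crawlera the API key
    is the username and the password is empty."""
    token = base64.b64encode(f'{api_key}:{password}'.encode('ascii')).decode('ascii')
    return f'Basic {token}'

# Example with a placeholder key:
# proxy_auth_header('abc')  ->  'Basic YWJjOg=='
```

In a spider this would be set per-request, e.g. `request.meta['proxy'] = 'http://proxy.crawlera.com:8010'` together with `request.headers['Proxy-Authorization'] = proxy_auth_header(api_key)`.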
Page content originally provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/31932153
