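When I enable the following Frontera middlewares in Scrapy,
I lose the Referer header on all of my response objects.
Is there any way I can keep the Referer?
When I remove the following lines the Referer is available, but I need these Frontera middlewares enabled.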
SPIDER_MIDDLEWARES.update({
'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
Also, RefererMiddleware is enabled; I can see it in the debug log when Scrapy starts.
Edit: here is the full content of my settings file:
BOT_NAME = 'crawler'
SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'
USER_AGENT = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36"
DOWNLOAD_DELAY = 2
DUPEFILTER=True
ITEM_PIPELINES = {
'crawler.pipelines.AllDataPipeline': 300
}
SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]
REFERER_ENABLED = True
######################################################################
# Frontera Settings
#######################################################################
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'
HTTPCACHE_ENABLED = False
REDIRECT_ENABLED = True
COOKIES_ENABLED = False
DOWNLOAD_TIMEOUT = 20
RETRY_ENABLED = False
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 2
LOGSTATS_INTERVAL = 10
SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}
SPIDER_MIDDLEWARES.update({
'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 699,
})
DOWNLOADER_MIDDLEWARES.update({
'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
Posted on 2016-03-04 14:24:37
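DUPEFILTER cannot be True; it is not a boolean switch. You can set the filter class instead, for example:
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"
If you want a request to bypass the dupe filter, you can pass the dont_filter=True keyword argument to scrapy.Request.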
https://stackoverflow.com/questions/32335210