I'm looking for a Scrapy Spider that, instead of fetching URLs and crawling them, takes WARC files as input (preferably from S3) and sends the content to the parse method.
Actually, I need to skip the whole download phase, meaning I want to return a Response from the start_requests method that is then sent to the parse method.
Here is what I have so far:
import gzip

import warc
from scrapy import Spider
from scrapy.http import Response


class WarcSpider(Spider):
    name = "warc_spider"

    def start_requests(self):
        f = warc.WARCFile(fileobj=gzip.open("file.war.gz"))
        for record in f:
            if record.type == "response":
                payload = record.payload.read()
                headers, body = payload.split('\r\n\r\n', 1)
                url = record['WARC-Target-URI']
                yield Response(url=url, status=200, body=body, headers=headers)

    def parse(self, response):
        # code that creates the item
        pass

Any thoughts on the Scrapy way of doing that?
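One detail worth noting in the snippet above: Scrapy's Response expects headers as a mapping, while splitting the payload on '\r\n\r\n' leaves them as one raw string. A minimal sketch of parsing that raw block into a status code and a header dict, assuming a well-formed HTTP/1.x response payload as found in WARC "response" records (the function name is illustrative):

```python
def parse_http_payload(payload):
    """Split a raw HTTP response payload into (status, headers, body).

    Assumes CRLF line endings and a status line like "HTTP/1.1 200 OK",
    which is what WARC 'response' records normally contain.
    """
    head, _, body = payload.partition('\r\n\r\n')
    lines = head.split('\r\n')
    # The status code is the second token of the status line.
    status = int(lines[0].split(' ')[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return status, headers, body
```

The resulting tuple can then be passed on as `Response(url=url, status=status, body=body, headers=headers)` instead of hard-coding status 200.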
Posted on 2014-11-27 20:21:43
What you want to do is something like this:
import gzip

import warc
from scrapy import Spider
from scrapy.http import Request, Response


class DummyMdw(object):

    def process_request(self, request, spider):
        record = request.meta['record']
        payload = record.payload.read()
        headers, body = payload.split('\r\n\r\n', 1)
        url = record['WARC-Target-URI']
        # Returning a Response from a downloader middleware short-circuits
        # the download: Scrapy sends it straight to the spider callback.
        return Response(url=url, status=200, body=body, headers=headers)


class WarcSpider(Spider):
    name = "warc_spider"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'x.DummyMdw': 1}
    }

    def start_requests(self):
        f = warc.WARCFile(fileobj=gzip.open("file.war.gz"))
        for record in f:
            if record.type == "response":
                url = record['WARC-Target-URI']
                yield Request(url, callback=self.parse, meta={'record': record})

    def parse(self, response):
        # code that creates the item
        pass

https://stackoverflow.com/questions/27174640