Total newbie here, just getting started with Scrapy.
My directory structure looks like this...
#FYI: running on Scrapy 2.4.1
WebScraper/
    WebScraper/
        spiders/
            spider.py    # NOTE: contains the spider1 and spider2 classes
        items.py
        middlewares.py
        pipelines.py     # NOTE: contains spider1_pipelines and spider2_pipelines
        settings.py      # NOTE: here I wrote:
                         #   ITEM_PIPELINES = {
                         #       'WebScraper.pipelines.spider1_pipelines': 300,
                         #       'WebScraper.pipelines.spider2_pipelines': 300,
                         #   }
    scrapy.cfg

spider2.py looks like this...
import scrapy

class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

pipelines.py looks like this...
import csv

class spider1_pipelines(object):
    def __init__(self):
        # Open the output file once and write the header row
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = [item['header1'], item['header2']]
        self.csvwriter.writerow(row)  # was misspelled "self.csvwrite"
        return item

class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        # NOTE: header_a/header_b are not the same fields as header1/header2
        row = [item['header_a'], item['header_b']]
        self.csvwriter.writerow(row)
        return item
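(As context: Scrapy passes the spider that produced an item into process_item, so a single pipeline can also route output by spider.name instead of relying on two always-on classes. A minimal illustrative sketch; the class name RoutingCsvPipeline and the header mapping are invented for this example, built on the standard open_spider/close_spider pipeline hooks:)

import csv

class RoutingCsvPipeline:  # hypothetical class, not from the question
    HEADERS = {
        'spider1': ['header1', 'header2'],
        'spider2': ['header_a', 'header_b'],
    }

    def open_spider(self, spider):
        # One output file per spider, named after the spider
        self.file = open(f'{spider.name}.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)
        self.fields = self.HEADERS[spider.name]
        self.csvwriter.writerow(self.fields)

    def process_item(self, item, spider):
        # Only look up the fields this spider actually produces
        self.csvwriter.writerow([item[f] for f in self.fields])
        return item

    def close_spider(self, spider):
        self.file.close()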
I have a question about running spider1 and spider2 on different urls with a single terminal command:

nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log

Note: this is a follow-up to an earlier question of mine (2018).
Desired result: spider1.csv with the data from spider1, and spider2.csv with the data from spider2.

Current result: spider1.csv gets spider1's data, but spider2.csv breaks; the error log contains spider2's data plus a KeyError: 'header1', even though spider2's items don't include header1 at all, only header_a.
Does anyone know how to run these spiders one after another on different urls, and route the data each spider scrapes into the pipeline specific to that spider, i.e. spider1 -> spider1_pipelines -> spider1.csv and spider2 -> spider2_pipelines -> spider2.csv?

Or is this a matter of defining spider1_item and spider2_item in items.py? I wonder whether spider2's data can be pointed to the right place that way.
Thanks!
Posted on 2021-01-15 08:21:44
You can achieve this with the custom_settings spider attribute, which sets ITEM_PIPELINES for each spider individually. Right now both pipelines are enabled project-wide in settings.py, so every item from every spider passes through both of them; that is why spider2's items reach spider1_pipelines and raise KeyError: 'header1'.
# spider2.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    # ...

class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    # ...

https://stackoverflow.com/questions/65727683
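To also launch both crawls from a single command, one option is a small launcher script following the "run spiders sequentially in the same process" pattern from the Scrapy docs. A minimal sketch; the script name run_spiders.py and the import path for the spider classes are assumptions based on the directory layout above:

# run_spiders.py -- hypothetical launcher script
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

# Assumed import path: both classes live in WebScraper/spiders/spider.py
from WebScraper.spiders.spider import OneSpider, TwoSpider

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # Each crawl finishes before the next starts; each spider's
    # custom_settings selects its own pipeline.
    yield runner.crawl(OneSpider)
    yield runner.crawl(TwoSpider)
    reactor.stop()

crawl()
reactor.run()  # blocks until both crawls are done

Running python run_spiders.py would then replace chaining two scrapy crawl commands with &.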