For the past few days I have been struggling with a Scrapy/Twisted script that is supposed to run different spiders and analyze their output. Unfortunately, MySpider2 depends on the feed produced by MySpider1, so it can only run after MySpider1 has finished. On top of that, MySpider1 and MySpider2 need different settings. So far I have not found a solution that runs the spiders sequentially, each with its own settings. I have read the Scrapy CrawlerRunner and CrawlerProcess documentation and tried several related Stack Overflow questions (running multiple spiders sequentially, "Scrapy: how to run two crawlers one after the other?", running multiple spiders from a script, and others), all without success.
Following the documentation on running spiders sequentially, my (slightly modified) code looks like this:
from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner
spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # MORE settings are here
}, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # MORE settings are here
}]

spiders = [MySpider1, MySpider2]

process = CrawlerRunner(spider_settings[0])
process = CrawlerRunner(spider_settings[1])  # Not sure if this is how it's supposed to be used
# for multiple settings, but moving this line to just before "yield process.crawl(spiders[1])"
# also results in an error.

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()

crawl()
reactor.run()

With this code, however, only the first spider is executed, and without any of the settings. So I tried CrawlerProcess instead, which worked slightly better:
# ... same imports, spider_settings and spiders as above ...

process = CrawlerProcess(spider_settings[0])
process = CrawlerProcess(spider_settings[1])
@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()

crawl()
reactor.run()

This code executes both spiders, but at the same time rather than in the intended order. In addition, it overrides the settings of both spiders after a moment, so abc/log.log is cut off after only two lines and logging for both spiders continues in 123/log.log.
In a perfect world, my snippet would work like this: MySpider1 runs to completion with the first settings dict, and only then does MySpider2 start with the second one.

Thanks in advance for your help.
Answered on 2020-06-08 04:06:20
Keep the runners separate and it should work:
process_1 = CrawlerRunner(spider_settings[0])
process_2 = CrawlerRunner(spider_settings[1])
# ...

@defer.inlineCallbacks
def crawl():
    yield process_1.crawl(spiders[0])
    yield process_2.crawl(spiders[1])
    reactor.stop()

# ...
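For completeness, here is a minimal runnable sketch of that fix, assuming the import paths and settings dicts from the question. The structure follows the Scrapy documentation's example for running spiders sequentially; configure_logging() is included because CrawlerRunner, unlike CrawlerProcess, does not set up logging by itself:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# Import paths as given in the question.
from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
}, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
}]

# One runner per settings dict, so the two configurations never share state.
configure_logging()
runner_1 = CrawlerRunner(spider_settings[0])
runner_2 = CrawlerRunner(spider_settings[1])

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl's Deferred to fire,
    # so MySpider2 starts only after MySpider1 has finished.
    yield runner_1.crawl(MySpider1)
    yield runner_2.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called in crawl()

A separate runner per settings dict is the key point: in the original snippets, the second assignment simply rebinds the name process, so both crawl() calls go through the runner built from the second settings dict, which is why the logs for both spiders ended up in 123/log.log.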
Source: https://stackoverflow.com/questions/62252561