
How to dynamically create the JOBDIR setting in a Scrapy Spider?

Stack Overflow user
Asked on 2018-09-07 17:17:00
2 answers · 388 views · 0 followers · 0 votes

I want to create the JOBDIR setting from the spider's __init__, or set it dynamically when the spider is invoked. I want a different JOBDIR for each spider, like the FEED_URI in the example below:

import scrapy


class QtsSpider(scrapy.Spider):
    name = 'qts'
    custom_settings = {
        'FEED_URI': 'data_files/' + '%(site_name)s.csv',
        'FEED_FORMAT': "csv",
        #'JOBDIR': 'resume/' + '%(site_name2)s'
    }
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def __init__(self, **kw):
        super(QtsSpider, self).__init__(**kw)
        self.site_name = kw.get('site_name')

    def parse(self, response):
        # rest of the parsing code
        pass

We invoke the script like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main_function():
    all_spiders = ['spider1','spider2','spider3'] # 3 different spiders
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        process.crawl('qts', site_name=spider_name)

    process.start()

main_function()

How can the JOBDIR be created dynamically for different spiders, the way FEED_URI is? Any help would be greatly appreciated.


2 Answers

Stack Overflow user

Answered on 2018-09-07 23:29:45

Exactly the same way you are setting site_name, you can pass another argument:

process.crawl('qts', site_name=spider_name, jobdir='dirname that you want to keep')

It will be made available as a spider attribute, so you can write:

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)  # crawl kwargs become instance attributes
    jobdir = getattr(self, 'jobdir', None)

    if jobdir:
        self.custom_settings['JOBDIR'] = jobdir
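To see the attribute lookup in isolation, here is a minimal stand-in in plain Python, with no Scrapy dependency; the `FakeSpider` class is an illustrative assumption that only emulates how keyword arguments passed to `process.crawl()` end up as spider attributes:

```python
class FakeSpider:
    """Illustrative stand-in, not a real scrapy.Spider: it only emulates
    how keyword arguments passed to process.crawl() become attributes."""
    custom_settings = {}

    def __init__(self, **kwargs):
        # Scrapy sets crawl kwargs as instance attributes; emulate that here.
        for key, value in kwargs.items():
            setattr(self, key, value)
        # Same lookup as in the answer: fall back to None when no jobdir is given.
        jobdir = getattr(self, 'jobdir', None)
        if jobdir:
            # Copy before mutating so instances don't share one class-level dict.
            self.custom_settings = {**self.custom_settings, 'JOBDIR': jobdir}


spider = FakeSpider(site_name='spider1', jobdir='resume/spider1')
print(spider.custom_settings)  # {'JOBDIR': 'resume/spider1'}
```

One caveat: in real Scrapy, custom_settings is applied at the class level before the spider instance is created, so a JOBDIR set this late in __init__ may not take effect; the extension approach in the second answer sidesteps that ordering problem.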
0 votes

Stack Overflow user

Answered on 2020-08-20 06:00:03

I found myself needing the same feature, mainly to avoid repeatedly adding a custom JOBDIR to every spider's custom_settings attribute. So I created a simple extension that subclasses the original SpiderState extension Scrapy uses to save the crawl state.

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState
import os


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each spider based on spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each spider.custom_settings dict
        Simply specify the base JOBDIR in settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)

        obj = cls(spider_jobdir)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj
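The directory logic inside from_crawler can be exercised on its own. Below is a minimal stdlib-only sketch; the helper name `spider_jobdir` is an assumption introduced for illustration, not part of the extension above:

```python
import os
import tempfile


def spider_jobdir(base_jobdir, spider_name):
    """Mirror the extension's from_crawler logic: one subdirectory per spider.
    (Assumes spider_name is filesystem-safe, as Scrapy spider names usually are.)"""
    path = os.path.join(base_jobdir, spider_name)
    os.makedirs(path, exist_ok=True)  # idempotent across resumed runs
    return path


base = tempfile.mkdtemp()
for name in ('spider1', 'spider2', 'spider3'):
    print(spider_jobdir(base, name))
```

Using exist_ok=True makes the call safe to repeat, which matters here because the whole point of JOBDIR is resuming a crawl into a directory that already exists.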

To enable it, remember to add the appropriate settings to your settings.py, like so:

EXTENSIONS = {
    # We want to disable the original SpiderState extension and use our own
    "scrapy.extensions.spiderstate.SpiderState": None,
    "spins.extensions.SpiderStateManager": 0
}
JOBDIR = "C:/Users/CaffeinatedMike/PycharmProjects/ScrapyDapyDoo/jobs"
0 votes
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/52219321
