
Horizontally scaling Scrapyd

Stack Overflow user
Asked 2015-07-25 02:39:05
1 answer · 869 views · 0 followers · Score: 7

What tool or set of tools would you use to horizontally scale Scrapyd, dynamically adding new machines to a Scrapyd cluster and running N instances per machine if required? It is not necessary for all instances to share a common job queue, but that would be great.

Scrapy-cluster seems promising for the job, but I want a Scrapyd-based solution, so I am open to other alternatives and suggestions.


1 Answer

Stack Overflow user

Answered 2017-11-10 01:13:57

I scripted my own load balancer using Scrapyd's API and a wrapper.

from random import shuffle

from scrapyd_api.wrapper import ScrapydAPI

import settings  # the author's project settings module, providing
                 # SERVERS_URLS, DEFAULT_PROJECT and ACCEPTABLE_PENDING


class JobLoadBalancer(object):

    @classmethod
    def get_less_occupied(
            cls,
            servers_urls=settings.SERVERS_URLS,
            project=settings.DEFAULT_PROJECT,
            acceptable=settings.ACCEPTABLE_PENDING):

        free_runner = {'num_jobs': 9999, 'client': None}
        # Shuffle so repeated calls don't always probe servers in the same order
        shuffle(servers_urls)
        for url in servers_urls:
            scrapyd = ScrapydAPI(target=url)
            jobs = scrapyd.list_jobs(project)
            num_jobs = len(jobs['pending'])

            if free_runner['num_jobs'] > num_jobs:
                free_runner['num_jobs'] = num_jobs
                free_runner['client'] = scrapyd
            # Optimization: stop searching once a server with an acceptable
            # number of pending jobs has been found
            if free_runner['client'] and free_runner['num_jobs'] <= acceptable:
                break

        return free_runner['client']

Unit test:

import unittest

from scrapyd_api.wrapper import ScrapydAPI


class TestFactory(unittest.TestCase):

    def setUp(self):
        super(TestFactory, self).setUp()
        # Make sure these servers are running before executing the test
        settings.SERVERS_URLS = [
            'http://localhost:6800',
            'http://localhost:6900'
        ]
        self.project = 'dummy'
        self.spider = 'dummy_spider'
        self.acceptable = 0

    def test_get_less_occupied(self):
        # Add dummy jobs to the first server so that the balancer
        # chooses the second one
        scrapyd = ScrapydAPI(target=settings.SERVERS_URLS[0])
        scrapyd.schedule(project=self.project, spider=self.spider)
        scrapyd.schedule(project=self.project, spider=self.spider)
        second_server_url = settings.SERVERS_URLS[1]
        scrapyd = JobLoadBalancer.get_less_occupied(
            servers_urls=settings.SERVERS_URLS,
            project=self.project,
            acceptable=self.acceptable)
        self.assertEqual(scrapyd.target, second_server_url)
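The test above expects two Scrapyd instances listening locally on ports 6800 and 6900. One way to satisfy that precondition is a directory per instance, each with its own `scrapyd.conf` setting a distinct `http_port` (a sketch; the `node1`/`node2` layout is an assumption, not part of the original answer):

```shell
# One working directory per Scrapyd instance; each gets its own
# scrapyd.conf so the two instances bind different ports.
mkdir -p node1 node2
printf '[scrapyd]\nhttp_port = 6800\n' > node1/scrapyd.conf
printf '[scrapyd]\nhttp_port = 6900\n' > node2/scrapyd.conf
# Then start one instance per directory (Scrapyd picks up the
# scrapyd.conf in its current working directory):
#   (cd node1 && scrapyd) &
#   (cd node2 && scrapyd) &
```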

This code targets an older version of Scrapyd, as it was written more than a year ago.
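The selection logic in `get_less_occupied` can be sketched independently of Scrapyd, with a plain dictionary of pending-job counts standing in for the `list_jobs` responses (a minimal illustration, not code from the answer):

```python
from random import shuffle

def pick_least_pending(job_counts, acceptable=0):
    """Return the server URL with the fewest pending jobs, stopping
    early once a server at or below the acceptable threshold is found."""
    urls = list(job_counts)
    shuffle(urls)  # avoid always probing servers in the same order
    best_url, best_count = None, float('inf')
    for url in urls:
        if job_counts[url] < best_count:
            best_url, best_count = url, job_counts[url]
        if best_count <= acceptable:
            break  # good enough; skip the remaining servers
    return best_url

# The idle server always wins, whatever the shuffle order:
print(pick_least_pending({'http://a:6800': 5, 'http://b:6800': 0}))
# -> http://b:6800
```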

Score: 3
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/31617562