
Scraping multiple pages with multiple start_urls

Stack Overflow user
Asked on 2021-05-02 08:11:38
1 answer · 105 views · 0 following · 1 vote

I want to scrape details from JSON responses using Scrapy. There are multiple start_urls, and each start_url has multiple pages to scrape. I just can't work out the logic for doing this.

import scrapy
from scrapy.http import Request

BASE_URL = ["https://www.change.org/api-proxy/-/tags/animals-19/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/civic/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/human-rights-en-in/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/child-rights-2/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/health-9/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/environment-18/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/education-en-in/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/women-s-rights-13/petitions?offset={}&limit=8&show_promoted_cards=true"
        ]

class ChangeSpider(scrapy.Spider):
    name = 'change'

    def start_requests(self):
        for i in range(len(BASE_URL)):
            yield Request(BASE_URL[i], callback = self.parse)

    pageNumber = 11

    def parse(self, response):
        data = response.json()
        for item in range(len(data['items'])):
            yield {
                "petition_id": data['items'][item]['petition']['id'],
            }

        next_page = "https://www.change.org/api-proxy/-/tags/animals-19/petitions?offset=" + str(ChangeSpider.pageNumber) + "&limit=8&show_promoted_cards=true"       
        if data['last_page'] == False:
            ChangeSpider.pageNumber += 1
            yield response.follow(next_page, callback=self.parse) 

1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-05-05 06:09:13

Like this:

import scrapy
from scrapy.http import Request


class ChangeSpider(scrapy.Spider):
    name = 'change'

    start_urls = ["https://www.change.org/api-proxy/-/tags/animals-19/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/civic/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/human-rights-en-in/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/child-rights-2/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/health-9/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/environment-18/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/education-en-in/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/women-s-rights-13/petitions?offset={}&limit=8&show_promoted_cards=true"
        ]

    pageNumber = 11

    def parse(self, response):
        data = response.json()
        for item in range(len(data['items'])):
            yield {
                "petition_id": data['items'][item]['petition']['id'],
            }

        next_page = "https://www.change.org/api-proxy/-/tags/animals-19/petitions?offset=" + str(ChangeSpider.pageNumber) + "&limit=8&show_promoted_cards=true"       
        if data['last_page'] == False:
            ChangeSpider.pageNumber += 1
            yield response.follow(next_page, callback=self.parse) 
1 vote
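Note that the accepted answer still builds next_page from the hard-coded animals-19 URL and a single class-level pageNumber shared by all eight tags, so the other tags never advance past their first page, and the offset counter is global rather than per tag. One way to paginate each start URL independently is to derive the next offset from response.url itself. A minimal sketch of that idea, using only the standard library (the helper name next_offset_url is mine, not from the original post):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def next_offset_url(url, step=8):
    """Return `url` with its 'offset' query parameter advanced by `step`.

    Deriving the next page from response.url, instead of a hard-coded
    URL plus a class-level counter, lets every start URL paginate
    on its own.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    offset = int(query.get("offset", ["0"])[0])  # missing offset treated as 0
    query["offset"] = [str(offset + step)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

# Inside ChangeSpider.parse, the pagination step would then become:
#
#     if not data['last_page']:
#         yield response.follow(next_offset_url(response.url), callback=self.parse)
```

With this, the spider no longer needs the pageNumber class attribute at all: each response carries its own URL, so each tag's offset advances independently of the others.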
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/67354221
