Question: Scrapy pagination: infinite-scroll pagination

Stack Overflow user

Asked 2021-04-18 21:25:19

1 answer · 86 views · 0 followers · 0 votes

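I am trying to scrape data from a website. I have managed to scrape the data from the first page.

For the following pages, however, the site loads its data with AJAX. I set request headers for this, but I cannot get the data from the next page.

If I send the request to the site without those headers, I get the same data back. So perhaps I am not setting the headers correctly to move to the next page. I took the headers from a cURL command.

Where am I going wrong?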

import scrapy


class MenSpider(scrapy.Spider):
    name = "MenCrawler"
    allowed_domains = ['monark.com.pk']

    #define headers and 'custom_constraint' as page
    headers = {        
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
        'accept-language': 'en-PK,en-US;q=0.9,en;q=0.8',
        'key':'274246071',
        'custom_constraint':'custom-filter page=1',
        'view' : 'ajax',
        '_':'1618681277011'
    }
 
    #send request
    def start_requests(self):
        yield scrapy.Request(
            url = 'https://monark.com.pk/collections/t-shirts',
            method = 'GET',       
            headers=self.headers,
            callback=self.update_headers
        )
        

    #response
    def update_headers(self,response):

        #extract all the 12 URLS from the page           
        urls = response.xpath('//h4[@class="h6 m-0 ff-main"]/a/@href').getall()
        for url in urls:   
            yield response.follow(url=url, callback=self.parse)

        #extract the infinite text as 'LOADING'
        load = response.xpath('//div[@class="pagination"]//span/text()').get()

        #Use if Condition for pagination
        if load == 'LOADING':
            page = 1

            #read the current page number from the headers dictionary
            key = self.headers['custom_constraint']     
            current_page = key.split('=')[-1]
            next_pag = page+int(current_page)
            filters = 'custom-filter page='+str(next_pag)                
            self.headers['custom_constraint'] = filters
        
            #request the same URL again for the next page BUT THIS IS NOT WORKING FOR ME
            yield scrapy.Request(
                url = 'https://monark.com.pk/collections/t-shirts',
                method = 'GET',
                headers=self.headers,
                callback=self.update_headers
            )

    def parse(self, response):
        ........

python

web-scraping

scrapy

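1 Answer

Stack Overflow user

Answered 2021-04-19 00:41:06

Your code reuses the same key on every request, which may be why the same page is loaded again. Try removing 'key' from the headers, or work out how the keys are generated.

Here are the keys I found during an initial inspection: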

https://monark.com.pk/collections/t-shirts?key=172181120&custom_constraint=custom-filter+page=4&view=ajax&_=1618763278994
https://monark.com.pk/collections/t-shirts?key=205204897&custom_constraint=custom-filter+page=5&view=ajax&_=1618763278995
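Note that in the two URLs above, `key`, `custom_constraint`, `view` and `_` appear as query-string parameters, whereas the question sends them as request headers. The sketch below is a minimal, standard-library-only illustration of that difference. `build_page_url` and `changed_params` are hypothetical helper names, not part of Scrapy or of the site, and it is only an assumption that the server will accept a request without `key` or `_`; that would need to be checked against the live site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://monark.com.pk/collections/t-shirts"


def build_page_url(page, key=None, timestamp=None):
    """Build the AJAX listing URL for one page (hypothetical helper).

    The parameters go into the query string, mirroring the URLs
    observed in the browser, instead of into the request headers.
    """
    params = {}
    if key is not None:
        # Assumption: 'key' is optional or can be obtained elsewhere.
        params["key"] = key
    params["custom_constraint"] = f"custom-filter page={page}"
    params["view"] = "ajax"
    if timestamp is not None:
        # '_' looks like a cache-busting timestamp (an assumption).
        params["_"] = timestamp
    # safe="=" keeps 'page=5' unescaped, matching the observed URLs.
    return f"{BASE_URL}?{urlencode(params, safe='=')}"


def changed_params(url_a, url_b):
    """Return the sorted names of query parameters that differ between two URLs."""
    query_a = parse_qs(urlsplit(url_a).query)
    query_b = parse_qs(urlsplit(url_b).query)
    names = query_a.keys() | query_b.keys()
    return sorted(n for n in names if query_a.get(n) != query_b.get(n))


if __name__ == "__main__":
    page_4 = build_page_url(4, key="172181120", timestamp="1618763278994")
    page_5 = build_page_url(5, key="205204897", timestamp="1618763278995")
    print(page_5)
    print(changed_params(page_4, page_5))
```

In a spider, the URL returned by `build_page_url` could be passed as the `url` argument of `scrapy.Request`, with the page number incremented in the callback, so that each request asks for a different page rather than repeating the same collection URL. Comparing captured URLs with `changed_params` shows which values change from page to page, which is a starting point for working out how `key` is generated.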

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/67149091
