I'm trying to scrape data from a website, and I've managed to get the data from its first page.
For the next pages the site loads data via AJAX, so I set the request headers accordingly, but I still can't get the data for the next page.
If I send the request without those headers, I get back the same data anyway, so perhaps I'm not setting the headers correctly to move to the next page. I copied the headers from cURL.
Where am I going wrong?

import scrapy

class MenSpider(scrapy.Spider):
    name = "MenCrawler"
    allowed_domains = ['monark.com.pk']
    # define headers; 'custom_constraint' carries the page number
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
        'accept-language': 'en-PK,en-US;q=0.9,en;q=0.8',
        'key': '274246071',
        'custom_constraint': 'custom-filter page=1',
        'view': 'ajax',
        '_': '1618681277011'
    }

    # send the initial request
    def start_requests(self):
        yield scrapy.Request(
            url='https://monark.com.pk/collections/t-shirts',
            method='GET',
            headers=self.headers,
            callback=self.update_headers
        )

    def update_headers(self, response):
        # extract all 12 product URLs from the page
        urls = response.xpath('//h4[@class="h6 m-0 ff-main"]/a/@href').getall()
        for url in urls:
            yield response.follow(url=url, callback=self.parse)
        # the infinite-scroll indicator text is 'LOADING'
        load = response.xpath('//div[@class="pagination"]//span/text()').get()
        # use it as the pagination condition
        if load == 'LOADING':
            # read the page number out of the 'custom_constraint' header
            key = self.headers['custom_constraint']
            current_page = key.split('=')[-1]
            next_page = int(current_page) + 1
            self.headers['custom_constraint'] = 'custom-filter page=' + str(next_page)
            # request the same URL again for the next page, BUT THIS IS NOT WORKING FOR ME
            yield scrapy.Request(
                url='https://monark.com.pk/collections/t-shirts',
                method='GET',
                headers=self.headers,
                callback=self.update_headers
            )

    def parse(self, response):
        ...

Posted on 2021-04-19 00:41:06
Your code is reusing the same key, which is probably why the same page keeps loading. Try removing 'key' from the headers, or work out how those values are generated.
Here are the keys I found on a first inspection:
https://monark.com.pk/collections/t-shirts?key=172181120&custom_constraint=custom-filter+page=4&view=ajax&_=1618763278994
https://monark.com.pk/collections/t-shirts?key=205204897&custom_constraint=custom-filter+page=5&view=ajax&_=1618763278995
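Judging from those URLs, key, custom_constraint, view and _ look like query-string parameters rather than request headers. A minimal sketch of building the next-page URL that way (build_page_url is a hypothetical helper; how the site generates key is unknown, so it is simply passed in):

```python
from urllib.parse import urlencode

BASE_URL = 'https://monark.com.pk/collections/t-shirts'

def build_page_url(page, key):
    # 'key' appears to change per request and its generation is unknown,
    # so this hypothetical helper just takes it as an argument.
    params = {
        'key': key,
        'custom_constraint': 'custom-filter page={}'.format(page),
        'view': 'ajax',
    }
    # safe='=' keeps the '=' inside custom_constraint literal,
    # matching the URLs observed above
    return BASE_URL + '?' + urlencode(params, safe='=')

print(build_page_url(2, '274246071'))
# https://monark.com.pk/collections/t-shirts?key=274246071&custom_constraint=custom-filter+page=2&view=ajax
```

Inside the spider such a URL would go into a new scrapy.Request. Note also that Scrapy's default duplicate filter drops repeated requests to the same URL, and headers are not part of the request fingerprint, so re-yielding the identical URL with only changed headers is silently filtered unless you pass dont_filter=True.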