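I am trying to scrape the results that appear after entering a keyword on https://postal-code.co.uk/, but when I request the keyword link directly, https://postal-code.co.uk/ajax/search.php?word=Edward+Avenue,+Camberley,+Surrey,+GU15&geocodeProvider=1, I am redirected to another page that only says "Restricted". I tried adding a Referer to the request headers with the command scrapy.http.Request(url='https://postal-code.co.uk/ajax/search.php?word=Edward+Avenue%2C+Camberley%2C+Surrey%2C+GU15&geocodeProvider=1',headers={'Referer':'https://postal-code.co.uk/'}), but that still does not solve it. Please help. Thanks.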
Posted on 2022-01-12 02:47:58
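The website loads its search results via AJAX. It also checks the headers and cookies sent with the request. See the sample Scrapy spider below, in which I send the specific headers that are required. Scrapy handles the cookies automatically.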
import scrapy


class PostalSpider(scrapy.Spider):
    name = 'postal'
    allowed_domains = ['postal-code.co.uk']
    # Start at the home page so Scrapy receives the site's cookies
    # before the AJAX endpoint is requested.
    start_urls = ['https://postal-code.co.uk']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }

    def parse(self, response):
        # Headers the site checks on the search request.
        headers = {
            'accept': 'application/json, text/javascript, */*; q=0.01',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9',
            'referer': 'https://postal-code.co.uk/',
            'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Linux"',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
        }
        # Request the AJAX search URL with those headers; cookies from the
        # home page response are attached by Scrapy's cookie middleware.
        yield response.follow("https://postal-code.co.uk/ajax/search.php?word=Edward+Avenue,+Camberley,+Surrey,+GU15&geocodeProvider=1",
                              callback=self.parse_results, headers=headers)

    def parse_results(self, response):
        # The endpoint returns JSON; yield each entry as an item.
        for result in response.json():
            yield result

[Screenshot: sample run of the spider]
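
To search for other keywords, the query string can be built rather than typed by hand. Below is a minimal sketch, assuming the endpoint accepts the same two parameters (`word` and `geocodeProvider`) seen in the question's URL. The helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the constant name is hypothetical.
SEARCH_ENDPOINT = "https://postal-code.co.uk/ajax/search.php"


def build_search_url(keyword, geocode_provider=1):
    """Return the AJAX search URL for an arbitrary keyword (hypothetical helper)."""
    # urlencode uses quote_plus by default: spaces become '+', commas '%2C'.
    query = urlencode({"word": keyword, "geocodeProvider": geocode_provider})
    return f"{SEARCH_ENDPOINT}?{query}"


print(build_search_url("Edward Avenue, Camberley, Surrey, GU15"))
# → https://postal-code.co.uk/ajax/search.php?word=Edward+Avenue%2C+Camberley%2C+Surrey%2C+GU15&geocodeProvider=1
```

The resulting string could be passed to `response.follow` in place of the hard-coded URL, with the same headers as in the spider.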

https://stackoverflow.com/questions/70668253