For a project I am running a large number of Scrapy requests against a particular search query. The requests use the same search terms but different time ranges, as shown by the dates in the URLs below.

Although the URLs refer to different dates and pages, I receive the same value in the output for every request. It appears the script takes the first value it obtains and assigns that same output to all subsequent requests.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2005%2Ccd_max%3A12%2F31%2F2005&tbm=nws',
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2006%2Ccd_max%3A12%2F31%2F2006&tbm=nws',
    ]

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item

I found a thread discussing a similar problem with BeautifulSoup. The solution there was to add headers to the script so that it uses a browser as its user agent:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

Applying headers in Scrapy, however, seems to work differently. Does anyone know how best to include them in Scrapy, particularly with reference to start_urls, which contains several URLs at once?
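Since the three start_urls differ only in the year inside the cd_min/cd_max range, one way to avoid hand-editing them is to generate them with the standard library's urlencode. This is a sketch, not part of the original question; the helper name is made up:

```python
from urllib.parse import urlencode


def build_start_urls(query, years):
    """Build one Google News search URL per calendar year."""
    urls = []
    for year in years:
        params = {
            'q': query,
            # Custom date range: Jan 1 through Dec 31 of the given year.
            'tbs': f'cdr:1,cd_min:01/01/{year},cd_max:12/31/{year}',
            'tbm': 'nws',
        }
        urls.append('https://www.google.com/search?' + urlencode(params))
    return urls


start_urls = build_start_urls('Activision', range(2004, 2007))
print(len(start_urls))  # 3
```

urlencode percent-encodes the colons, commas, and slashes in the tbs value, producing the same `cdr%3A1%2Ccd_min...` form as the URLs above.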
Posted on 2019-02-15 06:10:19
You do not need to modify the headers here. What you need to set is the user agent, which Scrapy lets you do directly:
import scrapy


class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    # ...

Now you will get output like this:
'results': 'About 357 results', ...
'results': 'About 215 results', ...
'results': 'About 870 results', ...

Posted on 2019-10-01 05:13:22
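If you then want the numeric counts out of strings like 'About 357 results', a small post-processing helper (a sketch, not part of the original answer) can extract them:

```python
import re


def result_count(text):
    """Pull the integer out of a Google result-count string such as
    'About 357 results'; thousands separators are handled too."""
    match = re.search(r'([\d,]+)\s+results', text)
    return int(match.group(1).replace(',', '')) if match else None


print(result_count('About 357 results'))    # 357
print(result_count('About 1,230 results'))  # 1230
```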
According to the Scrapy 1.7.3 documentation, your headers should not be generic ones copied from elsewhere; they should match the site you are actually scraping. You can find the right headers in the Network tab of your browser's developer console.

Add them as shown below and print the response.
# -*- coding: utf-8 -*-
import scrapy


class AaidSpider(scrapy.Spider):
    name = 'aaid'

    def start_requests(self):
        url = "https://www.eventscribe.com/2019/AAOMS-CSIOMS/ajaxcalls/PresenterInfo.asp?efp=SVNVS1VRTEo4MDMx&PresenterID=597498&rnd=0.8680339"
        # Set the headers here, copied from the browser's Network tab.
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Host': 'www.eventscribe.com',
            'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest'
        }
        # Send the request; it must be yielded, or Scrapy never schedules it.
        yield scrapy.Request(url, method='GET', headers=headers, dont_filter=False)

    def parse(self, response):
        print(response.body)  # if the response is HTML
        # If the response is JSON: import json, then
        # jsonresponse = json.loads(response.text)
        # print(jsonresponse)

https://stackoverflow.com/questions/54699365
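The JSON branch mentioned in the comments above can be sketched with the stdlib json module. The body string below is a stand-in for response.text; the real PresenterInfo.asp endpoint is not queried here, so both the payload and its field names are made up for illustration:

```python
import json

# Stand-in for response.text (the decoded response body in Scrapy).
body = '{"PresenterID": 597498, "PresenterName": "Example Speaker"}'

data = json.loads(body)
print(data['PresenterID'])  # 597498
```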