When trying to scrape several URLs with Scrapy through a Selenium middleware, I keep getting errors.
Middleware.py:
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse

class SeleniumMiddleWare(object):
    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except:
            pass
        content = self.driver.page_source
        self.driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response

Spider.py:
import scrapy

class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    # allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        table = response.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr.app')
        for element in rows:
            # str.join() would interleave the prefix between the characters
            # of the href; plain concatenation builds the absolute URL
            link = "https://steamdb.info" + element.css('::attr(href)').get()
            name = element.css('a ::text')[0].get()
            game_info = {"Link": link, "Name": name}
            yield scrapy.Request(url=link, callback=self.parse_info, cb_kwargs=dict(game_info=game_info))

    def parse_info(self, response, game_info):
        game_info["sales"] = response.xpath('//*[@id="graphs"]/div[5]/div[2]/ul/li[1]/strong/span/text()').getall()
        yield game_info

Note: the scraper works as long as it does not use cb_kwargs, issue new requests, or follow links. If I only scrape the page in start_urls it works, but as soon as I make new requests to other URLs or follow links, it fails.
Error:
2022-07-12 20:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamdb.info/graph/> (referer: https://steamdb.info/graph/)
2022-07-12 20:53:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:52304/session/99578d3d4f168c77b58a85f67be06927/execute/sync {"script": "return navigator.webdriver", "args": []}
2022-07-12 20:53:54 [urllib3.connectionpool] DEBUG: Resetting dropped connection: localhost
2022-07-12 20:53:56 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:56 [urllib3.connectionpool] WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5EB66EC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
2022-07-12 20:53:56 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (2): localhost:52304
2022-07-12 20:53:58 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:58 [urllib3.connectionpool] WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5ED6C970>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync

Posted on 2022-07-12 19:06:30
"the target machine actively refused it" means the machine responded, but the specified port (52304) is closed. Can you check whether you can reach it? Perhaps a local firewall is blocking it?
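The reachability check suggested above can be done with a plain TCP connect from the standard library (52304 is the chromedriver port from the log; any host/port pair works):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False if it is
    refused or times out (the 'actively refused' case from the traceback)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("localhost", 52304) -> False once chromedriver has exited
```

If this returns False for the port chromedriver was started on, nothing is listening there anymore, which matches the WinError 10061 in the log.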
UPD: it looks like .quit() is called inside every process_request, so the driver is gone after the first URL. Either restart the driver for each request, or don't call .quit() until the crawl is finished.
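One way to apply that fix is to create the driver once and move the .quit() into a spider_closed handler, so it runs exactly once at the end of the crawl. A minimal sketch, mirroring the question's middleware (undetected_chromedriver and Scrapy are assumed to be installed; their imports are deferred into the methods that need them):

```python
class SeleniumMiddleWare:
    """Reuses one Chrome driver for the whole crawl; quits it on spider close."""

    def __init__(self):
        import undetected_chromedriver as uc  # assumed installed
        options = uc.ChromeOptions()
        options.headless = True
        self.driver = uc.Chrome(options=options, use_subprocess=True)

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals
        mw = cls()
        # run mw.spider_closed once, when the spider finishes
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse
        self.driver.get(request.url)
        # no .quit() here -- the driver stays alive for follow-up requests
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        # previously this ran inside process_request, killing the driver
        # after the first URL; now it runs exactly once
        self.driver.quit()
```

With this layout, the second request no longer hits a dead chromedriver port, which is what produced the retry loop in the log.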
https://stackoverflow.com/questions/72957179