When trying to scrape several URLs with Scrapy through a Selenium middleware, I keep getting errors.
Middleware.py:
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse

class SeleniumMiddleWare(object):
    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except:
            pass
        content = self.driver.page_source
        self.driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response

Spider.py:
import scrapy

class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    # allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        table = response.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr.app')
        for element in rows:
            # str.join() would interleave the prefix between the characters
            # of the href; plain concatenation builds the absolute URL
            link = "https://steamdb.info" + element.css('::attr(href)').get()
            name = element.css('a ::text')[0].get()
            game_info = {"Link": link, "Name": name}
            yield scrapy.Request(url=link, callback=self.parse_info, cb_kwargs=dict(game_info=game_info))

    def parse_info(self, response, game_info):
        game_info["sales"] = response.xpath('//*[@id="graphs"]/div[5]/div[2]/ul/li[1]/strong/span/text()').getall()
        yield game_info

Note: the scraper works as long as it does not use cb_kwargs, issue new requests, or follow links. If I only scrape the page in start_urls it works, but as soon as I make new requests to other URLs or follow links, it fails.
Error:
2022-07-12 20:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamdb.info/graph/> (referer: https://steamdb.info/graph/)
2022-07-12 20:53:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:52304/session/99578d3d4f168c77b58a85f67be06927/execute/sync {"script": "return navigator.webdriver", "args": []}
2022-07-12 20:53:54 [urllib3.connectionpool] DEBUG: Resetting dropped connection: localhost
2022-07-12 20:53:56 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:56 [urllib3.connectionpool] WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5EB66EC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
2022-07-12 20:53:56 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (2): localhost:52304
2022-07-12 20:53:58 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:58 [urllib3.connectionpool] WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5ED6C970>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync

Posted on 2022-07-12 19:06:30
"the target machine actively refused it" means the machine responded, but the specified port (52304) is closed. Can you check whether you can reach it? Perhaps a local firewall is blocking it?
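The reachability check suggested above can be done with a plain TCP connect from the standard library (52304 is the chromedriver port from the log; any host/port pair works):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False if it is
    refused or times out (the 'actively refused' case from the traceback)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("localhost", 52304) -> False once chromedriver has exited
```

If this returns False for the port chromedriver was started on, nothing is listening there anymore, which matches the WinError 10061 in the log.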
UPD: it looks like .quit() is called inside every process_request, so the driver is gone after the first URL. Either restart the driver for each request, or don't call .quit() until the crawl is finished.
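One way to apply that fix is to create the driver once and move the .quit() into a spider_closed handler, so it runs exactly once at the end of the crawl. A minimal sketch, mirroring the question's middleware (undetected_chromedriver and Scrapy are assumed to be installed; their imports are deferred into the methods that need them):

```python
class SeleniumMiddleWare:
    """Reuses one Chrome driver for the whole crawl; quits it on spider close."""

    def __init__(self):
        import undetected_chromedriver as uc  # assumed installed
        options = uc.ChromeOptions()
        options.headless = True
        self.driver = uc.Chrome(options=options, use_subprocess=True)

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals
        mw = cls()
        # run mw.spider_closed once, when the spider finishes
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse
        self.driver.get(request.url)
        # no .quit() here -- the driver stays alive for follow-up requests
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        # previously this ran inside process_request, killing the driver
        # after the first URL; now it runs exactly once
        self.driver.quit()
```

With this layout, the second request no longer hits a dead chromedriver port, which is what produced the retry loop in the log.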
https://stackoverflow.com/questions/72957179