Question: scrapy.Request() prevents me from entering my function

Stack Overflow user

Asked 2018-01-22 03:54:51

Answers: 2, Views: 55, Followers: 0, Votes: 0

Hi everyone. I am new to Scrapy and have run into a strange problem. In short, I found that scrapy.Request() prevents me from entering my function.

Here is my code:

# -*- coding: utf-8 -*-
import scrapy
from tutor_job_spy.items import TutorJobSpyItem

class Spyspider(scrapy.Spider):
    name = 'spy'
    #for privacy reasons I delete the url information :)
    allowed_domains = ['']
    url_0 = ''
    start_urls = [url_0, ]
    base_url = ''
    list_previous = []
    list_present = []

    def parse(self, response):
        numbers = response.xpath('//tr[@bgcolor="#d7ecff" or @bgcolor="#eef7ff"]/td[@width="8%" and @height="40"]/span/text()').extract()
        self.list_previous = numbers
        self.list_present = numbers
        yield scrapy.Request(self.url_0, self.keep_spying)

    def keep_spying(self, response):
        numbers = response.xpath('//tr[@bgcolor="#d7ecff" or @bgcolor="#eef7ff"]/td[@width="8%" and @height="40"]/span/text()').extract()
        self.list_previous = self.list_present
        self.list_present = numbers
        # judge if anything new
        if (self.list_present != self.list_previous):  
            self.goto_new_demand(response)
        #time.sleep(60)  #from cache
        yield scrapy.Request(self.url_0, self.keep_spying, dont_filter=True)

    def goto_new_demand(self, response):
        new_demand_links = []
        detail_links = response.xpath('//div[@class="ShowDetail"]/a/@href').extract()
        for i in range(len(self.list_present)):
            if (self.list_present[ i] not in self.list_previous):  
                new_demand_links.append(self.base_url + detail_links[i])
        if (new_demand_links != []):
            for new_demand_link in new_demand_links:
                yield scrapy.Request(new_demand_link, self.get_new_demand)

    def get_new_demand(self, response):
        new_demand = TutorJobSpyItem()
        new_demand['url'] = response.url
        requirments = response.xpath('//tr[@bgcolor="#eef7ff"]/td[@colspan="2"]/div/text()').extract()[0]
        new_demand['gender'] = self.get_gender(requirments)
        new_demand['region'] = response.xpath('//tr[@bgcolor="#d7ecff"]/td[@align="left"]/text()').extract()[5]
        new_demand['grade'] = response.xpath('//tr[@bgcolor="#d7ecff"]/td[@align="left"]/text()').extract()[7]
        new_demand['subject'] = response.xpath('//tr[@bgcolor="#eef7ff"]/td[@align="left"]/text()').extract()[2]
        return new_demand

    def get_gender(self, requirments):
        if ('女老师' in requirments):
            return 'F'
        elif ('男老师' in requirments):
            return 'M'
        else:
            return 'Both okay'

The problem is that when I debug, I cannot step into goto_new_demand.

if (self.list_present != self.list_previous):  
    self.goto_new_demand(response)

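Every time I run or debug this script, it skips goto_new_demand. After I comment out yield scrapy.Request(new_demand_link, self.get_new_demand) in goto_new_demand, however, I can step into it. I tried many times and found that I can enter goto_new_demand only when it contains no yield scrapy.Request(new_demand_link, self.get_new_demand). Why does this happen?

Thanks in advance to anyone who can offer advice :)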

PS:

Scrapy : 1.5.0
lxml : 4.1.1.0
libxml2 : 2.9.5
cssselect : 1.0.3
parsel : 1.3.1
w3lib : 1.18.0
Twisted : 17.9.0
Python : 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017)
cryptography : 2.1.4
Platform : Windows-7-6.1.7601-SP1

Problem solved!

I changed goto_new_demand from a generator into an ordinary function. So the problem came entirely from my limited understanding of yield (generators).

Here is the modified code:

if (self.list_present != self.list_previous):
    # yield self.goto_new_demand(response)
    new_demand_links = self.goto_new_demand(response)
    if (new_demand_links != []):
        for new_demand_link in new_demand_links:
            yield scrapy.Request(new_demand_link, self.get_new_demand)

def goto_new_demand(self, response):
    new_demand_links = []
    detail_links = response.xpath('//div[@class="ShowDetail"]/a/@href').extract()
    for i in range(len(self.list_present)):
        if (self.list_present[ i] not in self.list_previous):
            new_demand_links.append(self.base_url + detail_links[i])
    return new_demand_links
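
The fix works because goto_new_demand is now an ordinary function: calling it runs its body immediately and returns a list, and the yield statements stay in the callback that Scrapy iterates. Below is a minimal sketch of the same comparison outside Scrapy. The name find_new_links, the job numbers, and the example.com URLs are all made up for illustration; they are not from the original spider.

```python
def find_new_links(previous, present, detail_links, base_url):
    """Return full URLs for entries of `present` that are absent from `previous`.

    Hypothetical stand-in for goto_new_demand; it takes its inputs as
    arguments instead of reading them from the spider instance.
    """
    seen = set(previous)
    # zip pairs each job number with the link at the same position
    return [base_url + href
            for number, href in zip(present, detail_links)
            if number not in seen]


# Made-up data: job 1003 appeared since the previous poll.
previous = ["1001", "1002"]
present = ["1003", "1001", "1002"]
detail_links = ["/job/1003", "/job/1001", "/job/1002"]

new_links = find_new_links(previous, present, detail_links, "http://example.com")
print(new_links)
```

One design difference from the code above: zip stops at the shorter of the two lists, whereas indexing detail_links[i] raises IndexError if the page has fewer links than job numbers. Which behavior is preferable depends on how the page is structured.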

The reason is explained in Barak's answer.

python

scrapy

web-crawler

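2 Answers

Stack Overflow user

Answered 2018-01-22 06:21:20

The proper way to debug Scrapy spiders is described in the documentation. A particularly useful technique is using the Scrapy shell to inspect responses.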

Votes: 0

Stack Overflow user

Answered 2018-01-22 06:35:59

I think you may need to change this statement

if (self.list_present != self.list_previous):  
    self.goto_new_demand(response)

to:

if (self.list_present != self.list_previous):  
    yield self.goto_new_demand(response)

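Because self.goto_new_demand() is a generator (its body contains a yield statement), simply calling self.goto_new_demand(response) does not run any of its code.

A simple example may make this clearer. First, an ordinary function: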

def a():
    print("hello")

# calling a() prints hello
a()

But for a generator, calling it only returns a generator object:

def a():
    yield
    print("hello")

# calling a() does not print hello; it returns a generator object instead
a()
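
To check which kind of function you are dealing with, and to see when a generator's body actually runs, here is a small sketch using only the standard library. The names plain and gen are illustrative and not part of the spider.

```python
import inspect


def plain():
    return "hello"


def gen():
    # A single yield anywhere in the body makes this a generator function.
    yield "hello"


is_gen_plain = inspect.isgeneratorfunction(plain)
is_gen = inspect.isgeneratorfunction(gen)

g = gen()        # creates a generator object; the body has not started
first = next(g)  # runs the body up to the first yield

print(is_gen_plain, is_gen, first)
```

This matches what the asker saw in the debugger: with the yield in place, calling goto_new_demand only builds a generator object, so there is nothing to step into until something iterates it.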

So in Scrapy, you should use yield self.goto_new_demand(response) to make goto_new_demand(response) actually run.
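
One caveat is worth checking before relying on this. In plain Python, yield gen() hands the generator object itself to whoever consumes the callback, and the body runs only if that consumer iterates it. As far as I know, Scrapy 1.5 expects a callback to produce Request objects, items, dicts, or None, so it may reject a nested generator rather than iterate it. On Python 3, yield from re-yields each item individually instead. The sketch below shows the three variants with plain strings standing in for scrapy.Request; all function names are illustrative.

```python
import types

calls = []  # records each time the helper's body actually starts running


def goto_new_demand(links):
    """Generator helper: its body runs only when something iterates it."""
    calls.append("entered")
    for link in links:
        yield "Request(" + link + ")"  # stand-in for scrapy.Request


def callback_bare_call(links):
    goto_new_demand(links)  # builds a generator object and discards it
    yield "next poll"


def callback_yield_generator(links):
    yield goto_new_demand(links)  # passes the generator object itself along
    yield "next poll"


def callback_yield_from(links):
    yield from goto_new_demand(links)  # re-yields each item from the helper
    yield "next poll"


links = ["/a", "/b"]

bare = list(callback_bare_call(links))
calls_after_bare = list(calls)

nested = list(callback_yield_generator(links))
calls_after_nested = list(calls)

flat = list(callback_yield_from(links))

print(bare)
print(type(nested[0]).__name__)
print(flat)
```

On Python 2, where yield from is unavailable, an explicit loop (for request in self.goto_new_demand(response): yield request) has the same effect.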

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/48374455
