Question: Understanding Scrapy's CrawlSpider rules

Asked by a Stack Overflow user on 2014-08-23 07:50:07

1 answer, 4K views, 0 followers, 8 votes

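I'm having trouble understanding how to use the rules field in a spider that inherits from CrawlSpider. My spider is trying to crawl the Yellow Pages listings for pizza in San Francisco.

I have tried to keep my rules simple, just to see whether the spider would crawl any of the links in the response, but I don't see that happening. The only result is that it yields the request for the next page, and then yields the request for the page after that.

I have two questions:

1. Does the spider process the rules before calling the callback when a response is received, or the other way around?
2. When are the rules applied?

EDIT: I figured it out. I had overridden the parse method inherited from CrawlSpider. After looking at the parse method in that class, I realized that it is the method that checks the rules and crawls the matching links.

NOTE: Know what you are overriding.

Here is my code: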

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from yellowPages.items import YellowpagesItem
from scrapy.http import Request

class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    businesses = []

    # start with one page
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (Rule (SgmlLinkExtractor()
    , callback="parse_items", follow= True),
    )

    base_url = 'http://www.yellowpages.com'

    def parse(self, response):
        yield Request(response.url, callback=self.parse_business_listings_page)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        return []

    def parse_business_listings_page(self, response):
        print "Visiting %s" % response.url

        self.businesses.append(self.extract_businesses_from_response(response))
        hxs = Selector(response)
        li_tags = hxs.xpath('//*[@id="main-content"]/div[4]/div[5]/ul/li')
        next_exist = False

        # Check to see if there's a "Next". If there is, store the links.
        # If not, return. 
        # This requires a linear search through the list of li_tags. Is there a faster way?
        for li in li_tags:
            li_text = li.xpath('.//a/text()').extract()
            li_data_page = li.xpath('.//a/@data-page').extract()
            # Note: sometimes li_text is an empty list so check to see if it is nonempty first
            if (li_text and li_text[0] == 'Next'):
                next_exist = True
                next_page_num = li_data_page[0]
                url = 'http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza&page='+next_page_num
                yield Request(url, callback=self.parse_business_listings_page)

Tags: python, scrapy, rules, web-crawler

1 answer

Answered by a Stack Overflow user on 2017-03-25 17:12:44

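On to your two questions.

1. The crawl rules are processed before a request is made. They are applied to each downloaded response to extract links, and the follow-up requests are built from those links. A link that points outside the allowed domains is filtered out, so it is dropped rather than crawled.

2. Again, the crawl rules are applied before the request is made.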
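To make the ordering concrete, here is a toy model in plain Python. It is not Scrapy's actual implementation: ToyCrawlSpider, GoodSpider, BrokenSpider, the dict-shaped response, and the tuple-shaped rules are all hypothetical stand-ins. It only sketches the behavior as I understand it from the CrawlSpider source: the inherited parse() passes the response to its callback, then applies the rules to the links in that response and emits the follow-up requests.

```python
class ToyCrawlSpider:
    """Toy stand-in for CrawlSpider. Not Scrapy's actual implementation."""

    # Each rule is (predicate over a URL, name of the callback method).
    rules = ()

    def parse(self, response, callback=None):
        # Step 1: hand the response to the callback it was requested with.
        if callback is not None:
            yield from getattr(self, callback)(response)
        # Step 2: apply every rule to the links found in this response and
        # emit one follow-up request per match, before anything is fetched.
        for matches, rule_callback in self.rules:
            for link in response["links"]:
                if matches(link):
                    yield ("request", link, rule_callback)


class GoodSpider(ToyCrawlSpider):
    # One rule: follow pagination links and send them to parse_item.
    rules = ((lambda url: "page=" in url, "parse_item"),)

    def parse_item(self, response):
        yield ("item", response["url"])


class BrokenSpider(GoodSpider):
    # Overriding parse() replaces the rule-handling logic entirely.
    def parse(self, response, callback=None):
        yield ("item", response["url"])


response = {"url": "/pizza", "links": ["/pizza?page=2", "/about"]}
good = list(GoodSpider().parse(response, callback="parse_item"))
broken = list(BrokenSpider().parse(response, callback="parse_item"))
print(good)    # the item, then one request produced by the rule
print(broken)  # only the item; the rule never runs
```

In this model the spider that overrides parse() never emits a rule-based request, which matches what the question's EDIT describes.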

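Note!

Your example defines a parse() method. I would have to run your code to confirm the exact behavior, but for anyone reading: unless you deliberately intend to override it, do not define parse() in a CrawlSpider. CrawlSpider uses parse() internally to implement its own rule logic, so overriding it stops the rules from being applied. The CrawlSpider equivalent of a plain Spider's parse() is a callback with a different name, such as parse_item(). For the same reason, parse should not be used as a callback in the rule set.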

https://doc.scrapy.org/en/latest/topics/spiders.html

2 votes

Original content from Stack Overflow.

Original link: https://stackoverflow.com/questions/25459719
