Q: Scrapy spider does not follow links even though they are collected and parsed

Stack Overflow user

Asked 2016-06-17 18:45:32

Answers: 1, Views: 154, Followers: 0, Votes: 0

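I have been stuck on this problem for hours. I could not get the spider to follow links on this site using the Rule syntax, so I manually located every link I need to request. Even though I verified that the extracted links are valid URLs, my spider does not crawl the additional pages. I also don't find the Scrapy documentation very helpful here, because it only presents idealized textbook cases. Can anyone help?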

# -*- coding: utf-8 -*-
import scrapy
import logging
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Request

from banker.items import BarclaysOfferItem

class BarclaySpider(CrawlSpider):
    name = "barclay"
    allowed_domains = ['partners.barclaycardrewardsboost.com/']
    start_urls = [
        'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=1&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers'
        # 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=2&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers'
        # 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=3&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers',
        # 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=4&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers',
        # 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=5&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers',
        # 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=6&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers',
        # 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=7&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers'
    ]

    def parse(self, response):

        base = 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm'
        links = response.xpath('//p[contains(@class, "mn_pageLinks")]/a')

        for sel in links:
            url = base + str(sel.xpath('@href').extract()[0])

            logging.info(url)

            yield scrapy.Request(url, callback=self.parse_item)


    def parse_item(self, reponse):
        for sel in response.xpath('//table/tr'):
            item = BarclaysOfferItem()
            item['merchant'] = sel.xpath('td/div/a[last()]/text()').extract()
            item['rate'] = sel.xpath('td/span/a/text()').extract()
            item['offer'] = sel.xpath('td/a[last()]/text()').extract()
            item['coupon_code'] = sel.xpath('td[@class="mn_cpCode"]/text()').extract()
            item['expiration_date'] = sel.xpath('td[@class="mn_expiry"]/text()').extract()
            yield item

Update #1

Removing the allowed_domains list made my requests go through. However, I now keep getting NameError: global name 'response' is not defined.

python

scrapy

web-crawler

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-06-17 19:29:19

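I finally got it working!

According to the Scrapy documentation, when OffsiteMiddleware is enabled, requests for URLs whose domains are not in the allowed_domains list are not followed. I know my URLs are on the specified domain, but I think the way the site queries its data makes the URLs look as if they were off-site.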
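The filtering described above can be illustrated with a small standard-library sketch. It is a simplified stand-in for a hostname check against allowed_domains, not Scrapy's actual implementation, and the helper names `build_host_regex` and `is_onsite` are hypothetical. It also tests one plausible explanation that the answer itself does not confirm: the entry in the question ends with a slash, so it can never equal a hostname.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile a pattern that accepts each allowed domain and its subdomains."""
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_onsite(url, allowed_domains):
    """Return True if the URL's hostname matches one of the allowed domains."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


url = "https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=2"

# Entry copied from the question: a hostname never contains the trailing slash.
print(is_onsite(url, ["partners.barclaycardrewardsboost.com/"]))  # False

# Bare domain name, without scheme, path, or trailing slash.
print(is_onsite(url, ["partners.barclaycardrewardsboost.com"]))  # True
```

If this assumption holds, keeping allowed_domains with the bare domain name would be an alternative to deleting the list, while still restricting the crawl to that site.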

# -*- coding: utf-8 -*-
import logging

import scrapy
from scrapy.spiders import Spider

from banker.items import BarclaysOfferItem


class BarclaySpider(Spider):
    name = "barclay"
    # allowed_domains is deliberately omitted; with it removed, the
    # follow-up requests are no longer dropped as off-site.
    start_urls = [
        'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm?rows=25&page=1&ref_page_id=2167&ref_section_id=9720&ref_section_title=All%20Online%20Offers',
    ]

    # Parse for the links of interest
    def parse(self, response):

        base = 'https://partners.barclaycardrewardsboost.com/shopping/sp____.htm'
        links = response.xpath('//p[contains(@class, "mn_pageLinks")]/a')
        for sel in links:
            url = base + str(sel.xpath('@href').extract()[0])
            logging.info(url)
            yield scrapy.Request(url, callback=self.parse_item)    

    # parse for the items of interest
    def parse_item(self, response):
        for sel in response.xpath('//table/tr'):
            item = BarclaysOfferItem()
            item['merchant'] = sel.xpath('td/div/a[last()]/text()').extract()
            item['rate'] = sel.xpath('td/span/a/text()').extract()
            item['offer'] = sel.xpath('td/a[last()]/text()').extract()
            item['coupon_code'] = sel.xpath('td[@class="mn_cpCode"]/text()').extract()
            item['expiration_date'] = sel.xpath('td[@class="mn_expiry"]/text()').extract()
            yield item
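The NameError reported in the question's update is most likely unrelated to domain filtering: the question's parse_item declares its parameter as `reponse` but uses `response` in the body, whereas the code above spells both the same way. The sketch below reproduces that mismatch without Scrapy; `FakeResponse` and both function names are hypothetical stand-ins for illustration only.

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; it only carries a URL."""

    def __init__(self, url):
        self.url = url


def parse_item_misspelled(reponse):
    # The parameter is spelled "reponse", so "response" is an undefined name here.
    return response.url


def parse_item_fixed(response):
    # Parameter and body use the same name, so the lookup succeeds.
    return response.url


page = FakeResponse("https://partners.barclaycardrewardsboost.com/shopping/sp____.htm")

try:
    parse_item_misspelled(page)
except NameError as exc:
    print("NameError:", exc)

print(parse_item_fixed(page))
```

The error only appears once the callback actually runs, which would explain why it surfaced after the requests stopped being filtered.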

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/37888374
