
Scraped items producing duplicate values

Stack Overflow user
Asked on 2016-05-05 15:26:57
1 answer · 1.1K views · 0 followers · 0 votes

I'm hoping for a nudge in the right direction on this problem.

Below is a spider that:

  1. Crawls a listing page and retrieves summary details for each record (10 rows per page)
  2. Follows each record's URL and extracts the detail information from its page
  3. Moves on to the next listing page

Problem: the detail information for each record comes out fine, but every record also carries the summary details of the last record on the same listing page.

Simplified example:

URL    DA     Detail1        Detail2
9      9      0              0
9      9      1              1
9      9      2              2
9      9      3              3
9      9      4              4
9      9      5              5
9      9      6              6
9      9      7              7
9      9      8              8
9      9      9              9

Using scrapy shell, I can iterate manually and get the correct values, as follows:

import scrapy
from cbury_scrapy.items import DA

for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
    r = scrapy.Selector(text=row.extract(), type="html")
    print r.xpath('//td[@class="datrack_danumber_cell"]//text()').extract_first(), r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()[-5:]

Output:

SC-18/2016 HQQM=
DA-190/2016 HQwQ=
DA-192/2016 HQAk=
S68-122/2016 HQgM=
DA-191/2016 HQgc=
DA-223/2015/A HQQY=
DA-81/2016/A GSgY=
PCA-111/2016 GSwU=
PCD-101/2016 GSwM=
PCD-100/2016 GRAc=

When the spider runs, however, the summary details of the last record are repeated for every record from the same listing page. Please see the spider below; the offending code appears to be the first 10 lines of the parse method.

""" Run under bash with:
timenow=`date +%Y%m%d_%H%M%S`; scrapy runspider cbury_spider.py -o cbury-scrape-$timenow.csv
Problems? Interactively check Xpaths etc.:
scrapy shell "http://datrack.canterbury.nsw.gov.au/cgi/datrack.pl?search=search&sortfield=^metadata.date_lodged""""
import scrapy
from cbury_scrapy.items import DA

def td_text_after(label, response):
    """ retrieves text from first td following a td containing a label e.g.:"""
    return response.xpath("//*[contains(text(), '" + label + "')]/following-sibling::td//text()").extract_first()

class CburySpider(scrapy.Spider):
    # scrapy.Spider attributes
    name = "cbury"
    allowed_domains = ["datrack.canterbury.nsw.gov.au"]
    start_urls = ["http://datrack.canterbury.nsw.gov.au/cgi/datrack.pl?search=search&sortfield=^metadata.date_lodged",]
    # required for unicode character replacement of '$' and ',' in est_cost
    translation_table = dict.fromkeys(map(ord, '$,'), None)
    da = DA()
    da['lga'] = u"Canterbury"


    def parse(self, response):
        """ Retrieve DA no., URL and address for DA on summary list page """
        for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
            r = scrapy.Selector(text=row.extract(), type="html")
            self.da['da_no'] = r.xpath('//td[@class="datrack_danumber_cell"]//text()').extract_first()
            self.da['house_no'] = r.xpath('//td[@class="datrack_houseno_cell"]//text()').extract_first()
            self.da['street'] = r.xpath('//td[@class="datrack_street_cell"]//text()').extract_first()
            self.da['town'] = r.xpath('//td[@class="datrack_town_cell"]//text()').extract_first()
            self.da['url'] = r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()

            # then retrieve remaining DA details from the detail page
            yield scrapy.Request(self.da['url'], callback=self.parse_da_page)

        # follow next page link if one exists
        next_page = response.xpath("//*[contains(text(), 'Next')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(next_page, self.parse)


    def parse_da_page(self, response):
        """ Retrieve DA information from its detail page """
        labels = { 'date_lodged': 'Date Lodged:', 'desc_full': 'Description:',
                   'est_cost': 'Estimated Cost:', 'status': 'Status:',
                   'date_determined': 'Date Determined:', 'decision': 'Decision:',
                   'officer': 'Responsible Officer:' }

        # map DA fields with those in the following <td> elements on the page
        for i in labels:
            self.da[i] = td_text_after(labels[i], response)

        # convert est_cost text to int for easier sheet import "12,000" -> 12000
        if self.da['est_cost'] != None:
            self.da['est_cost'] = int(self.da['est_cost'].translate(self.translation_table))

        # Get people data from 'Names' table with 'Role' heading
        self.da['names'] = []
        for row in response.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
            da_name = {}
            da_name['role'] = row.xpath('normalize-space(./td[1])').extract_first()
            da_name['name_no'] = row.xpath('normalize-space(./td[2])').extract_first()
            da_name['full_name'] = row.xpath('normalize-space(./td[3])').extract_first()
            self.da['names'].append(da_name)

        yield self.da

Your help would be greatly appreciated.


1 Answer

Stack Overflow user

Accepted answer

Answered on 2016-05-06 08:15:57

Scrapy is asynchronous; once you submit a request, there is no guarantee when that request will be actioned. Because of this, your self.da is not reliable for passing data to parse_da_page. Instead, create da_items = DA() in your parse routine and pass it into the request as meta.

for row in response.xpath(...):
    da_items = DA()
    da_items['street'] = row.xpath(...)
    ...
    da_items['url'] = row.xpath(...)
    yield scrapy.Request(da_items['url'], callback=self.parse_da_page, meta=da_items)

Then in parse_da_page you can retrieve the values with response.meta['street'] and so on. Have a look at the docs here.
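
For illustration, here is a minimal sketch (not the answerer's exact code) of how parse_da_page could copy those listing-page values back off response.meta before filling in the detail fields; the field names are the ones used in the question's DA item, and the request is assumed to have been sent with meta=da_items as shown above:

def parse_da_page(self, response):
    """ Sketch: rebuild the item from the values carried in the request's meta """
    da = DA()
    # copy across the summary fields that parse() attached via meta=da_items
    for field in ('da_no', 'house_no', 'street', 'town', 'url'):
        if field in response.meta:
            da[field] = response.meta[field]
    # ... then scrape the detail-page fields into da as before and finally:
    yield da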

Also note that your line r = scrapy.Selector(text=row.extract(), type="html") is redundant; you can use the variable row directly, as in the example above.
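
As a rough sketch of that point (not the question's full field list), the listing-page loop could look like the following; when row is used directly, the XPath expressions need to be relative (.//td[...]) so they match within the current row rather than the whole page:

for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
    da_items = DA()
    # './/' keeps each query scoped to the current <tr> instead of the whole document
    da_items['da_no'] = row.xpath('.//td[@class="datrack_danumber_cell"]//text()').extract_first()
    da_items['url'] = row.xpath('.//td[@class="datrack_danumber_cell"]//@href').extract_first()
    yield scrapy.Request(da_items['url'], callback=self.parse_da_page, meta=da_items)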

Votes: 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/37054425
