
Scrapy callback function returns the same results multiple times

Stack Overflow user
Asked on 2016-03-10 01:24:08
Answers 1 · Views 263 · Followers 0 · Votes 1

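I'm new to Scrapy and I can't get the callback function to work. I manage to collect all the URLs and follow them in the callback, but in the results some items appear multiple times and many are missing. What is going wrong?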

import scrapy

from kexcrawler.items import KexcrawlerItem


class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all']

    def parse(self, response):
        # Follow the link of every entry on the result list page.
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Extract the title text fragments from the record page.
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        yield item

Here are the first few lines of the results:

{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]},
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},

1 Answer

Stack Overflow user

Posted on 2016-03-11 07:56:53

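I tried to reproduce your error but could not. All of the URLs were different. I logged each item at the INFO level, hid everything below that level, and found that every report was unique as well. I did have to unindent the yield call as posted, because it raised an error, and I defined your item class with a single field.

If you copied and pasted the output directly from the terminal, I assume it comes from print rather than from the log, which makes me think you may have several print calls that are being invoked at different times. Try writing the items to a file somewhere and check whether there really are duplicates.

To test whether the URLs were unique, I extracted the elements from the XPath into a list named elem, and then:

    print(len(elem))
    b = set()
    for e in elem:
        b.add(e)
    print(len(b))

You could try creating a global list of items and adding a spider_closed function that runs when the spider closes (in Scrapy such a function has to be connected to the spider_closed signal; a spider method named closed is called automatically), and then doing the same check on that list. A set contains only unique elements, so if the two counts differ, you really are creating duplicates.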
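The set-based check described above can also report which values repeat and how often. The following is only a minimal sketch in plain Python, not code from the answer: the helper name duplicate_counts is hypothetical, and the sample list is a handful of "report" values copied from the output in the question.

```python
from collections import Counter


def duplicate_counts(values):
    """Return {value: count} for every value that occurs more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# A few "report" values copied from the output shown in the question.
reports = [
    ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"],
    ["RNA editing of non-coding RNA and its role in gene regulation"],
    ["Security monitor inlining and certification for multithreaded Java"],
    ["Security monitor inlining and certification for multithreaded Java"],
]

# Lists are not hashable, so join each list of text fragments into one string.
titles = ["".join(parts) for parts in reports]

print(len(titles))               # total number of items
print(len(set(titles)))          # number of distinct items
print(duplicate_counts(titles))  # the values that repeat, with their counts
```

If the two printed counts differ, the third line shows which titles were emitted more than once. The same helper could be applied to the list of extracted URLs (elem in the answer) before any requests are made, which would help tell duplicate links on the result page apart from duplicate items produced later.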

Votes: -1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/35898421
