I am running Scrapy from a Python script.
I was told that in Scrapy, responses are built in parse() and processed further in pipelines.py.
This is how my framework is set up so far:
Python script

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def script(self):
        process = CrawlerProcess(get_project_settings())
        response = process.crawl('pitchfork_albums', domain='pitchfork.com')
        process.start()  # the script will block here until the crawling is finished

Spider
    import scrapy

    from blogs.items import PitchforkItem

    class PitchforkAlbums(scrapy.Spider):
        name = "pitchfork_albums"
        allowed_domains = ["pitchfork.com"]
        # creates objects for each URL listed here
        start_urls = [
            "http://pitchfork.com/reviews/best/albums/?page=1",
            "http://pitchfork.com/reviews/best/albums/?page=2",
            "http://pitchfork.com/reviews/best/albums/?page=3"
        ]

        def parse(self, response):
            for sel in response.xpath('//div[@class="album-artist"]'):
                item = PitchforkItem()
                item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
                item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()
            yield item

items.py
    import scrapy

    class PitchforkItem(scrapy.Item):
        artist = scrapy.Field()
        album = scrapy.Field()

settings.py
    ITEM_PIPELINES = {
        'blogs.pipelines.PitchforkPipeline': 300,
    }

pipelines.py
    import json

    class PitchforkPipeline(object):
        def __init__(self):
            self.file = open('tracks.jl', 'wb')

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            for i in item:
                return i['album'][0]

If I simply return item in pipelines.py, I get data like this (one response per html page):
{'album': [u'Sirens',
u'I Had a Dream That You Were Mine',
u'Sunergy',
u'Skeleton Tree',
u'My Woman',
u'JEFFERY',
u'Blonde / Endless',
u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
u'HEAVN',
u'Blank Face LP',
u'blackSUMMERS\u2019night',
u'Wildflower',
u'Freetown Sound',
u'Trans Day of Revenge',
u'Puberty 2',
u'Light Upon the Lake',
u'iiiDrops',
u'Teens of Denial',
u'Coloring Book',
u'A Moon Shaped Pool',
u'The Colour in Anything',
u'Paradise',
u'HOPELESSNESS',
u'Lemonade'],
'artist': [u'Nicolas Jaar',
u'Hamilton Leithauser',
u'Rostam',
u'Kaitlyn Aurelia Smith',
u'Suzanne Ciani',
u'Nick Cave & the Bad Seeds',
u'Angel Olsen',
u'Young Thug',
u'Frank Ocean',
u'Elza Soares',
u'Jamila Woods',
u'Schoolboy Q',
u'Maxwell',
u'The Avalanches',
u'Blood Orange',
u'G.L.O.S.S.',
u'Mitski',
u'Whitney',
u'Joey Purp',
u'Car Seat Headrest',
u'Chance the Rapper',
u'Radiohead',
u'James Blake',
u'White Lung',
u'ANOHNI',
 u'Beyonc\xe9']}

What I would like to do in pipelines.py is to be able to get the individual songs for each item, like this:
[u'Sirens']

Posted on 2016-09-29 19:33:43
I suggest that you build well-structured items in the spider. In the Scrapy workflow, the spider is for building well-formed items, e.g. parsing the html and populating item instances, while pipelines are for operating on items, e.g. filtering items and storing them.
For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the html you should build items of that shape, rather than cramming everything into a single item.
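That division of labour can be sketched like this. This is a minimal, hypothetical pipeline (the class name AlbumWriterPipeline is made up; open_spider/close_spider are the standard Scrapy pipeline hooks), assuming the spider yields one dict-like item per album:

```python
import json

# Hypothetical pipeline sketch: the spider builds the items, this
# pipeline only stores them, one JSON line per album item.
class AlbumWriterPipeline(object):
    def open_spider(self, spider):
        # Scrapy calls this once when the spider opens
        self.file = open('tracks.jl', 'w')

    def close_spider(self, spider):
        # and this once when the spider closes
        self.file.close()

    def process_item(self, item, spider):
        # each item is a single album, so no loop over fields is needed
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # return the item so later pipelines still see it
```

Because Scrapy calls open_spider/close_spider around the crawl, the file handle does not need to live in __init__.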
So in your spider.py, in the parse function, you should put the yield item statement inside the for loop rather than outside it, so that one item is generated per album. Also use relative XPaths inside the loop (.// instead of //, and ./ instead of /) so each selection is scoped to sel, and use extract_first() so the album title is a scalar instead of a list:

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('.//ul[@class="artist-list"]/li/text()').extract_first()
            item['album'] = sel.xpath('.//h2[@class="title"]/text()').extract_first()
            yield item

Hope this helps.
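To see why moving yield inside the loop matters, here is a plain-Python stand-in (no Scrapy required; the row dicts below are hypothetical stand-ins for the per-album selectors that response.xpath(...) would return):

```python
# Sample data standing in for two album divs on one page.
albums_on_page = [
    {'artist': u'Nicolas Jaar', 'album': u'Sirens'},
    {'artist': u'Hamilton Leithauser', 'album': u'I Had a Dream That You Were Mine'},
]

def parse_aggregated(rows):
    # original behaviour: yield outside the loop -> one item per page,
    # with every artist and album lumped together in lists
    item = {'artist': [], 'album': []}
    for row in rows:
        item['artist'].append(row['artist'])
        item['album'].append(row['album'])
    yield item

def parse_per_album(rows):
    # suggested behaviour: yield inside the loop -> one item per album
    for row in rows:
        yield {'artist': row['artist'], 'album': row['album']}
```

list(parse_aggregated(albums_on_page)) gives a single lumped item, while list(parse_per_album(albums_on_page)) gives one clean item per album, which is exactly what the pipeline wants to receive.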
https://stackoverflow.com/questions/39778086