首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >刮除蜘蛛,不包括所有请求的页面

刮除蜘蛛,不包括所有请求的页面
EN

Stack Overflow用户
提问于 2014-04-26 01:26:45
回答 3查看 1.4K关注 0票数 2

我为Yelp编写了一个Scrapy脚本,这在很大程度上是在工作。从本质上说,我可以提供一个Yelp页面列表,它应该返回所有页面的所有评论。到目前为止,脚本如下:

代码语言:javascript
复制
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import re

from yelp2.items import YelpReviewItem

RESTAURANTS = ['sixteen-chicago']

def createRestaurantPageLinks(self, response):
    reviewsPerPage = 40
    sel = Selector(response)
    totalReviews = int(sel.xpath('//div[@class="rating-info clearfix"]//span[@itemprop="reviewCount"]/text()').extract()[0].strip().split(' ')[0])
    pages = [Request(url=response.url + '?start=' + str(reviewsPerPage*(n+1)), callback=self.parse) for n in range(totalReviews/reviewsPerPage)]
    return pages

class Yelp2aSpider(Spider):
    name = "yelp2a"
    allowed_domains = ["yelp.com"]
    start_urls = ['http://www.yelp.com/biz/%s' % s for s in RESTAURANTS]

    def parse(self, response):
        requests = []

        sel = Selector(response)
        reviews = sel.xpath('//div[@class="review review-with-no-actions"]')
        items = []

        for review in reviews:
            item = YelpReviewItem()
                item['venueName'] = sel.xpath('//meta[@property="og:title"]/@content').extract()
                item['reviewer'] = review.xpath('.//li[@class="user-name"]/a/text()').extract()
                item['reviewerLoc'] = review.xpath('.//li[@class="user-location"]/b/text()').extract()
                item['rating'] = review.xpath('.//meta[@itemprop="ratingValue"]/@content').extract()
                item['reviewDate'] = review.xpath('.//meta[@itemprop="datePublished"]/@content').extract()
                item['reviewText'] = review.xpath('.//p[@itemprop="description"]/text()').extract()
                item['url'] = response.url
                items.append(item)
        return items

        if response.url.find('?start=') == -1:
            requests += createRestaurantPageLinks(self, response)
            return requests

但是,我遇到的问题是,除了第一页之外,这个特定的脚本会刮掉每个请求的评论的每一页。如果我注释掉最后一个" If“语句,它只会擦伤第一页。我想我只需要一个简单的“否则”命令,但我很困惑.我们非常感谢你的帮助!

编辑:,这是目前基于收到的援助的代码.

代码语言:javascript
复制
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import re

from yelp2.items import YelpReviewItem

RESTAURANTS = ['sixteen-chicago']

def createRestaurantPageLinks(self, response):
    reviewsPerPage = 40
    sel = Selector(response)
    totalReviews = int(sel.xpath('//div[@class="rating-info clearfix"]//span[@itemprop="reviewCount"]/text()').extract()[0].strip().split(' ')[0])
    pages = [Request(url=response.url + '?start=' + str(reviewsPerPage*(n+1)), callback=self.parse) for n in range(totalReviews/reviewsPerPage)]
    return pages

class Yelp2aSpider(Spider):
    name = "yelp2a"
    allowed_domains = ["yelp.com"]
    start_urls = ['http://www.yelp.com/biz/%s' % s for s in RESTAURANTS]

    def parse(self, response):
        requests = []

        sel = Selector(response)
        reviews = sel.xpath('//div[@class="review review-with-no-actions"]')
        items = []

        for review in reviews:
            item = YelpReviewItem()
            item['venueName'] = sel.xpath('//meta[@property="og:title"]/@content').extract()
            item['reviewer'] = review.xpath('.//li[@class="user-name"]/a/text()').extract()
            item['reviewerLoc'] = review.xpath('.//li[@class="user-location"]/b/text()').extract()
            item['rating'] = review.xpath('.//meta[@itemprop="ratingValue"]/@content').extract()
            item['reviewDate'] = review.xpath('.//meta[@itemprop="datePublished"]/@content').extract()
            item['reviewText'] = review.xpath('.//p[@itemprop="description"]/text()').extract()
            item['url'] = response.url
        yield item

        if response.url.find('?start=') == -1:
            requests += createRestaurantPageLinks(self, response)
            for request in requests:
                    yield request

正如下面的注释中所提到的,运行这段代码是为了爬行每一个想要的页面,但是它只返回每个页面的一个评论,而不是所有的。

我尝试将yield item更改为yield items,但是对于每个爬行的URL,都会返回ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.yelp.com/biz/[...]>的错误消息。

EN

回答 3

Stack Overflow用户

回答已采纳

发布于 2014-05-29 03:03:50

最后的答案确实在于一条yield行的缩进。这是代码,最终完成了我需要它做的事情。

代码语言:javascript
复制
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import re

from yelp2.items import YelpReviewItem

RESTAURANTS = ['sixteen-chicago']

def createRestaurantPageLinks(self, response):
    reviewsPerPage = 40
    sel = Selector(response)
    totalReviews = int(sel.xpath('//div[@class="rating-info clearfix"]//span[@itemprop="reviewCount"]/text()').extract()[0].strip().split(' ')[0])
    pages = [Request(url=response.url + '?start=' + str(reviewsPerPage*(n+1)), callback=self.parse) for n in range(totalReviews/reviewsPerPage)]
    return pages

class YelpXSpider(Spider):
    name = "yelpx"
    allowed_domains = ["yelp.com"]
    start_urls = ['http://www.yelp.com/biz/%s' % s for s in RESTAURANTS]

    def parse(self, response):
        requests = []

        sel = Selector(response)
        reviews = sel.xpath('//div[@class="review review-with-no-actions"]')
        items = []

        for review in reviews:
            item = YelpReviewItem()
            item['venueName'] = sel.xpath('//meta[@property="og:title"]/@content').extract()
            item['reviewer'] = review.xpath('.//li[@class="user-name"]/a/text()').extract()
            item['reviewerLoc'] = review.xpath('.//li[@class="user-location"]/b/text()').extract()
            item['rating'] = review.xpath('.//meta[@itemprop="ratingValue"]/@content').extract()
            item['reviewDate'] = review.xpath('.//meta[@itemprop="datePublished"]/@content').extract()
            item['reviewText'] = review.xpath('.//p[@itemprop="description"]/text()').extract()
            item['url'] = response.url
            yield item

        if response.url.find('?start=') == -1:
            requests += createRestaurantPageLinks(self, response)
            for request in requests:
                    yield request

感谢大家的帮助!

票数 0
EN

Stack Overflow用户

发布于 2014-04-26 01:50:09

您需要稍微重新组织这些方法。首先在parse()方法中解析餐厅页面。然后,返回对评审的请求,并在另一种方法中处理响应,例如parse_review()

代码语言:javascript
复制
import re

from scrapy.item import Item, Field
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from yelp2.items import YelpReviewItem


RESTAURANTS = ['sixteen-chicago']

class Yelp2aSpider(Spider):
    name = "yelp2a"
    allowed_domains = ["yelp.com"]
    start_urls = ['http://www.yelp.com/biz/%s' % s for s in RESTAURANTS]

    def parse(self, response):
        reviewsPerPage = 40
        sel = Selector(response)
        totalReviews = int(sel.xpath('//div[@class="rating-info clearfix"]//span[@itemprop="reviewCount"]/text()').extract()[0].strip().split(' ')[0])
        pages = [Request(url=response.url + '?start=' + str(reviewsPerPage*(n+1)), callback=self.parse_review) for n in range(totalReviews/reviewsPerPage)]
        return pages

    def parse_review(self, response):
        sel = Selector(response)
        reviews = sel.xpath('//div[@class="review review-with-no-actions"]')
        for review in reviews:
            item = YelpReviewItem()
            item['venueName'] = sel.xpath('//meta[@property="og:title"]/@content').extract()
            item['reviewer'] = review.xpath('.//li[@class="user-name"]/a/text()').extract()
            item['reviewerLoc'] = review.xpath('.//li[@class="user-location"]/b/text()').extract()
            item['rating'] = review.xpath('.//meta[@itemprop="ratingValue"]/@content').extract()
            item['reviewDate'] = review.xpath('.//meta[@itemprop="datePublished"]/@content').extract()
            item['reviewText'] = review.xpath('.//p[@itemprop="description"]/text()').extract()
            item['url'] = response.url
            yield item
票数 1
EN

Stack Overflow用户

发布于 2014-04-26 01:50:50

如果您在多个地方返回项/请求,则应该用return语句替换您的yield语句,这些语句会将您的函数转换为生成器,该生成器每次生成它时都会返回一个新元素(产生它),而不会退出该函数,直到它们全部返回。否则,就像现在的代码一样,您的函数将在第一个return之后退出,不会发送以下页面的请求。

编辑:更正-您应该一次生成一个项目/请求,因此:

替换

代码语言:javascript
复制
for review in reviews:
    item = ...
return items

使用

代码语言:javascript
复制
for review in reviews:
    item = ...
    yield item

并替换

代码语言:javascript
复制
return requests

使用

代码语言:javascript
复制
for request in requests:
    yield request
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/23305502

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档