Crawling dynamic content with Scrapy

Stack Overflow user
Asked 2015-06-03 08:20:38
Answers: 1 · Views: 1.3K · Followers: 0 · Votes: 3

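I am trying to get the latest reviews from the Google Play Store. To do that, I am following the approach from this question: here

The method given in the answer linked above works fine in the Scrapy shell, but when I try it in my crawler it is completely ignored.

Code snippet: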

import re
import sys
import time
import urllib
import urlparse

from scrapy import Spider
from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

from play.items import PlayApp

class PlaySpider(CrawlSpider):
    name = "play"
    allowed_domains = ["play.google.com"]
    start_urls = [
            "https://play.google.com/store/apps"
        ]

    rules = (
        Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parseCategory',follow=True),
    )

    def parseCategory(self, response):
        """
            gets categories from store home page call parseLinks for each category
        """
        #something here......
        yield Request(categoryapps, callback=self.parseLinks)

    def parseLinks(self, response):

        '''
        get all the links from the category page and then 
        passes individual links to parseApp function.
        '''    
        #something here

        yield Request(link, callback=self.parseApp)

    def parseApp(self, response):

        '''
        parses apps page to get info about the app
        '''

        #application page parsing ......        

        frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}
        url = "https://play.google.com/store/getreviews"
        yield FormRequest(url, callback=self.parse_data, formdata=frmdata)

        yield app

    def parse_data(self, response):
        # do stuff with data...
        print '\n\n---------------I am here------------------\n\n'

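The parse_data function is never called. I asked about this on the #scrapy IRC channel and in a few other places, but got no help. Please help me with this.

This is the debug output on the terminal: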

DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)

So the POST requests do get sent, but the callback method is never called.


1 Answer

Stack Overflow user

Answered 2015-06-03 09:52:01

It seems you are not changing the id in the form data:

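import json
import re

from scrapy.http import FormRequest
from scrapy.selector import Selector

def parseApp(self, response):
    # Collect the unique app links on the page; each href carries its own package id.
    apps = set(response.xpath('//a[@class="card-click-target"]/@href').extract())
    url = "https://play.google.com/store/getreviews"
    for app in apps:
        # str.strip() removes a set of characters, not a prefix, so split on "id=" instead.
        _id = app.split('id=', 1)[-1]
        form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        # Throttle with the DOWNLOAD_DELAY setting; sleep() would block Scrapy's event loop.
        yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

def parse_data(self, response):
    # The JSON array starts at the first "[[" in the body.
    # response.text exists on text responses in newer Scrapy; older releases used response.body.
    response_data = re.findall(r"\[\[.*", response.text)
    if response_data:
        try:
            text = json.loads(response_data[0] + ']')
            sell = Selector(text=text[0][2])
        except (ValueError, IndexError, TypeError):
            return
        # do whatever you want to extract using sell.xpath('YOUR_XPATH_HERE')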
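
The two steps in the snippet above, taking the package id from each link and decoding the response body, can be tried outside Scrapy. What follows is only a minimal sketch using the standard library. The helper names `package_id` and `reviews_html` are hypothetical, and `SAMPLE_BODY` is an invented payload whose shape is inferred from the parsing in the answer (append `]`, then read `[0][2]`), not taken from any Google documentation.

```python
import json
import re
from urllib.parse import parse_qs, urlparse


def package_id(href):
    """Return the value of the ``id`` query parameter, or None if absent."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


def reviews_html(body):
    """Decode a getreviews-style body and return the embedded HTML, or None."""
    # Take the first line that starts the JSON array ("." stops at the newline).
    match = re.search(r"\[\[.*", body)
    if match is None:
        return None
    try:
        # Re-close the outer array, which ends on a later line of the body.
        payload = json.loads(match.group(0) + "]")
        return payload[0][2]
    except (ValueError, IndexError, TypeError):
        return None


# Hypothetical body: a junk prefix, then an array whose third item is HTML.
SAMPLE_BODY = ')]}\'\n\n[["x",1,"<div class=\\"single-review\\">...</div>",2]\n]'

href = "/store/apps/details?id=an.HindiTranslate"
print(package_id(href))                      # id read from the query string
print(href.strip("/store/apps/details?id="))  # strip() also eats id characters
print(reviews_html(SAMPLE_BODY))
```

Reading the query string avoids the pitfall of `str.strip()`, which treats its argument as a set of characters and can therefore trim letters off both ends of the id itself.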

After cleaning up the data, you get something like this:

<div class="single-review">
    <a href="/store/people/details?id=106726831005267540508">
        <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48">
    </a>
    <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9lw">
        <div class="review-info">
            <span class="author-name">
                <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
            </span>
            <span class="review-date">3 June 2015</span>
            <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&amp;reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a> <div class="review-source" style="display:none">

        </div>
        <div class="review-info-star-rating">
            <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars">
                <div class="current-rating" style="width: 100%;">

                </div>
            </div>
        </div>
    </div>
    <div class="rate-review-wrapper">
        <div class="play-button icon-button small rate-review" title="Spam" data-rating="SPAM">
            <div class="icon spam-flag"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Helpful" data-rating="HELPFUL">
            <div class="icon thumbs-up"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Unhelpful" data-rating="UNHELPFUL"> <div class="icon thumbs-down"></div>
    </div>
</div>
</div>
<div class="review-body">
<span class="review-title">Team BOOM BEACH</span>
Amazing game I can defeat hammerman
<div class="review-link" style="display:none">
    <a class="id-no-nav play-button tiny" href="#" target="_blank">Full Review</a>
</div>
</div>
</div>
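
To pull individual fields out of markup like the above without Scrapy, one option is the standard library's `html.parser`. This is a hedged sketch that swaps that module in for the `Selector` used in the answer. The class `ReviewParser`, the function `extract_review`, and the shortened `SAMPLE_HTML` are all hypothetical names for illustration, and the class names it looks for are the ones visible in the sample review.

```python
from html.parser import HTMLParser

WANTED = ("author-name", "review-date", "review-title")
VOID = {"img", "br", "hr", "input", "meta", "link"}  # tags with no end tag


class ReviewParser(HTMLParser):
    """Collect the text inside elements carrying one of the WANTED classes."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._field = None  # class name currently being collected
        self._depth = 0     # open-element depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._field is not None:
            self._depth += 1  # nested element inside a wanted one
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in WANTED:
            if name in classes:
                self._field, self._depth = name, 1
                break

    def handle_endtag(self, tag):
        if tag in VOID or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None  # left the wanted element

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field] = self.fields.get(self._field, "") + data


def extract_review(html):
    """Return a dict mapping each wanted class name to its stripped text."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.fields.items()}


# Shortened excerpt of the review markup shown above.
SAMPLE_HTML = """
<span class="author-name">
    <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
</span>
<span class="review-date">3 June 2015</span>
<div class="review-body">
<span class="review-title">Team BOOM BEACH</span>
Amazing game I can defeat hammerman
</div>
"""

print(extract_review(SAMPLE_HTML))
```

Inside a spider, the same fields could be read from the `Selector` built in the answer, for example with `sell.xpath('//span[@class="review-date"]/text()').extract()`.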
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/30614560
