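I am trying to get the latest reviews from the Google Play Store. To do that, I am following the approach from this question: here
The method given in the answer linked above works fine in the Scrapy shell, but when I try it in my crawler it is completely ignored.
Code snippet: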
import re
import sys
import time
import urllib
import urlparse

from scrapy import Spider
from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

from play.items import PlayApp


class PlaySpider(CrawlSpider):
    name = "play"
    allowed_domains = ["play.google.com"]
    start_urls = [
        "https://play.google.com/store/apps"
    ]
    rules = (
        Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parseCategory', follow=True),
    )

    def parseCategory(self, response):
        """
        gets categories from store home page call parseLinks for each category
        """
        #something here......
        yield Request(categoryapps, callback=self.parseLinks)

    def parseLinks(self, response):
        '''
        get all the links from the category page and then
        passes individual links to parseApp function.
        '''
        #something here
        yield Request(link, callback=self.parseApp)

    def parseApp(self, response):
        '''
        parses apps page to get info about the app
        '''
        #application page parsing ......
        frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        url = "https://play.google.com/store/getreviews"
        yield FormRequest(url, callback=self.parse_data, formdata=frmdata)
        yield app

    def parse_data(self, response):
        # do stuff with data...
        print '\n\n---------------I am here------------------\n\n'

The function parse_data is never called. I asked about this on the #scrapy IRC channel and in a few other places, but got no help. Please help me with this.
Here is the debug output from the terminal:
DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)

So the POST requests are definitely being sent, but the callback method is not being called.
Answered on 2015-06-03 09:52:01
It looks like you are not changing the id in the form data.
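To make that point concrete: each app needs its own `id` value in the form data instead of the hard-coded `com.supercell.boombeach` from the question. Below is a minimal sketch of deriving it from a details URL using only the standard library. The helpers `package_id` and `review_form_data` are hypothetical names introduced for illustration, not part of Scrapy, and the sketch is written for Python 3 (`urllib.parse`), whereas the spider in the question is Python 2 (`urlparse`).

```python
from urllib.parse import urlparse, parse_qs


def package_id(details_url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(details_url).query)
    values = query.get("id")
    return values[0] if values else None


def review_form_data(details_url, page=0):
    """Build the POST fields used in the question, with a per-app id."""
    return {
        "id": package_id(details_url),
        "reviewType": "0",
        "reviewSortOrder": "0",
        "pageNum": str(page),
    }


url = "https://play.google.com/store/apps/details?id=com.merriamwebster"
print(review_form_data(url))

# Pitfall: str.strip() removes a *set of characters* from both ends,
# not a prefix, so it can also eat letters of the package name itself.
print("/store/apps/details?id=com.merriamwebster".strip("/store/apps/details?id="))
```

Parsing the query string is a little more verbose than string slicing, but it keeps working when the URL carries extra parameters or when the package name ends in letters that also occur in the path prefix.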
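import json
import re

from scrapy.http import FormRequest
from scrapy.selector import Selector


def parseApp(self, response):
    # Collect the unique app links found on the page.
    apps = set(response.xpath('//a[@class="card-click-target"]/@href').extract())
    url = "https://play.google.com/store/getreviews"
    for app in apps:
        # Take everything after "id="; str.strip() would remove characters, not a prefix.
        _id = app.split('id=')[-1]
        form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        # To throttle requests, set DOWNLOAD_DELAY in the settings;
        # calling sleep() here would block Scrapy's event loop.
        yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)


def parse_data(self, response):
    # The callback name must match the one passed to FormRequest above.
    response_data = re.findall(r"\[\[.*", response.body)
    if response_data:
        try:
            text = json.loads(response_data[0] + ']')
            sell = Selector(text=text[0][2])
        except (ValueError, IndexError):
            return
        # do whatever you want to extract using sell.xpath('YOUR_XPATH_HERE')

After cleaning up the data, you will get something like this: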
<div class="single-review">
  <a href="/store/people/details?id=106726831005267540508">
    <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48">
  </a>
  <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9lw">
    <div class="review-info">
      <span class="author-name">
        <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
      </span>
      <span class="review-date">3 June 2015</span>
      <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a>
      <div class="review-source" style="display:none">
      </div>
      <div class="review-info-star-rating">
        <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars">
          <div class="current-rating" style="width: 100%;">
          </div>
        </div>
      </div>
    </div>
    <div class="rate-review-wrapper">
      <div class="play-button icon-button small rate-review" title="Spam" data-rating="SPAM">
        <div class="icon spam-flag"></div>
      </div>
      <div class="play-button icon-button small rate-review" title="Helpful" data-rating="HELPFUL">
        <div class="icon thumbs-up"></div>
      </div>
      <div class="play-button icon-button small rate-review" title="Unhelpful" data-rating="UNHELPFUL">
        <div class="icon thumbs-down"></div>
      </div>
    </div>
  </div>
  <div class="review-body">
    <span class="review-title">Team BOOM BEACH</span>
    Amazing game I can defeat hammerman
    <div class="review-link" style="display:none">
      <a class="id-no-nav play-button tiny" href="#" target="_blank">Full Review</a>
    </div>
  </div>
</div>

https://stackoverflow.com/questions/30614560
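The two steps in the answer (pull the JSON array out of the response body, then read fields from the HTML stored inside it) can be sketched with only the standard library, which makes them easy to try outside a spider. This is a sketch under assumptions, not the answer's own implementation: `SAMPLE_BODY` is a hypothetical payload shaped the way the answer's regex implies (a non-JSON prefix line, then a nested array whose third element is the review HTML), `extract_review_html` and `ReviewParser` are illustrative names, and the parser handles a single review block. The real response may differ. It is written for Python 3.

```python
import json
import re
from html.parser import HTMLParser

FIELDS = {"author-name", "review-date", "review-title", "review-body"}
IGNORED = {"review-link"}          # hidden "Full Review" link text
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}

# Hypothetical sample: trimmed-down version of the review markup.
SAMPLE_HTML = (
    '<div class="single-review">'
    '<span class="author-name"><a href="#">Lorence Gerona</a></span>'
    '<span class="review-date">3 June 2015</span>'
    '<div class="tiny-star star-rating-non-editable-container" '
    'aria-label="Rated 5 stars out of five stars"></div>'
    '<div class="review-body">'
    '<span class="review-title">Team BOOM BEACH</span>'
    ' Amazing game I can defeat hammerman '
    '<div class="review-link" style="display:none"><a href="#">Full Review</a></div>'
    '</div></div>'
)
# Assumed payload shape: prefix line, then the array with its final "]" on a new line.
SAMPLE_BODY = ")]}'\n\n" + json.dumps([["ecr", 1, SAMPLE_HTML, 2]])[:-1] + "\n]"


def extract_review_html(body):
    """Return the HTML embedded in a getreviews-style body, or None."""
    match = re.search(r"\[\[.*", body)      # '.' stops at the newline
    if not match:
        return None
    try:
        payload = json.loads(match.group(0) + "]")   # restore the closing bracket
    except ValueError:
        return None
    return payload[0][2]


class ReviewParser(HTMLParser):
    """Collect text by the nearest enclosing class of interest."""

    def __init__(self):
        super().__init__()
        self.stack = []     # class sets of the currently open elements
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        self.stack.append(set((attrs.get("class") or "").split()))
        match = re.search(r"Rated (\d+) stars", attrs.get("aria-label") or "")
        if match:
            self.fields["rating"] = int(match.group(1))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for classes in reversed(self.stack):
            if classes & IGNORED:
                return
            wanted = classes & FIELDS
            if wanted:
                name = wanted.pop()
                self.fields[name] = (self.fields.get(name, "") + " " + text).strip()
                return


parser = ReviewParser()
parser.feed(extract_review_html(SAMPLE_BODY))
print(parser.fields)
```

Inside a spider, the Selector and XPath approach from the answer is the more natural fit; the standard-library parser is used here only so the sketch runs without Scrapy installed.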