
Getting the same data on every page when crawling a website

Stack Overflow user
Asked on 2019-03-26 08:02:21

I am trying to crawl a web page to collect its reviews and ratings, but every page I request returns the same data in the output.

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
               'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))

Expected: crawl 1000 pages of the site https://www.fandango.com/aquaman-208499/movie-reviews

Actual output:

https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

1 Answer

Stack Overflow user

Accepted answer

Answered on 2019-03-27 11:55:53

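The reviews are populated dynamically with JavaScript, so the pn parameter does not change the HTML that Scrapy downloads. In cases like this, you have to inspect the requests the site makes (for example, in the browser's network tab) and call the underlying endpoint directly.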
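Before opening the network tab, it can help to confirm the symptom from the question: every page yields the same reviews. The following is a minimal sketch, not part of the original answer; the helper name pages_are_identical and the sample data are assumptions made for illustration. It fingerprints the reviews extracted from each page and reports whether they all match.

```python
import hashlib
import json


def pages_are_identical(reviews_by_page):
    """Return True if every page yielded exactly the same list of reviews.

    reviews_by_page maps a page number to the list of review strings
    extracted from that page (a hypothetical structure for illustration).
    """
    fingerprints = set()
    for reviews in reviews_by_page.values():
        # Serialize deterministically so equal lists hash to the same value
        blob = json.dumps(reviews, ensure_ascii=False).encode("utf-8")
        fingerprints.add(hashlib.sha256(blob).hexdigest())
    # Several pages but a single distinct fingerprint means duplicated content
    return len(reviews_by_page) > 1 and len(fingerprints) == 1


same = pages_are_identical({
    1: ["It was Awesome! Visually Stunning!", "It was fantastic five stars"],
    9: ["It was Awesome! Visually Stunning!", "It was fantastic five stars"],
})
print(same)
```

If this reports duplicates, the query parameter is probably handled client-side, and the data is likely coming from a separate request.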

The URL that serves the user reviews is:

https://www.fandango.com/napi/fanReviews/208499/1/5

It returns JSON containing 5 reviews.
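Reading that URL against the review page, the path appears to be the movie id, then the page number, then the number of reviews per page. The sketch below builds the endpoint URL and a matching Referer header for any page. The function name build_review_request, the page_size parameter, and the slug parameter are assumptions introduced here for illustration; only the URL shapes themselves come from this thread.

```python
def build_review_request(movie_id, page, page_size=5, slug="aquaman"):
    """Return (url, headers) for one page of fan reviews.

    Assumes the path layout /napi/fanReviews/<movie_id>/<page>/<page_size>,
    inferred from the single example URL in the answer.
    """
    url = "https://www.fandango.com/napi/fanReviews/{}/{}/{}".format(
        movie_id, page, page_size
    )
    # The Referer points at the human-facing review page for the same page number
    referer = "https://www.fandango.com/{}-{}/movie-reviews?pn={}".format(
        slug, movie_id, page
    )
    return url, {"referer": referer}


url, headers = build_review_request("208499", 1)
print(url)
print(headers["referer"])
```

Keeping the movie id and page number as parameters avoids hardcoding the same id in two places.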

Your spider could be rewritten like this:

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-{movie_id}/movie-reviews?pn={page}'.format(movie_id=movie_id, page=page)}
            # Path: movie id, page number, then what appears to be the number of reviews per page
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review

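Note that I also use yield instead of print to extract the items; that is how items are generated in Scrapy. You can run the spider like this to export the extracted items to a file: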

scrapy crawl rate -o outputfile.json
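Once the export has run, the file can be read back with the standard library. The sketch below assumes that the command above wrote a single JSON array to outputfile.json; the helper name load_items and the sample fields are made up for the demonstration, since the real field names depend on what the endpoint returns.

```python
import json
import os
import tempfile


def load_items(path):
    """Load the JSON array of items written by the -o export."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


# Stand-in for outputfile.json; the field names here are placeholders
sample = [{"id": 1, "text": "placeholder review"}, {"id": 2, "text": "another"}]
path = os.path.join(tempfile.mkdtemp(), "outputfile.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)

items = load_items(path)
print(len(items), "items loaded")
```

One thing to watch for: -o appends to an existing file, so running the crawl twice against the same .json file can leave it unparseable. Deleting the file first, or using -O on Scrapy versions that support it, avoids that.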

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/55352327
