Question: Getting the same data on every page when crawling a website

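Stack Overflow user

Asked 2019-03-26 08:02:21

Answers: 1, Views: 42, Followers: 0, Votes: 0

I am trying to scrape a web page to collect its reviews and ratings, but every page I request returns the same data in the output.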

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
               'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))

Expected: crawl the 1000 pages of https://www.fandango.com/aquaman-208499/movie-reviews

Actual output:

https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

scrapy

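1 Answer

Stack Overflow user

Accepted answer

Answered 2019-03-27 11:55:53

The reviews are populated dynamically with JavaScript, so the HTML returned for each `pn` value contains the same initial content. In cases like this you have to inspect the requests the site itself makes.

The URL that returns the user reviews is: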

https://www.fandango.com/napi/fanReviews/208499/1/5

It returns JSON containing 5 reviews.
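As a rough sketch of how that URL could be built per page, the helper below assembles the endpoint and a matching referer header. The function name `build_review_request` and its parameters are hypothetical, and the path layout `/napi/fanReviews/<movie_id>/<page>/<page_size>` is only an assumption inferred from the single example URL above. The referer is included because the spider in this answer notes that the site responds with 403 without it.

```python
def build_review_request(movie_slug, movie_id, page, page_size=5):
    """Return (url, headers) for one page of fan reviews.

    Assumption: the path segments are <movie_id>/<page>/<page_size>,
    inferred from the example URL .../napi/fanReviews/208499/1/5.
    """
    url = "https://www.fandango.com/napi/fanReviews/{}/{}/{}".format(
        movie_id, page, page_size
    )
    # Referer points at the human-facing review page for the same page number.
    referer = "https://www.fandango.com/{}-{}/movie-reviews?pn={}".format(
        movie_slug, movie_id, page
    )
    return url, {"referer": referer}


url, headers = build_review_request("aquaman", "208499", 1)
print(url)
print(headers["referer"])
```

Keeping the URL construction in one place makes it easier to change the movie or page size later without touching the spider logic.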

Your spider could be rewritten like this:

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
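            headers = {'referer': 'https://www.fandango.com/aquaman-{movie_id}/movie-reviews?pn={page}'.format(movie_id=movie_id, page=page)}
            # Request the JSON endpoint instead of the HTML page; the trailing 5 matches the example URL above
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)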
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review

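Note that I also use yield instead of print to emit the items, which is how items are produced in Scrapy. You can run the spider like this to export the extracted items to a file: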

scrapy crawl rate -o outputfile.json
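To illustrate the yield-versus-print point without running Scrapy, here is a minimal standalone sketch of the same parsing step as a generator. The name `iter_reviews` is hypothetical, the top-level `data` key is taken from the spider above, and the sample payload with its `id` and `text` fields is made up for illustration, so the real response may look different.

```python
import json


def iter_reviews(body_text):
    """Yield one review dict at a time from a JSON response body."""
    payload = json.loads(body_text)
    # Fall back to an empty list when "data" is missing or null.
    for review in payload.get("data") or []:
        yield review


# Made-up payload for illustration only; real field names may differ.
sample = json.dumps({"data": [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]})
reviews = list(iter_reviews(sample))
print(len(reviews))
```

Because the function yields each review rather than printing it, the caller (here `list`, in Scrapy the engine and its feed exporter) receives the items and can write them to a file.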

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/55352327
