我正在尝试抓取一个网页,以获得该网页的评论和评级。但是我得到的数据和输出数据是一样的。
import scrapy
import json
from scrapy.spiders import Spider
class RatingSpider(Spider):
name = "rate"
def start_requests(self):
for i in range(1, 10):
url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
print(url)
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))预期:爬行1000页的网站https://www.fandango.com/aquaman-208499/movie-reviews
实际产出:
https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}发布于 2019-03-27 11:55:53
评论是使用JavaScript动态填充的。在这样的情况下,你必须检查网站的请求。
获得用户评论的URL如下:
它返回一个有5个评论的json。
你的蜘蛛可以这样重写:
import scrapy
import json
from scrapy.spiders import Spider
class RatingSpider(Spider):
name = "rate"
def start_requests(self):
movie_id = "208499"
for page in range(1, 10):
# You have to pass the referer, otherwise the site returns a 403 error
headers = {'referer': 'https://www.fandango.com/aquaman-208499/movie-reviews?pn={page}'.format(page=page)}
url = "https://www.fandango.com/napi/fanReviews/208499/{page}/5".format(page=page)
yield scrapy.Request(url=url, callback=self.parse, headers=headers)
def parse(self, response):
data = json.loads(response.text)
for review in data['data']:
yield review请注意,我也使用产而不是打印来提取项目,这就是Scrapy项目生成的方式。您可以像这样运行这个蜘蛛来将提取的项导出到一个文件中:
scrapy crawl rate -o outputfile.json
https://stackoverflow.com/questions/55352327
复制相似问题