Q: Scrapy-splash can't find the image source URL

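Stack Overflow user

Asked 2021-05-14 19:36:27

Answers: 3, Views: 193, Following: 0, Votes: 1

I'm trying to scrape a product page from ZARA, like this one: https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115

My scrapy-splash container is running. In the Scrapy shell, I fetch the page: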

fetch('http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115')
2021-05-14 14:30:42 [scrapy.core.engine] INFO: Spider opened
2021-05-14 14:30:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115> (referer: None)
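
A side note on the fetch call above: the target URL is passed unencoded, so the `&v2=1718115` part is parsed as a separate argument to Splash rather than as part of the ZARA URL. Here is a minimal sketch of building the request URL with the standard library, assuming the Splash instance at localhost:8050 from the question. The helper name `splash_render_url` and the `wait` value of 2.0 seconds are illustrative choices, not part of the original post.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SPLASH_ENDPOINT = "http://localhost:8050/render.html"  # endpoint used in the question


def splash_render_url(target_url, wait=2.0):
    """Build a render.html URL with the target URL percent-encoded.

    `wait` is Splash's argument for how long to wait after the page loads;
    2.0 seconds is an arbitrary value chosen for this sketch.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_ENDPOINT}?{query}"


target = (
    "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html"
    "?v1=108967877&v2=1718115"
)
render_url = splash_render_url(target)
print(render_url)

# Round-trip check: the `url` argument now carries the full target, including v2
received = parse_qs(urlsplit(render_url).query)["url"][0]
print(received == target)
```

In the Scrapy shell, the result can be passed straight to `fetch(render_url)`. Encoding alone will not necessarily make the real image URLs appear in `src`, but it does ensure Splash renders the exact page that was requested.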

So far everything works, and I can get the title and price. However, I also want the product's image URLs.

I tried:

response.css('img.media-image__image::attr(src)').getall()

But the result is:

['https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png']

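These are all the transparent placeholder image, not the real product images. The images display in my browser, and I can see them being requested in the network tab. Is this because they are loaded via AJAX requests? How can I solve this?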

python

web-scraping

scrapy

scrapy-splash

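3 Answers

Stack Overflow user

Accepted answer

Posted 2021-05-19 17:47:15

Credit goes to @samuelhogg for finding the JSON, but here is a sample spider showing how to get all the image URLs from the page. Note that you don't even need Splash here. I haven't tested it with Splash, but I think it should still work.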

from scrapy import Spider
import json


class Zara(Spider):
    name = "zara"
    start_urls = [
        "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115"
    ]
  
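    def parse(self, response):
        # Find the JSON-LD block identified by @samuelhogg
        data = response.css("script[type='application/ld+json']::text").get()
        if data is None:
            # .get() returns None when nothing matches; json.loads(None) would raise
            self.logger.warning("No JSON-LD block found on %s", response.url)
            return
        # Make a set of all the images in the JSON
        images = {image for item in json.loads(data) for image in item["image"]}
        # Do what you want with them!
        print(images)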
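
The spider above assumes the JSON-LD block is a list of objects whose `image` field is always a list. Here is a slightly more defensive sketch, in case the payload is a single object or `image` is a single string. The function name `extract_images` and the sample payload are illustrative and do not come from the page.

```python
import json


def extract_images(json_ld_text):
    """Return the unique image URLs in a JSON-LD string, in first-seen order."""
    data = json.loads(json_ld_text)
    # The payload may be a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    images = []
    for item in items:
        if not isinstance(item, dict):
            continue
        value = item.get("image")
        # "image" may be one URL string or a list of them; ignore anything else
        if isinstance(value, str):
            urls = [value]
        elif isinstance(value, list):
            urls = value
        else:
            urls = []
        for url in urls:
            if isinstance(url, str) and url not in images:
                images.append(url)
    return images


# Hypothetical payload shaped like the one on the product page
sample = json.dumps([
    {"@type": "Product", "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"]},
    {"@type": "Product", "image": "https://example.com/a.jpg"},
])
print(extract_images(sample))
```

Inside a spider, this could be called on each string returned by `response.css("script[type='application/ld+json']::text").getall()`, so that pages with more than one JSON-LD block are also covered.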

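Votes: 2

Stack Overflow user

Posted 2021-05-17 03:42:01

I only started looking into web scraping last week, so I'm not sure I can help, but I did find something.

A script near the top of the page source contains the following: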

_mkt_imageDir = /BASE_IMAGES_URL=(.*?);/.test(document.cookie) && RegExp.$1 || 'https://static.zara.net/photos/';

Further down:

"originalUrl":"/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115","imageBaseUrl":"https://static.zara.net/photos/"

Then all the images are listed here, in what looks like JavaScript:

[{"@context":"http://schema.org/","@type":"Product","sku":"108967877-046-1","name":"FITTED HOUNDSTOOTH BLAZER","mpn":"108967877-046-1","brand":"ZARA","description":"","image":["https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_1_1_1.jpg?ts=1620821843383","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_2_1_1.jpg?ts=1620821851988","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_2_2_1.jpg?ts=1620821839280","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_1_1.jpg?ts=1620655538200","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_2_1.jpg?ts=1620655535611","https://static.zara.net/photos///2021/I/0/1/p/7808/160/046/2/w/1920/7808160046_6_3_1.jpg?ts=1620656141718","https://static.zara.net/photos///contents/cm/w/1920/sustainability-extrainfo-label-JL78_0.jpg?ts=1602602200357"]
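
Because this list sits in the raw HTML, it can be read without rendering any JavaScript. Here is a minimal standard-library sketch, assuming the data lives in a `<script type="application/ld+json">` element (the selector used by the spider in the other answer). The class name `JsonLdCollector` and the sample HTML are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


# Hypothetical page: the <img> has only a placeholder, the real URLs are in JSON-LD
html = """<html><head>
<script type="application/ld+json">[{"@type": "Product", "name": "BLAZER",
 "image": ["https://example.com/1.jpg", "https://example.com/2.jpg"]}]</script>
</head><body><img src="transparent-background.png"></body></html>"""

collector = JsonLdCollector()
collector.feed(html)
products = json.loads(collector.blocks[0])
print(products[0]["image"])
```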

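I don't know how you would scrape them, but I'd be interested to hear when you figure it out.

Regards, Samuel

Votes: 2

Stack Overflow user

Posted 2021-05-17 04:26:20

It looks like the URLs are in a JSON file, and I believe you can scrape the URLs from it.

There is some information/code about scraping from JSON here.

Votes: 1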

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/67533691
