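I want to scrape the JavaScript list in the "sizes" section of this URL:
us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119
What I want is to get the sizes that are in stock, returned as a list. How can I do that?
Here is my full code: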
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        # sizes = ??
        yield {
            'name': name,
            'price': price,
            'sizes': sizes
        }

Thanks
Posted on 2017-03-06 06:02:54
Here is code that extracts the sizes that are in stock.
import scrapy


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        # Every <option> of the size drop-down.
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
        for s in sizes:
            # Take the option text only when the option is not flagged as out of stock.
            size = s.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract_first('').strip()
            yield {'Size': size}

The result is:
M 4 / W 5.5
M 4.5 / W 6
M 6.5 / W 8
M 7 / W 8.5
M 7.5 / W 9
M 8 / W 9.5
M 8.5 / W 10
M 9 / W 10.5
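Inside the for loop, writing it like this extracts every size, whether or not it is in stock:

size = s.xpath('text()').extract_first('').strip()

But if you only want the sizes that are in stock, note that the out-of-stock options are marked with the class "exp-pdp-size-not-in-stock selectBox-disabled", so you have to exclude that class by adding this predicate:

[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]

I tested it on other shoe pages and it works there too.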
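The question asked for the in-stock sizes as a single list. The filtering idea can be tried offline with a minimal sketch like the one below. It is only an illustration: it uses the standard library's ElementTree in place of Scrapy's selectors, the `SAMPLE` markup is a made-up fragment (the real page may differ), and `in_stock_sizes` is a hypothetical helper name. ElementTree needs well-formed XML, so for real pages the Scrapy XPath above remains the practical route.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the size drop-down; the real markup may differ.
SAMPLE = """<select class="exp-pdp-size-dropdown">
  <option class="exp-pdp-size-not-in-stock selectBox-disabled"> M 3.5 / W 5 </option>
  <option> M 4 / W 5.5 </option>
  <option> M 4.5 / W 6 </option>
</select>"""

# Class token that flags an out-of-stock option (taken from the answer above).
OUT_OF_STOCK = "exp-pdp-size-not-in-stock"


def in_stock_sizes(markup):
    """Return the text of every <option> that is not flagged as out of stock."""
    root = ET.fromstring(markup)
    sizes = []
    for option in root.iter("option"):
        # Compare class tokens so extra classes on the option do not matter.
        if OUT_OF_STOCK in option.get("class", "").split():
            continue
        text = (option.text or "").strip()
        if text:  # skip options without visible text
            sizes.append(text)
    return sizes


print(in_stock_sizes(SAMPLE))
```

Unlike the spider above, which yields one item per option (including an empty string for each excluded option), this sketch drops the empty entries and returns one list.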
Posted on 2017-03-04 15:16:13
The sizes are loaded by an AJAX call.
So you have to make a second request, to that AJAX endpoint, in order to scrape the sizes.
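Once that second response arrives, the sizes have to be dug out of the JSON body. A minimal sketch of that step, using only the standard library, might look like this. The key path (`response` → `pdpData` → `skuContainer` → `productSkus` → `displaySize`) is the one used in this answer's spider; `SAMPLE_BODY` is a made-up payload and `extract_display_sizes` is a hypothetical helper name.

```python
import json

# Made-up payload that only imitates the key path used in this answer.
SAMPLE_BODY = (
    b'{"response": {"pdpData": {"skuContainer": {"productSkus": ['
    b'{"displaySize": "8"}, {"displaySize": "8.5"}]}}}}'
)


def extract_display_sizes(body):
    """Return the displaySize of every SKU, or [] if the key path is missing."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    payload = json.loads(body)
    skus = (
        payload.get("response", {})
        .get("pdpData", {})
        .get("skuContainer", {})
        .get("productSkus", [])
    )
    return [sku["displaySize"] for sku in skus if "displaySize" in sku]


print(extract_display_sizes(SAMPLE_BODY))
```

Using `.get` with defaults means an unexpected response shape yields an empty list rather than a `KeyError`, which is easier to spot in the scraped output.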
Here is the complete code. (I have not run it on my side, but I am confident it works.)
# -*- coding: utf-8 -*-
import json

from scrapy import Spider
from scrapy.http import Request


class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        data = {}
        data['name'] = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        data['price'] = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        # The sizes come from a separate AJAX endpoint, so request it and
        # pass the fields collected so far along in meta.
        sizes_url = "http://store.nike.com/html-services/templateData/pdpData?action=getPage&path=%2Fus%2Fen_us%2Fpd%2Fmagista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat%2Fpid-11229710%2Fpgid-11918119&productId=11229710&productGroupId=11918119&catalogId=100701&cache=true&country=US&lang_locale=en_US"
        yield Request(url=sizes_url, callback=self.parse_sizes, meta={'data': data})

    def parse_sizes(self, response):
        # Callback for the AJAX request; it must be named parse_sizes to match
        # the callback passed above.
        resp = json.loads(response.body)
        data = response.meta['data']
        sizes = resp['response']['pdpData']['skuContainer']['productSkus']
        sizesArray = []
        for a in sizes:
            sizesArray.append(a["displaySize"])
        yield {
            'name': data['name'],
            'price': data['price'],
            'sizes': sizesArray
        }

Note:
The sizes_url is different for each product, so you will have to spend some time working out which parameters it uses.
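As a starting point for that investigation, here is a hedged sketch that rebuilds the sizes_url from a product URL. It rests on assumptions inferred from the single example URL in this answer and not verified against the site: that `productId` and `productGroupId` are the numbers in the `pid-` and `pgid-` path segments, and that `catalogId`, `cache`, `country` and `lang_locale` stay the same across products. `build_sizes_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlencode, urlsplit

# Endpoint taken from the sizes_url in this answer.
SERVICE = "http://store.nike.com/html-services/templateData/pdpData"


def build_sizes_url(product_url, catalog_id="100701"):
    """Build the AJAX URL for a product page (parameter mapping is assumed)."""
    path = urlsplit(product_url).path
    match = re.search(r"/pid-(\d+)/pgid-(\d+)", path)
    if match is None:
        raise ValueError("no pid-/pgid- segments in %r" % product_url)
    params = {
        "action": "getPage",
        "path": path,
        "productId": match.group(1),
        "productGroupId": match.group(2),
        "catalogId": catalog_id,  # assumed constant; copied from the example
        "cache": "true",
        "country": "US",
        "lang_locale": "en_US",
    }
    # urlencode percent-encodes the slashes in the path value.
    return SERVICE + "?" + urlencode(params)


print(build_sizes_url(
    "http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-"
    "firm-ground-soccer-cleat/pid-11229710/pgid-11918119"
))
```

For the product in the question this reproduces the hard-coded sizes_url above. Inside the spider, `build_sizes_url(response.url)` could replace the literal string, provided the assumptions hold for other products.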
https://stackoverflow.com/questions/42594117