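I'm trying to use Nokogiri to get the ASIN numbers from an Amazon HTML page, but I've had no success with XPath. I also tried FirePath and still got nothing. Would it be better to just grab the URLs and then run a Ruby regex over them to get the ASIN? If so, what would the regex look like?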
#!/usr/bin/env ruby -w
require 'nokogiri'
require 'open-uri'
url = "http://www.amazon.com/gp/new-releases/books/3839/ref=zg_bsnr_nav"
doc = Nokogiri::HTML(open(url))
puts doc.xpath('//zg_list').each do | node|
p node['asin']
end

This is what I get when it prints the URLs instead:
#!/usr/bin/env ruby -w
require 'nokogiri'
require 'open-uri'
url = "http://www.amazon.com/gp/new-releases/books/3839/ref=zg_bsnr_nav"
doc = Nokogiri::HTML(open(url))
l = doc.css('div.zg_image a').map { |link|
link['href']
}
puts l # => /Introducing-ZBrush-4-Eric-Keller/dp/0470527641/ref=zg_bsnr_3839_20/183-0702383-0095048

Posted on 2011-04-08 14:01:14
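I find Nokogiri's css method easier to use than XPath. Given the HTML at the URL you posted, the following code should retrieve the "asin" attribute of each item:

doc.css("div.zg_item").map { |e| e["asin"] }

I think the correct XPath would look like this:

doc.xpath("//div[contains(@class, 'zg_item') and @asin]")

Posted on 2011-04-08 15:57:28

You can use CSS accessors or XPath: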
#!/usr/bin/env ruby -w
require 'nokogiri'
require 'open-uri'
url = "http://www.amazon.com/gp/new-releases/books/3839/ref=zg_bsnr_nav"
doc = Nokogiri::HTML(open(url))
# CSS
# puts doc.search('div[class="zg_item zg_sparseListItem"]').each { |n| p n['asin'] }
# XPath
puts doc.search('//div[@class="zg_item zg_sparseListItem"]').each { |n| p n['asin'] }
# >> "1934356549"
# >> "0596802471"
# >> "B004M8T01Q"
# >> "0596809158"
# >> "0470943327"
# >> "B004MMEJ36"
# >> "1935182641"
# >> "B004RDOPJI"
# >> "1449390501"
# >> "1449389716"
# >> "B004IWRH4I"
# >> "0470527641"
# >> "0735650926"
# >> "1430231475"
# >> "0321751043"
# >> "B004NBZ65G"
# >> "B004TMNSJK"
# >> "0132091518"
# >> "144030842X"
# >> "1430234040"
# >> 0

https://stackoverflow.com/questions/5590562
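
On the regex route the question asks about: neither answer above covers it, so here is a minimal sketch. It assumes the ASIN is always the 10-character segment that follows "/dp/" in the href, as in the sample URL from the question. The names ASIN_PATTERN and extract_asin are hypothetical and do not come from the thread, and Amazon may use other URL shapes that this pattern would miss.

```ruby
# Hypothetical helper (not from the thread): pull the ASIN out of a product href.
# Assumption: the ASIN is the 10-character run of uppercase letters and digits
# directly after "/dp/", followed by "/", "?", or the end of the string.
ASIN_PATTERN = %r{/dp/([A-Z0-9]{10})(?=[/?]|\z)}

def extract_asin(href)
  match = ASIN_PATTERN.match(href)
  match && match[1] # nil when the href has no "/dp/<ASIN>" segment
end

# Sample href taken from the question's own output.
href = "/Introducing-ZBrush-4-Eric-Keller/dp/0470527641/ref=zg_bsnr_3839_20/183-0702383-0095048"

p extract_asin(href)        # => "0470527641"
p extract_asin("/gp/help")  # => nil
```

With the question's second script this could be applied as `l.map { |h| extract_asin(h) }.compact`. Reading the "asin" attribute directly, as the answers do, is likely more robust, since it does not depend on the URL layout.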