Q: Scraping Amazon with Selenium - XPath issue

Stack Overflow user

Asked 2020-09-08 12:04:53

1 answer · 2.2K views · 0 followers · 0 votes

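I'm working on a course project, but the data I obtained from Amazon is missing the products' names, prices, and categories. Since I don't have an AWS account for the API, I decided to scrape this information using the ASINs (product IDs) I already have. However, I don't know much about web scraping yet (e.g., XML structure). The scraping part of the code was adapted from a working forum-scraping project, but it doesn't work here.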

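I also tried BeautifulSoup, even with code taken specifically from similar Amazon projects, but that didn't work either. Since Selenium is versatile, I'd prefer to learn this approach. Here is the code with the non-functional XPaths: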

from selenium import webdriver
from random import randint

asin_set = ['0151004714', '0380709473','0511189877', '0528881469', '0545105668', '0557348153', '0594033926', '0594296420', '0594450268', '0594451647', '0594459451', '0594481902', '059449771X']

driver = webdriver.Chrome()
list_of_dicts[:] = []
print('This is gonna be LEGEN... wait for it:')
for i in asin_set[:5]:
    url = f'https://www.amazon.com/gp/product/{i}'
    driver.get(url)
    product_info = {}
    product_info['asin'] = i
    try:
        name = driver.find_elements_by_xpath('//*[@id="' + x + '"]')   #<---
        product_info['name'] = name.text('productTitle')               #<---
    except:
        product_info['name'] = 0
    try:
        price = driver.find_elements_by_xpath('//*[@id="' + x + '"]')  #<---
        product_info['price'] = price.text                             #<---
    except:
        product_info['price'] = 0
    try:
        category = driver.find_elements_by_xpath('//*[@id="' + x + '"]/ul/li[5]/span/a')        #<---
        product_info['category'] = category.get_attribute('wayfinding-breadcrumbs_feature_div') #<---
    except:
        product_info['category'] = 0

    list_of_dicts.append(product_info)  # Append scrape to dictionary
    print(str(len(list_of_dicts)) + ' . ', end='')   # print the current length of the scrapes
    sleep(randint(1,2))               # Sleep 1 or 2 seconds in bewteen scrapes
print('DARY!')

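The cell runs without raising an error, and the browser opens each page. However, nothing is accessed or stored correctly; the resulting list_of_dicts is: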

[{'asin': '0151004714', 'name': 0, 'price': 0, 'category': 0},
 {'asin': '0380709473', 'name': 0, 'price': 0, 'category': 0},
 {'asin': '0511189877', 'name': 0, 'price': 0, 'category': 0},
 {'asin': '0528881469', 'name': 0, 'price': 0, 'category': 0},
 {'asin': '0545105668', 'name': 0, 'price': 0, 'category': 0}]

Tags: python, selenium, xpath

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-09-08 14:05:03

Use WebDriverWait() instead of sleep, wait for visibility_of_element_located(), and use the following XPaths.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

asin_set = ['0151004714', '0380709473','0511189877', '0528881469', '0545105668', '0557348153', '0594033926', '0594296420', '0594450268', '0594451647', '0594459451', '0594481902', '059449771X']

driver = webdriver.Chrome()
list_of_dicts = []
print('This is gonna be LEGEN... wait for it:')
for i in asin_set:
    url = 'https://www.amazon.com/gp/product/{}'.format(i)
    driver.get(url)
    product_info = {}
    product_info['asin'] = i
    # Wait up to 10 seconds for the product title to become visible
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//span[@id="productTitle"]')))
    try:
        # find_element (singular) returns one element, so .text is available
        name = driver.find_element(By.XPATH, '//span[@id="productTitle"]')
        product_info['name'] = name.text.strip()
    except NoSuchElementException:
        product_info['name'] = 0
    try:
        # First element styled as a price; this can also be an availability message
        price = driver.find_element(By.XPATH, "(//span[contains(@class,'a-color-price')])[1]")
        product_info['price'] = price.text
    except NoSuchElementException:
        product_info['price'] = 0
    try:
        # Last matching list-item link, used here as the category
        category = driver.find_element(By.XPATH, "(//span[@class='a-list-item']/a)[last()]")
        product_info['category'] = category.text.strip()
    except NoSuchElementException:
        product_info['category'] = 0

    list_of_dicts.append(product_info)  # Append this product's dict to the list
    print(str(len(list_of_dicts)) + ' . ', end='')   # Print the number of products scraped so far

print('DARY!')
print(list_of_dicts)
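As a side note on why the question's version stored 0 for every field: two things in it raise exceptions that the bare `except:` silently turns into 0. The XPath string concatenates a variable that is never assigned, and `find_elements_by_xpath` (plural) returns a list, which has no `.text`. The sketch below reproduces both failure modes in plain Python without a browser. The names `describe_failure`, `text_of_list`, `xpath_from_undefined_name`, and `undefined_id` are hypothetical stand-ins for illustration, not part of Selenium or of the original code.

```python
def describe_failure(action):
    """Run action() and report the exception instead of hiding it."""
    try:
        return action()
    except Exception as exc:  # keeps the reason, unlike a bare `except:`
        return f"{type(exc).__name__}: {exc}"


def text_of_list():
    # Stand-in for the result of find_elements_by_xpath(), which is a list.
    matches = []
    return matches.text  # a list has no .text attribute


def xpath_from_undefined_name():
    # Stand-in for the question's '//*[@id="' + x + '"]' where x is never set.
    return '//*[@id="' + undefined_id + '"]'  # noqa: F821 (intentionally undefined)


print(describe_failure(text_of_list))
print(describe_failure(xpath_from_undefined_name))
```

Printing the exception type while debugging, rather than storing 0, would have pointed at the undefined variable and the list-versus-element mix-up directly.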

Console output of the Selenium script:

This is gonna be LEGEN... wait for it:
1 . 2 . 3 . 4 . 5 . 6 . 7 . 8 . 9 . 10 . 11 . 12 . 13 . DARY!

[{'price': '$16.80', 'name': 'The Last Life: A Novel', 'asin': '0151004714', 'category': 'eBook Readers'},
 {'price': '$11.10', 'name': "Crows Can't Count", 'asin': '0380709473', 'category': 'eBook Readers'},
 {'price': '$4.00', 'name': 'URC CLIKR-5 Time Warner Cable Remote Control UR5U-8780L', 'asin': '0511189877', 'category': 'Remote Controls'},
 {'price': 'Currently unavailable.', 'name': 'Rand McNally 528881469 7-inch Intelliroute TND 700 Truck GPS', 'asin': '0528881469', 'category': 'Trucking GPS'},
 {'price': '$13.97', 'name': 'Elephant Run', 'asin': '0545105668', 'category': 'eBook Readers'},
 {'price': '$83.59', 'name': 'Knighthorse', 'asin': '0557348153', 'category': 'eBook Readers'},
 {'price': 'Currently unavailable.', 'name': 'Barnes & Noble Dessin Leather Cover for Nook Color & Nook Tablet Digital Reader - Noir', 'asin': '0594033926', 'category': 'eBook Readers & Accessories'},
 {'price': 'Currently unavailable.', 'name': 'Barnes & Noble Power Adapter for Nook Simple Touch', 'asin': '0594296420', 'category': 'AC Adapters'},
 {'price': 'Currently unavailable.', 'name': 'Nook Hd + 9-Inch Groovy Protective Stand Cover, Storm Gray', 'asin': '0594450268', 'category': 'Cases'},
 {'price': '$15.99', 'name': 'Barnes & Noble HDTV Adapter Kit for NOOK HD and NOOK HD+', 'asin': '0594451647', 'category': 'Chargers & Adapters'},
 {'price': 'Only 3 left in stock - order soon.', 'name': 'Barnes & Noble Nook Color Tablet USB Cable Charger Newest Re-enforced Version', 'asin': '0594459451', 'category': 'Power Cables'},
 {'price': '$47.88', 'name': 'Barnes & Noble OV/HB Universal Power Kit for Nook HD & HD+', 'asin': '0594481902', 'category': 'Power Adapters'},
 {'price': '$39.88', 'name': 'Barnes & Noble Replacement Charging Sync Cable for Nook HD and HD+ (5 Feet)', 'asin': '059449771X', 'category': 'Power Cables'}]
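Note that several 'price' values in the output above are availability messages ('Currently unavailable.', 'Only 3 left in stock - order soon.') rather than amounts, since the price XPath matches whatever element carries the a-color-price class first. If numeric prices are needed downstream, one option is to post-process the scraped strings. The following is a minimal sketch, assuming US-dollar strings such as '$16.80'; `parse_price` is a hypothetical helper, not something from the original answer.

```python
import re
from decimal import Decimal

# Matches strings such as "$16.80" or "$1,234.50" and nothing else.
_PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def parse_price(text):
    """Return the dollar amount as a Decimal, or None if text is not a price."""
    if not isinstance(text, str):
        return None  # failed lookups are stored as the integer 0
    match = _PRICE_RE.fullmatch(text.strip())
    if match is None:
        return None  # e.g. an availability message
    return Decimal(match.group(1).replace(",", ""))


# Example rows shaped like the scraped dictionaries.
rows = [
    {"asin": "0151004714", "price": "$16.80"},
    {"asin": "0528881469", "price": "Currently unavailable."},
]
for row in rows:
    row["price_usd"] = parse_price(row["price"])
print(rows)
```

Using Decimal rather than float avoids rounding surprises when the prices are later summed or compared.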

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/63793547
