I am trying to scrape data from a website. Below is the code I have so far.
These are the modules:
import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
The following collects the URL of each target product:
urls = []  # not shown in the original post; must exist before the loop appends to it
driver = webdriver.Chrome(ChromeDriverManager().install())
for page in tqdm(range(5, 10)):
    driver.get("https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page="+str(page)+"&sortBy=pop")
    skincare = driver.find_elements(By.XPATH, '//div[@class="col-xs-2-4 shopee-search-item-result__item"]//a[@data-sqe="link"]')
    for _skincare in tqdm(skincare):
        urls.append({"url":_skincare.get_attribute('href')})
driver.quit()
The URLs were fetched successfully. Next, I do the following:
data_final = pd.DataFrame(urls)
driver = webdriver.Chrome(ChromeDriverManager().install())
skincares = []
for product in tqdm(data_final["url"]):
    driver.get(product)
    try:
        company = driver.find_element(By.XPATH,"//div[@class='CKGyuW']//div[@class='_1Yaflp page-product__shop']//div[@class='_1YY3XU']//div[@class='zYQ1eS']//div[@class='_3LoNDM']").text
    except:
        company = 'none'
    try:
        product_name = driver.find_element(By.XPATH,"//div[@class='flex flex-auto eTjGTe']//div[@class='flex-auto flex-column _1Kkkb-']//div[@class='_2rQP1z']//span").text
    except:
        product_name = 'none'
    try:
        rating = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB _14izon']").text
    except:
        rating = 'none'
    try:
        number_of_ratings = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB']").text
    except:
        number_of_ratings = 'none'
    try:
        sold = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3EOMd6']//div[@class='HmRxgn']").text
    except:
        sold = 'none'
    try:
        price = driver.find_element(By.XPATH,"//div[@class='_2Shl1j']").text
    except:
        price = 'none'
    try:
        description = driver.find_element(By.XPATH,"//div[@class='_1MqcWX']//p[@class='_2jrvqA']").text
    except:
        description = 'none'
    skincares.append({
        "url": product,
        "company": company,
        "product name": product_name,
        "rating": rating,
        "number of ratings": number_of_ratings,
        "sold": sold,
        "price": price,
        "description": description,
    })
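    time.sleep(5)
To avoid being blocked, I used time.sleep(x) and tried x = 1, 1.5, 2, 5, and 15. The code above gives inconsistent results. Calling
skincares_data = pd.DataFrame(skincares)
skincares_data
I get a table (image omitted) in which many fields are blank or were not fetched correctly. If I re-run the code, I get a different set of data: some fields that were blank now have data, and some that were fetched correctly are now empty. Running it again produces the same kind of problem.
I don't think being "blocked" by the website is the issue (I only use time.sleep() to make sure).
Any comments?
To summarize: I am scraping a website and the URLs are fetched successfully, but the details of each product are not fetched reliably. Each field is either blank or fetched correctly, and there are a lot of blanks.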
Answered 2022-11-23 06:35:39
The page loads its content dynamically as you scroll down. The following code will solve your problem:
[..]
wait = WebDriverWait(driver, 15)
url='https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=1&sortBy=pop'
driver.get(url)
rows= wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "shopee-search-item-result__item")]')))
for r in rows:
    # Accessing this property scrolls the element into view, which triggers lazy loading
    r.location_once_scrolled_into_view
# "t" is presumably the time module, imported as "t" in the elided setup above
t.sleep(5)
products = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-sqe="item"]')))
for p in products:
    name = p.find_element(By.XPATH, './/div[@data-sqe="name"]').text.strip()
    some_id = p.find_element(By.XPATH, './/a[@data-sqe="link"]').get_attribute('href').split('?sp_atk=')[0].split('-i.')[1]
    print(name, some_id)
All items will be printed in the terminal:
ORIG M.Q. Cosmetics MACAROON LIP THERAPY LIPBALM WITH SPATULA | MQ
wholesale 10092844.9115684791
Magic Lip Therapy Balm in 10g jar (FREE Spatula) Rebranding NO STICKER! 286498185.11511633880
BIOAQUA COLLAGEN Nourish Lips Membrane Moisturizing Lip Mask moisture nourishing skin care soft 295464315.8585504678
Lip therapy Cosmetic Potion lipbalm
₱5 off
Free Gift 11055729.11663828134
VASELINE Rosy Lip Stick 4.8g 92328166.8130605004
Collagen Crystal lip mask lips plump gel personal care hydrating lip whitening a smacker wrinkle gel 386726777.2925165359
blk cosmetics fresh lip scrub coco crush 62677292.5532509493
[...]
See the Selenium documentation for more details.
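The same idea (wait for the content to render instead of sleeping for a fixed time) may also help with the product pages in the question, where fields come back blank at random. Below is a minimal sketch, not the answer's own code: `text_when_ready` is a hypothetical helper that polls `find_element` until the element exists and has non-empty text, or gives up after a timeout. It is a hand-rolled stand-in for Selenium's explicit waits, written without Selenium imports so it can be tried on its own. It assumes only that the driver exposes `find_element(by, value)` and that the locator strategy string `"xpath"` is what `By.XPATH` holds.

```python
import time


def text_when_ready(driver, xpath, timeout=10.0, poll=0.5, default="none"):
    """Poll until the element at `xpath` has non-empty text, else return `default`.

    Hypothetical helper for illustration; `driver` is anything with a
    Selenium-style find_element(by, value) method.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # "xpath" is the string value of selenium's By.XPATH
            text = driver.find_element("xpath", xpath).text.strip()
            if text:
                return text
        except Exception:
            # With a real driver this is typically NoSuchElementException or
            # StaleElementReferenceException while the page is still rendering.
            # Caught broadly here only to keep the sketch dependency-free.
            pass
        if time.monotonic() >= deadline:
            return default
        time.sleep(poll)
```

In the question's loop, each try/except pair could then become a single call such as `price = text_when_ready(driver, "//div[@class='_2Shl1j']")`. A field is recorded as 'none' only after the timeout expires, rather than whenever the page happens to be slow.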
https://stackoverflow.com/questions/74540393