Q: Data scraped from a website is only correct intermittently (inconsistent results)

Stack Overflow user

Asked on 2022-11-23 00:09:26


I am trying to scrape data from a website. Here is the code I have so far.

These are the modules:

import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

Next, I collect the URL of each target product:

driver = webdriver.Chrome(ChromeDriverManager().install())
urls = []  # product links collected from the listing pages

for page in tqdm(range(5, 10)):
    driver.get("https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page="+str(page)+"&sortBy=pop")
    
    skincare = driver.find_elements(By.XPATH, '//div[@class="col-xs-2-4 shopee-search-item-result__item"]//a[@data-sqe="link"]')

    for _skincare in tqdm(skincare):
        urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

The URLs were fetched successfully. This is what I do next:

data_final = pd.DataFrame(urls)

driver = webdriver.Chrome(ChromeDriverManager().install())
skincares = []

for product in tqdm(data_final["url"]):
    driver.get(product)
    try:
        company = driver.find_element(By.XPATH,"//div[@class='CKGyuW']//div[@class='_1Yaflp page-product__shop']//div[@class='_1YY3XU']//div[@class='zYQ1eS']//div[@class='_3LoNDM']").text
    except:
        company = 'none'
    try:
        product_name = driver.find_element(By.XPATH,"//div[@class='flex flex-auto eTjGTe']//div[@class='flex-auto flex-column  _1Kkkb-']//div[@class='_2rQP1z']//span").text
    except:
        product_name = 'none'
    try:
        rating = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB _14izon']").text
    except:
        rating = 'none'
    try:
        number_of_ratings = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB']").text
    except:
        number_of_ratings = 'none'
    try:
        sold = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3EOMd6']//div[@class='HmRxgn']").text
    except:
        sold = 'none'
    try:
        price = driver.find_element(By.XPATH,"//div[@class='_2Shl1j']").text
    except:
        price = 'none'
    try:
        description = driver.find_element(By.XPATH,"//div[@class='_1MqcWX']//p[@class='_2jrvqA']").text
    except:
        description = 'none'
        
    
    skincares.append({
        "url": product,
        "company": company,
        "product name": product_name,
        "rating": rating,
        "number of ratings": number_of_ratings,
        "sold": sold,
        "price": price,
        "description": description,

        })
    time.sleep(5)

To avoid being blocked, I used time.sleep(x) and tried x = 1, 1.5, 2, 5, and 15. The results of the code above are inconsistent. When I call

skincares_data = pd.DataFrame(skincares)
skincares_data

I get: [screenshot in the original post]

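Many of the fields are blank or were not fetched correctly. If I rerun the code, I get a different set of results: some fields that were blank now have data, while some that were fetched correctly are now empty. Running it again produces the same problem.

I don't think being "blocked" by the website is the issue (I only added time.sleep() to make sure).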
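One way to check whether the blanks really move between runs is to compare two result lists field by field. The sketch below is only an illustration: the sample rows and the example.com URLs are made up, and it assumes each run is a list of dicts shaped like the entries appended to skincares, with 'none' as the placeholder for a failed lookup. A field that is missing in one run and present in the other would suggest a timing problem rather than a wrong XPath.

```python
from collections import Counter

MISSING = "none"  # placeholder stored by the scraping loop when a lookup fails


def missing_counts(rows):
    """Count, per field, how many rows hold the placeholder value."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value == MISSING:
                counts[field] += 1
    return dict(counts)


def flaky_urls(run_a, run_b, field):
    """Return URLs where `field` was fetched in exactly one of the two runs."""
    b_by_url = {row["url"]: row for row in run_b}
    flaky = []
    for row in run_a:
        other = b_by_url.get(row["url"])
        if other is None:
            continue  # URL not present in the second run
        if (row[field] == MISSING) != (other[field] == MISSING):
            flaky.append(row["url"])
    return flaky


# Hypothetical data standing in for two runs of the product loop
run_1 = [
    {"url": "https://example.com/a", "price": "none", "sold": "12"},
    {"url": "https://example.com/b", "price": "₱99", "sold": "none"},
]
run_2 = [
    {"url": "https://example.com/a", "price": "₱50", "sold": "12"},
    {"url": "https://example.com/b", "price": "₱99", "sold": "none"},
]

print(missing_counts(run_1))
print(flaky_urls(run_1, run_2, "price"))
```

In this made-up data, "price" for the first URL flips between runs, while "sold" for the second URL is missing both times, which would point to a different cause (for example, a selector that never matches).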

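Any comments?

I tried to scrape data from a website. I got the URLs successfully, but the details of each product are not fetched correctly: many fields come back blank. Each field is either blank or fetched properly.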

web-scraping

beautifulsoup

python

selenium

1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-11-23 06:35:39

The page loads its content dynamically as you scroll down. The following code will solve your problem:

[..]
wait = WebDriverWait(driver, 15)
url='https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=1&sortBy=pop'
driver.get(url)
rows= wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "shopee-search-item-result__item")]')))
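for r in rows:
    # Reading this property scrolls the element into view, which triggers the lazy loading
    r.location_once_scrolled_into_view
time.sleep(5)  # give the lazily loaded items time to render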
products = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-sqe="item"]')))
for p in products:
    name = p.find_element(By.XPATH, './/div[@data-sqe="name"]').text.strip()
    some_id = p.find_element(By.XPATH, './/a[@data-sqe="link"]').get_attribute('href').split('?sp_atk=')[0].split('-i.')[1]
    print(name, some_id)

All items will be printed in the terminal:

ORIG M.Q. Cosmetics MACAROON LIP THERAPY LIPBALM WITH SPATULA | MQ
wholesale 10092844.9115684791
Magic Lip Therapy Balm in 10g jar (FREE Spatula) Rebranding NO STICKER! 286498185.11511633880
BIOAQUA COLLAGEN Nourish Lips Membrane Moisturizing Lip Mask moisture nourishing skin care soft 295464315.8585504678
Lip therapy Cosmetic Potion lipbalm
₱5 off
Free Gift 11055729.11663828134
VASELINE Rosy Lip Stick 4.8g 92328166.8130605004
Collagen Crystal lip mask lips plump gel personal care hydrating lip whitening a smacker wrinkle gel 386726777.2925165359
blk cosmetics fresh lip scrub coco crush 62677292.5532509493
[...]
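The identifier printed after each name has the form of two dot-separated numbers taken from the part of the link after "-i.". A minimal sketch for splitting it is shown below. The function name, the sample link, and the reading of the two numbers as a shop ID followed by an item ID are assumptions made for illustration; the answer itself only calls the value some_id.

```python
def parse_shopee_ids(href):
    """Return (shop_id, item_id) from a product link, or None if the pattern is absent.

    Assumption: the link ends in "-i.<number>.<number>", optionally followed by
    a query string, and the two numbers are the shop and item identifiers.
    """
    path = href.split("?", 1)[0]        # drop the query string, e.g. ?sp_atk=...
    marker = "-i."
    if marker not in path:
        return None
    tail = path.rsplit(marker, 1)[1]    # "<number>.<number>"
    parts = tail.split(".")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


# Hypothetical link built around one of the identifiers printed above
sample = "https://shopee.ph/Some-Product-Name-i.10092844.9115684791?sp_atk=abc"
print(parse_shopee_ids(sample))
```

Returning None for links that do not match keeps a single malformed href from raising an IndexError in the middle of a long scraping run.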

See the Selenium documentation for details.
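The same idea may apply to the product pages in the question, where each field is read once right after driver.get() and silently replaced by 'none' if it is not there yet. The sketch below shows the polling pattern in plain Python so that the retry logic is visible; poll_text and fake_lookup are hypothetical names, and the fake lookup only simulates a slow page. With Selenium, the WebDriverWait and expected_conditions classes used in the answer already provide this kind of polling.

```python
import time


def poll_text(get_text, timeout=10.0, interval=0.5, default="none"):
    """Call get_text() until it returns non-empty text or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = get_text()
            if text and text.strip():
                return text.strip()
        except Exception:
            pass  # lookup failed (e.g. element not rendered yet); retry until the deadline
        if time.monotonic() >= deadline:
            return default
        time.sleep(interval)


# Stand-in for a slowly rendering page: fails twice, then returns text
attempts = {"n": 0}


def fake_lookup():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LookupError("not rendered yet")
    return " ₱120 "


print(poll_text(fake_lookup, timeout=2.0, interval=0.01))
```

In the question's loop, a call such as poll_text(lambda: driver.find_element(By.XPATH, "//div[@class='_2Shl1j']").text) could stand in for the try/except around price. Whether that removes all the blanks depends on how the site renders each field, so it is worth checking on a few product pages first.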


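The original content of this page is provided by Stack Overflow. Translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: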

https://stackoverflow.com/questions/74540393
