
TimeoutException: Message: occurs randomly while web scraping

Stack Overflow user
Asked 2022-08-29 18:43:58
Answers 1, Views 50, Following 0, Votes -1

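I want to scrape the product description of every product in the search results. There are 50 pages of results with 60 products per page, so I need to scrape 3,000 products in total. With my current code, this error occurs at random: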

---------------------------------------------------------------------------
TimeoutException                          Traceback (most recent call last)
/var/folders/hj/yrd6ng651fv5d2_gtngcysy40000gn/T/ipykernel_50778/696262163.py in <module>
     34 for link in product_links:
     35         driver.get(link)
---> 36         WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.CLASS_NAME, "page-product")))
     37 
     38         driver.execute_script("""

/opt/miniconda3/lib/python3.9/site-packages/selenium/webdriver/support/wait.py in until(self, method, message)
     78             if time.time() > end_time:
     79                 break
---> 80         raise TimeoutException(message, screen, stacktrace)
     81 
     82     def until_not(self, method, message=''):

TimeoutException: Message: 

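Sometimes the error occurs after 60 products have been scraped, sometimes after 300, and sometimes before a single product has been scraped.

I tried increasing the WebDriverWait timeout from 10 to 100 seconds, but that did not solve the problem.

Does anyone know how to fix this?

Here is my code: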

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options       # to customize chrome display
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC 
from time import sleep
from collections import Counter
import json
import time
import pandas as pd

# create object for chrome options
chrome_options = Options()

# Customize chrome display
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1365,6572.610")
chrome_options.add_argument('--disable-infobars')      

# create webdriver object
path = '/Applications/chromedriver'
webdriver_service = Service(path)
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

baseurl = 'https://shopee.co.id'

product_links = []

for page in range(5, 11):
    search_link = 'https://shopee.co.id/search?keyword=obat%20kanker&page={}'.format(page)
    driver.get(search_link)
    WebDriverWait(driver, 80).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result")))

    driver.execute_script("""
            var scroll = document.body.scrollHeight / 10;
            var i = 0;
            function scrollit(i) {
            window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
            i++;
            if (i < 10) {
                setTimeout(scrollit, 500, i);
                }
            }
            scrollit(i);
            """)
    sleep(5)
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup = BeautifulSoup(html, "html.parser")

    product_list = soup.find_all('div',class_='col-xs-2-4 shopee-search-item-result__item' )
    for item in product_list:
        for link in item.find_all('a', href=True):
            product_links.append(baseurl + link['href'])

#testlink = 'https://shopee.co.id/Obat-Herbal-Kanker-Payudara-Serviks-Hati-Usus-Prostat-Leukimia-dan-Paru-Paru-ORIGINAL-100-ASLI-i.166801435.2584201334?sp_atk=70c736d4-ed07-435c-8edd-4e6e7552a91d&xptdk=70c736d4-ed07-435c-8edd-4e6e7552a91d'

herbcancerlist = []
for link in product_links:
        driver.get(link)
        WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.CLASS_NAME, "page-product")))

        driver.execute_script("""
                var scroll = document.body.scrollHeight / 10;
                var i = 0;
                function scrollit(i) {
                window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
                i++;
                if (i < 10) {
                setTimeout(scrollit, 500, i);
                }
                }
                scrollit(i);
                """)

        sleep(10)
        html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        soup = BeautifulSoup(html, "html.parser")

        try:
                name = soup.find('div', class_='_2rQP1z').get_text()
                price = soup.find('div', class_='_2Shl1j').get_text() 
                sold = soup.find('div', class_ = 'HmRxgn').get_text() 
                rate = soup.find('div', class_ = '_3y5XOB _14izon').get_text() 
                city = soup.find('span', class_ = '_2fJrvA').get_text() 
                specification = soup.find('div', class_ = '_2jz573').get_text()          
        except AttributeError:  # soup.find() returned None for a missing element
                name = 'No name'
                price = 'No price'
                sold = 'No value'
                rate = 'No rate'
                city = 'No city'
                specification = 'No spec'


        herbcancer = {
                'name': name,
                'price': price,
                'sold': sold,
                'rate': rate,
                'city': city,
                'specification': specification
                }

        herbcancerlist.append(herbcancer)
        print('Saving: ', herbcancer['name'])

df = pd.DataFrame(herbcancerlist)
print(df.head())

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-08-29 20:08:15

Here is one way to get the information for all of those products, put it into a dataframe, and save that data to disk as a CSV file:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import json
import pandas as pd
from tqdm.notebook import tqdm 

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
chrome_options.add_argument('--load-extension=/home/user/.config/chromium/Default/Extensions/cjpalhdlnbpafiamejdnhcphjbkeiagm/1.44.0_1')


webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

big_df = pd.DataFrame()
for x in tqdm(range(50)):
    url = f'https://shopee.co.id/search?keyword=obat%20kanker&page={x}' 
    browser.get(url)
    t.sleep(5)
    items = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'script[data-rh="true"]'))) 
    for i in items:
        json_obj = json.loads(i.get_attribute('innerHTML'))
        if json_obj['@type'] == 'Product':
            big_df = pd.concat([big_df, pd.json_normalize(json_obj)], axis=0, ignore_index=True)
    t.sleep(1)
print(big_df)
big_df.to_csv('medicinal_herbs_indonesian.csv')

Note the use of an adblock extension, as well as pandas. The result printed in the terminal:

@context    @type   name    description     url     productID   image   brand   offers.@type    offers.price    offers.priceCurrency    offers.availability     aggregateRating.@type   aggregateRating.bestRating  aggregateRating.worstRating     aggregateRating.ratingCount     aggregateRating.ratingValue     offers.lowPrice     offers.highPrice
0   http://schema.org   Product     GRAVIDA BHARATA OBAT KANKER PAYUDARA AMPUH |KANKER GANAS HERBAL TERDAFTAR DBPOM MUI WARYANTO076         https://shopee.co.id/GRAVIDA-BHARATA-OBAT-KANKER-PAYUDARA-AMPUH-KANKER-GANAS-HERBAL-TERDAFTAR-DBPOM-MUI-WARYANTO076-i.282306593.7674724221  7674724221  https://cf.shopee.co.id/file/40d3d0c5a7fc388294950b7586081843       Offer   275000.00   IDR     http://schema.org/InStock   AggregateRating     5.0     1.0     252     4.89    NaN     NaN
1   http://schema.org   Product     Walatra Zedoril 7 Asli Obat Herbal Kanker Tumor Dan Segala Jenis Benjolan Aman Tanpa Efek Samping       https://shopee.co.id/Walatra-Zedoril-7-Asli-Obat-Herbal-Kanker-Tumor-Dan-Segala-Jenis-Benjolan-Aman-Tanpa-Efek-Samping-i.189502097.3139156637   3139156637  https://cf.shopee.co.id/file/b0b7d9c09666f2fd237f3279b8194dc4   Walatra     Offer   255000.00   IDR     http://schema.org/InStock   AggregateRating     5.0     1.0     1272    4.80    NaN     NaN
2   http://schema.org   Product     Obat Herbal Kanker Payudara, Serviks, Hati, Usus, Prostat, Leukimia dan Paru Paru ORIGINAL 100% ASLI        https://shopee.co.id/Obat-Herbal-Kanker-Payudara-Serviks-Hati-Usus-Prostat-Leukimia-dan-Paru-Paru-ORIGINAL-100-ASLI-i.166801435.2584201334  2584201334  https://cf.shopee.co.id/file/1a159f810da5508ec3330ce174ca2eab       Offer   525000.00   IDR     http://schema.org/InStock   AggregateRating     5.0     1.0     1197    4.90    NaN     NaN
3   http://schema.org   Product     IDR Madu Hitam / Obat Kanker / Obat Kanker Serviks 350 gram         https://shopee.co.id/IDR-Madu-Hitam-Obat-Kanker-Obat-Kanker-Serviks-350-gram-i.12836685.4460609439  4460609439  https://cf.shopee.co.id/file/3b96e9f5fb11977b946bb5e1c4585a72   IDR     Offer   250000.00   IDR     http://schema.org/InStock   AggregateRating     5.0     1.0     6   4.83    NaN     NaN
[....]
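
The filtering step inside the loop can also be pulled out into a small helper. What follows is a minimal sketch and not part of the original answer: `extract_products` is a hypothetical name, and it assumes the text of each script tag has already been read (for example with `get_attribute('innerHTML')`). It skips blobs that are not valid JSON or that have no `@type` key, cases where a direct `json_obj['@type']` lookup would raise.

```python
import json


def extract_products(raw_blobs):
    """Return the parsed JSON-LD objects whose @type is 'Product'.

    raw_blobs is an iterable of strings (hypothetical input: the text of
    each script tag). Anything that is not valid JSON, is not a JSON
    object, or has a different @type is skipped instead of raising.
    """
    products = []
    for raw in raw_blobs:
        try:
            obj = json.loads(raw)
        except (TypeError, ValueError):
            # TypeError: raw was None; ValueError: raw was not valid JSON.
            continue
        if isinstance(obj, dict) and obj.get("@type") == "Product":
            products.append(obj)
    return products


# Made-up sample data, only to show the filtering behaviour.
sample = [
    '{"@type": "Product", "name": "Example item", "offers": {"price": "1000.00"}}',
    '{"@type": "BreadcrumbList"}',
    "not json at all",
]
products = extract_products(sample)
print([p["name"] for p in products])
```

The returned list of dicts can be passed to `pd.json_normalize` in a single call, which avoids growing a dataframe with `pd.concat` on every iteration.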

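Following this logic, you can get the individual product URLs from big_df['url'] and scrape them one by one, again using the adblock extension. To make the code more resilient, you can use a database: save big_df there, save the products there, use try/except blocks, mark the ones that succeeded, and retry the others.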
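
One possible shape for that idea is sketched below, using only the standard-library `sqlite3` module. This is an illustration under assumptions, not the answerer's code: the table layout and the names `init_queue`, `process_pending`, and `flaky_fetch` are hypothetical. In a real run, `fetch` would be a function that calls `driver.get(url)`, waits with `WebDriverWait`, and returns the scraped description; here it is replaced by a stand-in that fails once, so the sketch runs without a browser.

```python
import sqlite3


def init_queue(conn, urls):
    """Create the tracking table (hypothetical schema) and enqueue URLs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "url TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "attempts INTEGER NOT NULL DEFAULT 0, "
        "description TEXT)"
    )
    # INSERT OR IGNORE keeps existing rows, so a restart does not reset progress.
    conn.executemany(
        "INSERT OR IGNORE INTO products (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def process_pending(conn, fetch, max_attempts=3):
    """Try every URL that is not done yet and has attempts left."""
    rows = conn.execute(
        "SELECT url FROM products WHERE status != 'done' AND attempts < ?",
        (max_attempts,),
    ).fetchall()
    for (url,) in rows:
        try:
            description = fetch(url)
        except Exception:
            # With Selenium you would likely catch TimeoutException here.
            conn.execute(
                "UPDATE products SET status = 'failed', attempts = attempts + 1 "
                "WHERE url = ?",
                (url,),
            )
        else:
            conn.execute(
                "UPDATE products SET status = 'done', attempts = attempts + 1, "
                "description = ? WHERE url = ?",
                (description, url),
            )
        conn.commit()  # commit per URL so a crash loses at most one result


# Stand-in for the real scraper: fails the first time it sees the /b URL.
calls = {}


def flaky_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/b") and calls[url] == 1:
        raise TimeoutError("simulated timeout")
    return f"description for {url}"


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
init_queue(conn, ["https://example.com/a", "https://example.com/b"])
process_pending(conn, flaky_fetch)  # first pass: /b fails and is marked 'failed'
process_pending(conn, flaky_fetch)  # second pass: only /b is retried
statuses = dict(conn.execute("SELECT url, status FROM products"))
print(statuses)
```

Because progress lives in the database rather than in a Python list, a timeout on one product no longer aborts the whole run, and a restarted script picks up only the rows that are not yet marked as done.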

Votes: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/73533345
