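I'm trying to scrape product names, prices, and images from a shop. However, I can't seem to extract the images. Is it because of the HTML? I just can't find the image class in dataImg.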
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('chromedriver')
products = []
prices = []
images = []
driver.get('https://shopee.co.id/search?keyword=laptop')
content = driver.page_source
soup = BeautifulSoup(content)
soup

for link in soup.find_all('div', class_="_3EfFTx"):
    print('test')
    print(link)

for link in soup.find_all('div', class_="_3EfFTx"):
    #print(link)
    dataImg = link.find('img', class_="_1T9dHf V1Fpl5")
    print(dataImg)
    name = link.find('div', class_="_1Sxpvs")
    #print(name.get_text())
    price = link.find('div', class_="QmqjGn")
    #print(price.get_text())
    if dataImg is not None:
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])

df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Images': images})
df
Posted on 2021-02-06 10:40:18
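What happens?
You grab the page source before all of the content has loaded. Simply waiting longer does not help: only the first images are loaded, and the rest are loaded only once they scroll into view.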
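The effect can be reproduced without a browser. The sketch below is only an illustration: the markup is hypothetical (not Shopee's real HTML), it assumes that items which have not scrolled into view simply have no img element yet, as the `dataImg is not None` check in the question suggests, and it uses the standard library's html.parser instead of Beautiful Soup so that it runs on its own.

```python
from html.parser import HTMLParser

# Hypothetical markup: two items already rendered, one still a placeholder.
SAMPLE_HTML = """
<div data-sqe="item"><img src="https://example.com/a.jpg"><div>Item A</div></div>
<div data-sqe="item"><img src="https://example.com/b.jpg"><div>Item B</div></div>
<div data-sqe="item"><div class="placeholder"></div><div>Item C</div></div>
"""


class ItemImageCounter(HTMLParser):
    """Count item containers and collect the src of every <img> that has one."""

    def __init__(self):
        super().__init__()
        self.items = 0
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("data-sqe") == "item":
            self.items += 1
        elif tag == "img" and attrs.get("src"):
            self.image_urls.append(attrs["src"])


counter = ItemImageCounter()
counter.feed(SAMPLE_HTML)
# Fewer images than items means some items were captured before their image loaded.
print(f"{counter.items} items, {len(counter.image_urls)} images")
```

If the two counts differ on the real page source, the snapshot was taken before all of the images had been loaded.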
How to fix it?
You have to wait a moment and then scroll step by step toward the bottom of the page:
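import time

time.sleep(5)  # give the first batch of results time to render
for i in range(10):
    # scroll down in small steps so that lazily loaded images come into view
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1)
Example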
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('chromedriver')
products = []
prices = []
images = []
driver.get('https://shopee.co.id/search?keyword=laptop')

# wait for the initial render, then scroll step by step to trigger lazy loading
time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1)

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

for item in soup.select('div[data-sqe="item"]'):
    dataImg = item.img
    name = item.find('div', class_="_1Sxpvs")
    price = item.find('div', class_="QmqjGn")
    # skip items whose image has not been loaded
    if dataImg is not None:
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])

df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Images': images})
df
Output
   Product Name                                       Price                      Images
0  [ACQ] Meja Laptop Lipat Portable                   Rp51.990                   https://cf.shopee.co.id/file/83a9e6e8ecad7a3db...
1  LENOVO Thinkpad CORE i5 Ram 8GB/ 2TB/1TB/500GB...  Rp2.100.000 - Rp4.200.000  https://cf.shopee.co.id/file/44fbc24f5c585cda1...
2  HP Laptop 14s-cf3076TU/i3-1005G1/256GB SSD/14"...  Rp6.599.000Rp6.598.999     https://cf.shopee.co.id/file/170a45679aa5002f1...
...
Posted on 2021-02-06 10:26:31
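The site loads its images with JavaScript. To work around this, you need to make Selenium wait a little. Here is code that retrieves the image src values: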
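from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

products = []
prices = []
images = []
driver = webdriver.Chrome(r'F:\Sonstiges\chromedriver\chromedriver.exe')
driver.get('https://shopee.co.id/search?keyword=laptop')
sleep(8)  # give the JavaScript time to load the images
# find_elements(By.CLASS_NAME, ...) replaces find_elements_by_class_name, which Selenium 4.3 removed
imgs = driver.find_elements(By.CLASS_NAME, '_1T9dHf')
for img in imgs:
    img_url = img.get_attribute("src")
    if img_url:  # skip images whose src is still empty
        print(img_url)
driver.quit()
To get the images themselves, download them from the retrieved URLs. If you are using Beautiful Soup only because it runs in the background, Selenium can also run headless (in the background).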
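Downloading from the collected URLs could look roughly like the sketch below. It is one possible approach using only the standard library, not necessarily what the answer linked to. The helper names `filename_from_url` and `download_images`, the `images` target folder, and the default `.jpg` extension for URLs without one are all assumptions made for illustration.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default_ext=".jpg"):
    """Derive a local file name from the last path segment of the URL."""
    name = posixpath.basename(urlparse(url).path) or "image"
    # Assumption: a URL without a file extension points at a JPEG image.
    if not os.path.splitext(name)[1]:
        name += default_ext
    return name


def download_images(urls, target_dir="images"):
    """Download every URL into target_dir and return the saved file paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(target_dir, filename_from_url(url))
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved
```

With the loop above changed to append each `img_url` to the `images` list instead of printing it, `download_images(images)` would fetch the files one by one.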
https://stackoverflow.com/questions/66073052