I have this scraper, and I'm trying to export the results to a CSV file in Google Colab. I receive the scraped information as string values but can't convert it to CSV. I want each scraped attribute ("title", "size", etc.) to populate a column in the CSV file. I have already run the strings through BeautifulSoup to remove the HTML formatting. Please see the code below.
import pandas as pd
import time
import io
from io import StringIO
import csv
#from google.colab import drive
#drive.mount('drive')

# Use the kora.selenium library to run chromedriver in Colab
from kora.selenium import wd
# Import BeautifulSoup to parse HTML formatting
from bs4 import BeautifulSoup

wd.get("https://www.grailed.com/sold/EP8S3v8V_w")  # Get webpage

ScrollNumber = round(200 / 40) + 1
for i in range(0, ScrollNumber):
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

#--------------#
# Each new attribute has to be found using XPath because Grailed's page is
# rendered with JavaScript (React), not static HTML.
# Only 39 results will show otherwise, because the page is an infinite scroll
# and Selenium must be told to keep scrolling.
follow_loop = range(2, 200)
for x in follow_loop:
    # Title
    title = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    title += str(x)
    title += "]/a/div[3]/div[2]/p"
    title = wd.find_elements_by_xpath(title)
    title = str(title)
    # Price
    price = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    price += str(x)
    price += "]/div/div/p/span"
    price = wd.find_elements_by_xpath(price)
    price = str(price)
    # Size
    size = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    size += str(x)
    size += "]/a/div[3]/div[1]/p[2]"
    size = wd.find_elements_by_xpath(size)
    size = str(size)
    # Sold
    sold = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    sold += str(x)
    sold += "]/a/p/span"
    sold = wd.find_elements_by_xpath(sold)
    sold = str(sold)
    # Clean HTML formatting using BeautifulSoup
    cleantitle = BeautifulSoup(title, "lxml").text
    cleanprice = BeautifulSoup(price, "lxml").text
    cleansize = BeautifulSoup(size, "lxml").text
    cleansold = BeautifulSoup(sold, "lxml").text
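The core problem in the code above is that find_elements_by_xpath returns a list of WebElement objects, so str(title) produces the list's repr (something like "[<selenium.webdriver...WebElement ...>]"), not HTML, and BeautifulSoup then has nothing meaningful to parse. A minimal sketch of the intended export, reusing the question's XPath patterns and its pre-Selenium-4 API, collecting each element's .text into rows and writing them out via the pandas import already present (the column names and output filename are illustrative):

rows = []
for x in range(2, 200):
    base = "//*[@id='shop']/div/div/div[3]/div[2]/div/div[" + str(x) + "]"
    titles = wd.find_elements_by_xpath(base + "/a/div[3]/div[2]/p")
    prices = wd.find_elements_by_xpath(base + "/div/div/p/span")
    sizes = wd.find_elements_by_xpath(base + "/a/div[3]/div[1]/p[2]")
    solds = wd.find_elements_by_xpath(base + "/a/p/span")
    if titles:  # skip indices where no listing loaded
        rows.append({
            "title": titles[0].text,                    # .text already strips
            "price": prices[0].text if prices else "",  # the HTML tags, so no
            "size": sizes[0].text if sizes else "",     # BeautifulSoup pass
            "sold": solds[0].text if solds else "",     # is needed
        })

pd.DataFrame(rows).to_csv("scraped_listings.csv", index=False)  # one column per attribute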
That's a whole lot of work. Here is a simpler way:
from selenium import webdriver
import time
import csv

driver = webdriver.Chrome()
driver.get("https://www.grailed.com/sold/EP8S3v8V_w")

# Keep scrolling so the infinite-scroll page loads about 200 listings
scroll_count = round(200 / 40) + 1
for i in range(scroll_count):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)
time.sleep(3)

# Select each attribute by its CSS class instead of a positional XPath
titles = driver.find_elements_by_css_selector("p.listing-designer")
prices = driver.find_elements_by_css_selector("p.sub-title.sold-price")
sizes = driver.find_elements_by_css_selector("p.listing-size.sub-title")
sold = driver.find_elements_by_css_selector("div.-overlay")

# Convert each list of WebElements into a list of plain strings
data = [titles, prices, sizes, sold]
data = [list(map(lambda element: element.text, arr)) for arr in data]

# newline='' keeps the csv module from inserting a blank line between rows
with open('sold_shoes.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    j = 0
    while j < len(titles):
        row = []
        for i in range(len(data)):
            row.append(data[i][j])
        writer.writerow(row)
        j += 1

Opening the file with newline='' stops the blank line that otherwise appears between each row of the file. Note that this is a naive solution, because it assumes every list has the same length; consider grabbing each listing's parent element and building each row from its children instead (see the sketch below). Also, I used Selenium alone without BeautifulSoup because it's easier for me, but you should learn BS too, since it's faster than Selenium. Happy coding.
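A sketch of that parent-element approach, using the same Selenium 3-style element lookups as above; the div.feed-item parent class and the relative child selectors are guesses at Grailed's markup, not verified, and should be checked against the live page:

# Sketch only: one row per listing card, so a missing field cannot shift columns
from selenium import webdriver
import csv
import time

driver = webdriver.Chrome()
driver.get("https://www.grailed.com/sold/EP8S3v8V_w")
time.sleep(3)

def child_text(card, selector):
    # Return the text of the first matching child, or "" when it is absent
    matches = card.find_elements_by_css_selector(selector)
    return matches[0].text if matches else ""

cards = driver.find_elements_by_css_selector("div.feed-item")  # assumed parent class
with open('sold_shoes.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["title", "price", "size", "sold"])  # header row
    for card in cards:
        writer.writerow([
            child_text(card, "p.listing-designer"),
            child_text(card, "p.sub-title.sold-price"),
            child_text(card, "p.listing-size.sub-title"),
            child_text(card, "div.-overlay"),
        ])

Because every field for a listing comes from the same card, listings that lack a price or size simply produce an empty cell instead of misaligning the whole column.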
https://stackoverflow.com/questions/65535311