So, I need to scrape a table from every page of a website. There are 324 pages (meaning 324 tables), and each table has 1000 rows and 7 columns, though one column is useless and I don't use it.
The code works, but the problem is that it is very slow and takes a lot of time. I'd like to know whether there are changes I could make to speed it up!
Here is the code:
import timeit

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome('./chromedriver.exe')
driver.get('https://beheshtezahra.tehran.ir/Default.aspx?tabid=92')
driver.maximize_window()

part_count = 1
li = []

for i in range(0, 324):
    start = timeit.default_timer()
    # fill in the search form: "%" acts as a wildcard, part selects the page
    firstname = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='dnn$ctr1877$DeadSearch$txtname']")))
    lastname = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='dnn$ctr1877$DeadSearch$txtFamily']")))
    part = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='dnn$ctr1877$DeadSearch$txtPart']")))
    firstname.clear()
    firstname.send_keys("%")
    lastname.clear()
    lastname.send_keys("%")
    part.clear()
    part.send_keys(str(part_count))
    driver.find_element_by_xpath('//*[@id="dnn_ctr1877_DeadSearch_btnSearch"]').click()
    print('Saving the information..')
    # six separate Selenium queries, one per column
    first_name = driver.find_elements_by_xpath('//table/tbody/tr/td[2]')
    last_name = driver.find_elements_by_xpath('//table/tbody/tr/td[3]')
    fathers_name = driver.find_elements_by_xpath('//table/tbody/tr/td[4]')
    birth_date = driver.find_elements_by_xpath('//table/tbody/tr/td[5]')
    death_date = driver.find_elements_by_xpath('//table/tbody/tr/td[6]')
    grave_info = driver.find_elements_by_xpath('//table/tbody/tr/td[7]')
    print('Appending the information..')
    for j in range(0, 1000):
        li.append(first_name[j].text)
        li.append(last_name[j].text)
        li.append(fathers_name[j].text)
        li.append(birth_date[j].text)
        li.append(death_date[j].text)
        li.append(grave_info[j].text)
    print('Page ' + str(part_count) + ' is crawled!')
    stop = timeit.default_timer()
    part_count += 1
    print('Time: ', stop - start)

In the end, I write the list out to a CSV file. Any suggestions would be greatly appreciated!
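For the CSV step mentioned above, the flat list can be split into fixed-width rows before writing. A minimal sketch with the standard library (the output filename, header names, and the two sample records are assumptions, not from the original post):

```python
import csv

def write_rows(flat, path, width=6):
    # split the flat list into rows of `width` fields and write them as CSV
    rows = [flat[i:i + width] for i in range(0, len(flat), width)]
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['first_name', 'last_name', 'fathers_name',
                         'birth_date', 'death_date', 'grave_info'])
        writer.writerows(rows)

# hypothetical example with two scraped records (12 fields total)
write_rows(['a', 'b', 'c', 'd', 'e', 'f',
            'g', 'h', 'i', 'j', 'k', 'l'], 'out.csv')
```

With `li` built as in the question (six appends per table row), passing it to `write_rows` yields one CSV line per person.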
Posted on 2021-10-26 19:17:55
What you can do after the print('Saving the information..') part:
    print('Saving the information..')
    # requires `import lxml.html` at the top of the script (pip install lxml first);
    # parse the rendered page once, then run all six XPaths against the snapshot
    page_snapshot = lxml.html.document_fromstring(driver.page_source)
    first_name = page_snapshot.xpath('//table/tbody/tr/td[2]')
    last_name = page_snapshot.xpath('//table/tbody/tr/td[3]')
    fathers_name = page_snapshot.xpath('//table/tbody/tr/td[4]')
    birth_date = page_snapshot.xpath('//table/tbody/tr/td[5]')
    death_date = page_snapshot.xpath('//table/tbody/tr/td[6]')
    grave_info = page_snapshot.xpath('//table/tbody/tr/td[7]')
    print('Appending the information..')
    for j in range(0, 1000):
        li.append(first_name[j].text)
        li.append(last_name[j].text)
        li.append(fathers_name[j].text)
        li.append(birth_date[j].text)
        li.append(death_date[j].text)
        li.append(grave_info[j].text)

lxml is very fast, just import lxml.html (after pip installing it, of course :)). And make sure the page has fully loaded before you take the snapshot.
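The lxml approach can be exercised without a browser: document_fromstring accepts any HTML string, so the same XPaths work on a tiny static table. The two rows below are invented stand-ins for the real data, and .text_content() is used as a safer alternative to .text for cells that might contain nested markup:

```python
import lxml.html

# a tiny static stand-in for driver.page_source (the real page has 1000 rows)
html = """
<table><tbody>
  <tr><td>1</td><td>Ali</td><td>Rezaei</td><td>Hassan</td>
      <td>1320</td><td>1399</td><td>Block 5</td></tr>
  <tr><td>2</td><td>Sara</td><td>Karimi</td><td>Reza</td>
      <td>1335</td><td>1400</td><td>Block 7</td></tr>
</tbody></table>
"""

page_snapshot = lxml.html.document_fromstring(html)
# same XPath as in the answer: second <td> of every row
first_name = [td.text_content() for td in page_snapshot.xpath('//table/tbody/tr/td[2]')]
print(first_name)  # ['Ali', 'Sara']
```

Since the parse happens once per page instead of six live Selenium queries, the per-page cost drops to a single page_source transfer plus fast in-memory XPath evaluation.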
https://stackoverflow.com/questions/69515926