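I am trying to scrape data from a search results page, but every time I pass that page's link to BeautifulSoup I get an error. I think this is because the page is not the same on every visit. I am not sure what this is called, or even what to search for, so any help would be appreciated.
This is the link to the search results. However, if you visit it without having run a search first, it does not show any results: https://www.clarkcountycourts.us/Portal/Home/WorkspaceMode?p=0
Instead, if you copy and paste it, it takes you to this page to run a search: https://www.clarkcountycourts.us/Portal/ and then you have to click Smart Search.
So, for simplicity, say we search for "Robinson" and I need to export the table data to an Excel file. I can't give BeautifulSoup a link, because the link is not valid on its own, I believe. How should I approach this?
Even pulling the table up with a simple table view gives no information about the data from our "Robinson" search, such as Case Number or File Date, from which to build a pandas DataFrame.
Edit: Thanks to @Arundeep Chohan, this is what I have so far. A huge shout-out for the great help!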
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(20) # gives an implicit wait for 20 seconds
driver.get("https://www.clarkcountycourts.us/Portal/Home/Dashboard/29")
search_box = driver.find_element(By.ID, "caseCriteria_SearchCriteria") # find_element_by_id was removed in Selenium 4
search_box.send_keys("Robinson")
#Code to complete captchas
WebDriverWait(driver, 15).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[name^='a-'][src^='https://www.google.com/recaptcha/api2/anchor?']")))
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//span[@id='recaptcha-anchor']"))).click()
driver.switch_to.default_content() #necessary to switch out of iframe element for submit button
time.sleep(5) #gives time to click submit to results
driver.find_element(By.ID, "btnSSSubmit").click() # click() returns None, so there is nothing to assign
time.sleep(5)
soup = BeautifulSoup(driver.page_source,'html.parser')
df = pd.read_html(str(soup))[0]
print(df)

Answered 2022-02-03 05:01:03
options = Options()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=options)
driver.maximize_window()
wait=WebDriverWait(driver,10)
driver.get('https://www.clarkcountycourts.us/Portal/')
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"a.portlet-buttons"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"input#caseCriteria_SearchCriteria"))).send_keys("Robinson")
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[@title='reCAPTCHA']")))
elem=wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"div.recaptcha-checkbox-checkmark")))
driver.execute_script("arguments[0].click()", elem)
driver.switch_to.default_content()
x = input("Waiting for recaptcha done")
wait.until(EC.element_to_be_clickable((By.XPATH,"(//input[@id='btnSSSubmit'])[1]"))).click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(str(soup))[0]
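print(df)

This should be the minimum needed to reach your page, in case you want to know. There is an iframe to deal with and a reCAPTCHA to deal with. After that, just use pandas to grab the table.
(Edit): They have apparently added a proper reCAPTCHA, so add a solver at the point where I put the pause for input.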
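The frame that `pd.read_html` returns for this results page comes back with stacked header levels, blank spacer rows, and repeated header rows, as the output posted with this answer shows. A minimal sketch of tidying it before export follows. It assumes the last header level holds the real column names and that a "Case Number" column exists. The helper name `clean_results` is hypothetical, and the small `raw` frame is only a stand-in for `pd.read_html(...)[0]`, built from the two case numbers visible in the posted output.

```python
import pandas as pd


def clean_results(raw: pd.DataFrame) -> pd.DataFrame:
    """Flatten stacked headers and drop spacer / repeated-header rows."""
    df = raw.copy()
    # Assumption: the last header level carries the real column names.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(-1)
    # Drop rows where every cell is empty.
    df = df.dropna(how="all")
    # Drop rows that merely repeat the header text.
    df = df[df["Case Number"] != "Case Number"]
    return df.reset_index(drop=True)


# Stand-in for pd.read_html(...)[0]; the real frame has more columns.
columns = pd.MultiIndex.from_tuples(
    [("Unnamed: 0_level_0", "Case Number"), ("Date of Birth", "File Date")]
)
raw = pd.DataFrame(
    [
        [None, None],
        ["Case Number", "File Date"],
        ["08A575873", "11/17/2008"],
        [None, None],
        ["Case Number", "File Date"],
        ["08A575874", "11/17/2008"],
    ],
    columns=columns,
)

cleaned = clean_results(raw)
print(cleaned)
```

From there, `cleaned.to_excel("robinson_cases.xlsx", index=False)` would write the Excel file the question asks for. The file name is a placeholder, and pandas needs an Excel writer such as openpyxl installed for that call.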
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from bs4 import BeautifulSoup

Output:
Waiting for manual date to be entered. Enter YES when done.
Unnamed: 0_level_0 ... Date of Birth
Case Number ... File Date
Case Number ... File Date
0 NaN ... NaN
1 NaN ... Cases (1) Case NumberStyle / DefendantFile Da...
2 Case Number ... File Date
3 08A575873 ... 11/17/2008
4 NaN ... NaN
5 NaN ... Cases (1) Case NumberStyle / DefendantFile Da...
6 Case Number ... File Date
7 08A575874 ... 11/17/2008

https://stackoverflow.com/questions/70965897