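I'm trying to scrape a website (https://harleytherapy.com/therapists?page=1) that appears to be generated by JavaScript. The element I want to scrape (the ul with id="downshift-7-menu") does not appear in the page source; it only shows up after I click "Inspect Element".
I looked for a solution here, and this is the code I have come up with so far (a combination of Selenium and BeautifulSoup):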
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
url = "https://harleytherapy.com/therapists?page=1"
options = webdriver.ChromeOptions()
options.add_argument('headless')
capa = DesiredCapabilities.CHROME
capa["pageLoadStrategy"] = "none"
driver = webdriver.Chrome(chrome_options=options, desired_capabilities=capa)
driver.set_window_size(1440,900)
driver.get(url)
time.sleep(15)
plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'html')
therapist_menu_id = "downshift-7-menu"
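print(soup.find(id=therapist_menu_id))
I assumed that having Selenium wait 15 seconds would ensure all elements had loaded, but I still can't find any element with the id downshift-7-menu in the soup. Does anyone know what is wrong with my code?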
Posted on 2020-12-27 19:56:42
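The element with ID downshift-7-menu is only loaded after the THERAPIST dropdown has been opened. You can do that by scrolling the dropdown into view so that it loads and then clicking it. You should also consider replacing the sleep with an explicit wait: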
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 15)
# scroll the dropdown into view to load it
side_menu = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'inner-a377b5')))
last_height = driver.execute_script("return arguments[0].scrollHeight", side_menu)
while True:
    driver.execute_script("arguments[0].scrollTo(0, arguments[0].scrollHeight);", side_menu)
    new_height = driver.execute_script("return arguments[0].scrollHeight", side_menu)
    if new_height == last_height:
        break
    last_height = new_height
# open the menu
wait.until(EC.visibility_of_element_located((By.ID, 'downshift-7-input'))).click()
# wait for the option to load
therapist_menu_id = 'downshift-7-menu'
wait.until(EC.presence_of_element_located((By.ID, therapist_menu_id)))
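# re-parse the page source now that the menu exists; a soup built before the click will not contain it
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id=therapist_menu_id))
https://stackoverflow.com/questions/65465339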
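To show why an explicit wait is preferable to a fixed time.sleep(15), here is a minimal sketch of the polling idea, using only the standard library. The names wait_until and FakePage are hypothetical and exist only for this illustration; they are not part of Selenium. WebDriverWait works in a similar spirit: it re-checks a condition at intervals and gives up with a timeout error if the condition is never met.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 15.0,
               poll: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises TimeoutError
    once `timeout` seconds have elapsed. (Hypothetical helper.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


class FakePage:
    """Stand-in for a page whose menu only exists from the Nth lookup on."""

    def __init__(self, ready_after: int) -> None:
        self.checks = 0
        self.ready_after = ready_after

    def find_menu(self) -> Optional[str]:
        self.checks += 1
        if self.checks >= self.ready_after:
            return "downshift-7-menu"
        return None


page = FakePage(ready_after=3)
menu = wait_until(page.find_menu, timeout=2.0, poll=0.01)
print(menu, page.checks)  # downshift-7-menu 3
```

The wait returns as soon as the element is available instead of always blocking for the full duration, and it fails loudly when the element never appears, which a plain sleep cannot do.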