Question: How do I iterate over a web table using Selenium?

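Stack Overflow user

Asked 2021-07-23 02:33:32

2 answers · 824 views · 0 followers · 1 vote

I have been trying to iterate over this table of EV companies on Crunchbase, but for some reason the code only pulls the first row. Any idea why? Thanks! :)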

#imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

#paths
PATH = "C:/Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

driver.get("https://www.crunchbase.com/search/organizations/field/organization.companies/categories/electric-vehicle")
driver.maximize_window()
time.sleep(5)
print(driver.title)

WebDriverWait(driver, 20).until(
        EC.visibility_of_element_located(
          (By.XPATH, ('/html/body/chrome/div/mat-sidenav-container/mat-sidenav-content/div/search/page-layout/div/div/form/div[2]/results/div/div/div[3]/sheet-grid/div/div/grid-body/div/grid-row[1]/grid-cell[2]/div/field-formatter/identifier-formatter/a/div/div')
        )))

companies = driver.find_elements_by_css_selector("div.identifier-label")

#create company dictionary and iterate through Crunchbase EV company table             
company_list = []
                          
for company in companies:
    name = company.find_element_by_xpath('/html/body/chrome/div/mat-sidenav-container/mat-sidenav-content/div/search/page-layout/div/div/form/div[2]/results/div/div/div[3]/sheet-grid/div/div/grid-body/div/grid-row[1]/grid-cell[2]/div/field-formatter/identifier-formatter/a/div/div').text
    industry = company.find_element_by_xpath('/html/body/chrome/div/mat-sidenav-container/mat-sidenav-content/div/search/page-layout/div/div/form/div[2]/results/div/div/div[3]/sheet-grid/div/div/grid-body/div/grid-row[1]/grid-cell[3]/div/field-formatter/identifier-multi-formatter/span').text
    hq = company.find_element_by_xpath('/html/body/chrome/div/mat-sidenav-container/mat-sidenav-content/div/search/page-layout/div/div/form/div[2]/results/div/div/div[3]/sheet-grid/div/div/grid-body/div/grid-row[1]/grid-cell[4]/div/field-formatter/identifier-multi-formatter/span').text
    cblist = {
        'name': name,
        'industry': industry,
        'hq': hq
    }
    company_list.append(cblist)
#create dataframe    
df = pd.DataFrame(company_list)
print(df)

Tags: pandas, selenium, selenium-webdriver, python

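2 Answers

Stack Overflow user

Accepted answer

Posted 2021-07-23 04:55:10

First, get all the grid-row elements so that you have every row in the table. Then use a relative XPath (one that starts with .) to search only within the selected row. An absolute XPath that starts with /html is evaluated from the document root no matter which element you call it on, and yours hard-codes grid-row[1], so every iteration returns the first row.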

# one element per table row (By comes from selenium.webdriver.common.by)
all_rows = driver.find_elements(By.CSS_SELECTOR, "grid-row")

all_companies = []

for row in all_rows:
    # each XPath starts with "." so it searches inside this row only
    company = {
        'name':     row.find_element(By.XPATH, './/*[@class="identifier-label"]').text.strip(),
        'industry': row.find_element(By.XPATH, './/*[@data-columnid="categories"]//span').text.strip(),
        'hq':       row.find_element(By.XPATH, './/*[@data-columnid="location_identifiers"]//span').text.strip(),
        'cb rank':  row.find_element(By.XPATH, './/*[@data-columnid="rank_org"]').text.strip(),
    }
    all_companies.append(company)
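
The difference between a pinned path and a row-relative path can be reproduced without a browser. The sketch below is only an illustration: it uses Python's standard `xml.etree.ElementTree` as a stand-in for the page DOM (it is not Selenium), and the markup and company names are made up. The first lookup is anchored at the root and pinned to `grid-row[1]`, as in the question; the second starts from each row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup standing in for the rendered grid.
PAGE = """
<grid-body>
  <grid-row>
    <grid-cell data-columnid="identifier"><div class="identifier-label">Alpha Motors</div></grid-cell>
    <grid-cell data-columnid="categories"><span>Automotive</span></grid-cell>
  </grid-row>
  <grid-row>
    <grid-cell data-columnid="identifier"><div class="identifier-label">Beta Charge</div></grid-cell>
    <grid-cell data-columnid="categories"><span>Energy</span></grid-cell>
  </grid-row>
</grid-body>
"""

root = ET.fromstring(PAGE)
rows = root.findall(".//grid-row")

# Pinned pattern: the search starts at the root and always selects grid-row[1],
# so every iteration reads the same first row.
pinned = [
    root.find("./grid-row[1]//*[@class='identifier-label']").text
    for _ in rows
]

# Row-relative pattern: the search starts at the current row,
# so each row yields its own cell.
scoped = [
    row.find(".//*[@class='identifier-label']").text
    for row in rows
]

print("pinned:", pinned)
print("scoped:", scoped)
```

The pinned list repeats the first company once per row, which matches the symptom described in the question, while the scoped list contains one name per row.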

You should also learn to use class, id, and any other unique attribute values, for example data-columnid.
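
One way to keep attribute-based locators readable is to generate them from a small mapping. This is a sketch rather than part of the original answer: the column ids are the ones quoted in this thread, while the `COLUMNS` dict and the `cell_xpath` helper are hypothetical names introduced for illustration.

```python
# Output field name -> data-columnid value (ids as quoted in this thread).
COLUMNS = {
    "industry": "categories",
    "hq": "location_identifiers",
    "cb rank": "rank_org",
}


def cell_xpath(column_id: str) -> str:
    """Return a row-relative XPath for the cell with the given data-columnid."""
    return f'.//*[@data-columnid="{column_id}"]'


# Build one relative locator per output field.
locators = {field: cell_xpath(col) for field, col in COLUMNS.items()}

for field, xpath in locators.items():
    print(f"{field}: {xpath}")
```

Each generated string starts with `.`, so passing it to a lookup on a row element keeps the search inside that row.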

Full working code

#imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

#paths
PATH = "C:/Program Files (x86)/chromedriver.exe"
driver = webdriver.Chrome(PATH)
#driver = webdriver.Chrome()

url = "https://www.crunchbase.com/search/organizations/field/organization.companies/categories/electric-vehicle"
driver.get(url)
driver.maximize_window()
time.sleep(5)

print('title:', driver.title)

#wait until the first company name in the grid is visible
WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located(
        (By.XPATH, '//grid-body//identifier-formatter/a/div/div')
    )
)

#one element per table row
all_rows = driver.find_elements(By.CSS_SELECTOR, "grid-row")

all_companies = []

for row in all_rows:
    #each XPath starts with "." so it searches inside this row only
    company = {
        'name':     row.find_element(By.XPATH, './/*[@class="identifier-label"]').text.strip(),
        'industry': row.find_element(By.XPATH, './/*[@data-columnid="categories"]//span').text.strip(),
        'hq':       row.find_element(By.XPATH, './/*[@data-columnid="location_identifiers"]//span').text.strip(),
        'cb rank':  row.find_element(By.XPATH, './/*[@data-columnid="rank_org"]').text.strip(),
    }
    all_companies.append(company)

#create dataframe
df = pd.DataFrame(all_companies)
print(df)

2 votes

Stack Overflow user

Posted 2021-07-23 03:15:03

In each of the locators, increment the grid-row index on every iteration of the for loop, for example:

# Set row_index = 0 before the loop; incrementing it at the top of each iteration
# makes the first pass target grid-row[1] (XPath positions start at 1).
row_index = row_index + 1

name = company.find_element_by_xpath(
        '/html/body/chrome/div/mat-sidenav-container/mat-sidenav-content/div/search/page-layout/div/div/form/div[2]/results/div/div/div[3]/sheet-grid/div/div/grid-body/div/grid-row['+str(row_index)+']/grid-cell[2]/div/field-formatter/identifier-formatter/a/div/div').text
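
The index-based approach can be sketched as plain string building. The example below makes some assumptions: `ROW_TEMPLATE` is a shortened, hypothetical prefix rather than the full absolute path used above, and `row_count` stands in for the number of rows found on the page.

```python
# Shortened, hypothetical path; the answer above uses the full absolute XPath.
ROW_TEMPLATE = "//grid-body/div/grid-row[{row}]/grid-cell[{cell}]"


def cell_path(row: int, cell: int) -> str:
    """Build the XPath for one cell; XPath positions are 1-based."""
    return ROW_TEMPLATE.format(row=row, cell=cell)


row_count = 3  # assumption: would be len(companies) in the question's code

# Cell 2 holds the company name in the question's paths.
name_paths = [cell_path(row, 2) for row in range(1, row_count + 1)]

for path in name_paths:
    print(path)
```

Starting the range at 1 avoids an off-by-one error, because `grid-row[0]` never matches anything in XPath.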

1 vote

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/68493382
