Q: How to scrape nested data from a LinkedIn page using Selenium and Python

Stack Overflow user

Asked 2022-08-12 14:55:26

Answers: 5, Views: 161, Followers: 0, Votes: 1

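I am working on my master's project, and I want to scrape LinkedIn. So far, I have run into a problem when trying to scrape a user's education page (https://www.linkedin.com/in/williamhgates/details/education/).

I want to scrape all of the user's education entries. In this example, I want to scrape "Harvard University" under mr1 hoverable-link-text t-bold, but I can't figure out how.

Here is the HTML from the LinkedIn page: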

<li class="pvs-list__paged-list-item artdeco-list__item pvs-list__item--line-separated " id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0">
                        <!----><div class="pvs-entity
    pvs-entity--padded pvs-list__item--no-padding-when-nested
    
    ">
  <div>
        <a class="optional-action-target-wrapper 
        display-flex" target="_self" href="https://www.linkedin.com/company/1646/">
        <div class="ivm-image-view-model  pvs-entity__image ">
    <div class="ivm-view-attr__img-wrapper ivm-view-attr__img-wrapper--use-img-tag display-flex
    
    ">
<!---->      <img width="48" src="https://media-exp1.licdn.com/dms/image/C4E0BAQF5t62bcL0e9g/company-logo_100_100/0/1519855919126?e=1668643200&amp;v=beta&amp;t=BL0HxGNOasVbI3u39HBSL3n7H-yYADkJsqS3vafg-Ak" loading="lazy" height="48" alt="Harvard University logo" id="ember59" class="ivm-view-attr__img--centered EntityPhoto-square-3  lazy-image ember-view">
</div>
  </div>
    </a>

  </div>

  <div class="display-flex flex-column full-width align-self-center">
    <div class="display-flex flex-row justify-space-between">
          <a class="optional-action-target-wrapper 
          display-flex flex-column full-width" target="_self" href="https://www.linkedin.com/company/1646/">
        <div class="display-flex align-items-center">
            <span class="mr1 hoverable-link-text t-bold">
              <span aria-hidden="true"><!---->Harvard University<!----></span><span class="visually-hidden"><!---->Harvard University<!----></span>
            </span>
<!----><!----><!---->        </div>
<!---->          <span class="t-14 t-normal t-black--light">
            <span aria-hidden="true"><!---->1973 - 1975<!----></span><span class="visually-hidden"><!---->1973 - 1975<!----></span>
          </span>
<!---->      </a>


<!---->
      <div class="pvs-entity__action-container">
<!---->      </div>
    </div>

      <div class="pvs-list__outer-container">
<!---->    <ul class="pvs-list
        
        ">
        <li class=" ">
                <div class="pvs-list__outer-container">
<!----><!----><!----></div>

        </li>
    </ul>
<!----></div>
  </div>
</div>

                </li>

I tried the following code:

education = driver.find_element("xpath", '//*[@id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0"]/div/div[2]/div[1]/a/div/span/span[1]/').text
print(education)

I keep getting the error:

Message: no such element: Unable to locate element:

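Can anyone help? I would like a script that loops through the education entries and saves the place of education and the years of education.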

python

selenium

xpath

css-selectors

5 Answers

Stack Overflow user

Accepted answer

Answered 2022-08-12 22:48:03

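To extract the text Harvard University, you ideally need to induce WebDriverWait for visibility_of_element_located() (https://stackoverflow.com/a/50474905/7429447), and you can use either of the following locator strategies (https://stackoverflow.com/a/48056120/7429447):

Using CSS_SELECTOR:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.pvs-list>li span.hoverable-link-text span"))).text)

Using XPATH:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span"))).text)

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".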
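The explicit wait in this answer helps because the element may not exist yet when the lookup runs, so the lookup is retried until it succeeds or a timeout expires. The following is a hand-rolled sketch of that polling idea only, not Selenium's actual implementation; `wait_until`, its parameters, and `fake_lookup` are hypothetical names introduced for illustration. In real code, WebDriverWait remains the tool to use.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            # Treat any lookup failure (e.g. "no such element") as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page that only renders the element on the third lookup.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("no such element")
    return "Harvard University"


text = wait_until(fake_lookup, timeout=2.0, poll=0.01)
print(text)
```

With a real driver, the condition could be something like `lambda: driver.find_element("css selector", "span.hoverable-link-text span").text`, although the expected_conditions helpers shown in the answer already cover this case.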

Votes: 0

Stack Overflow user

Answered 2022-08-18 10:04:39

Thank you all!

In the end, I arrived at the following code.

get_education_school = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span[1]")))]

get_education_years = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 't-14 t-normal t-black--light')]//span[1]")))]

results_education_school = []
results_education_years = []
for i,j in zip(get_education_school, get_education_years):
    results_education_school.append(i)
    results_education_years.append(j)

print(results_education_school)
print(results_education_years)
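Since the question also asks to save the results, here is a minimal sketch of writing the two lists to CSV with the standard library. The function name `education_to_csv` and the column names are assumptions made for illustration. Note that `zip` stops at the shorter list, so if an entry on the page has no years, the two lists may end up misaligned.

```python
import csv
import io


def education_to_csv(schools, years, out):
    """Write (school, years) pairs as CSV rows to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["school", "years"])  # assumed column names
    for school, year in zip(schools, years):
        writer.writerow([school, year])


# Demo with the values from the question, written to an in-memory buffer.
buffer = io.StringIO()
education_to_csv(["Harvard University"], ["1973 - 1975"], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("education.csv", "w", newline="", encoding="utf-8")` (the filename is only an example), with the lists of strings printed above as the first two arguments.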

Votes: 1

Stack Overflow user

Answered 2022-08-12 15:08:13

I would first get the education list.

education_list = driver.find_element(By.CSS_SELECTOR, 'ul.pvs-list')
# loop through education_list for place and years
# would recommend relative locators for this task.
# find the image and get the first and second span with text inside of them.

I am adding more details to the code now. Please wait.
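The loop described in the comments could be sketched as follows: find every education list item, then look up the school and the years inside that item, so each pair stays together even when one field is missing. The selectors are taken from the HTML in the question and may not match LinkedIn's current markup; `scrape_education` and the three constants are hypothetical names. The strings "css selector" are the values behind By.CSS_SELECTOR, so the sketch needs no Selenium imports of its own.

```python
# Selectors based on the HTML snippet in the question (assumed, may be outdated).
ITEM_CSS = "li.pvs-list__paged-list-item"
SCHOOL_CSS = "span.hoverable-link-text span[aria-hidden='true']"
YEARS_CSS = "span.t-black--light span[aria-hidden='true']"


def scrape_education(driver):
    """Return one {"school": ..., "years": ...} dict per education entry."""
    rows = []
    for item in driver.find_elements("css selector", ITEM_CSS):
        # Search inside this <li> only, so school and years belong together.
        school = item.find_elements("css selector", SCHOOL_CSS)
        years = item.find_elements("css selector", YEARS_CSS)
        rows.append({
            "school": school[0].text if school else None,
            "years": years[0].text if years else None,
        })
    return rows
```

Calling `scrape_education(driver)` after the page has loaded (for example, after an explicit wait on the list) would return a list of dicts that can be printed or saved.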

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/73336045
