How to scrape nested data from a LinkedIn page using Selenium and Python

Stack Overflow user
Asked 2022-08-12 14:55:26
Answers 5 · Views 161 · Followers 0 · Votes 1

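I am working on my master's project and want to scrape LinkedIn. So far I have run into a problem when I try to scrape a user's education page (https://www.linkedin.com/in/williamhgates/details/education/).

I want to scrape all of the user's education entries. In this example, I want to scrape "Harvard University" under mr1 hoverable-link-text t-bold, but I can't work out how.

Here is the HTML from the LinkedIn page: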

<li class="pvs-list__paged-list-item artdeco-list__item pvs-list__item--line-separated " id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0">
                        <!----><div class="pvs-entity
    pvs-entity--padded pvs-list__item--no-padding-when-nested
    
    ">
  <div>
        <a class="optional-action-target-wrapper 
        display-flex" target="_self" href="https://www.linkedin.com/company/1646/">
        <div class="ivm-image-view-model  pvs-entity__image ">
    <div class="ivm-view-attr__img-wrapper ivm-view-attr__img-wrapper--use-img-tag display-flex
    
    ">
<!---->      <img width="48" src="https://media-exp1.licdn.com/dms/image/C4E0BAQF5t62bcL0e9g/company-logo_100_100/0/1519855919126?e=1668643200&amp;v=beta&amp;t=BL0HxGNOasVbI3u39HBSL3n7H-yYADkJsqS3vafg-Ak" loading="lazy" height="48" alt="Harvard University logo" id="ember59" class="ivm-view-attr__img--centered EntityPhoto-square-3  lazy-image ember-view">
</div>
  </div>
    </a>

  </div>

  <div class="display-flex flex-column full-width align-self-center">
    <div class="display-flex flex-row justify-space-between">
          <a class="optional-action-target-wrapper 
          display-flex flex-column full-width" target="_self" href="https://www.linkedin.com/company/1646/">
        <div class="display-flex align-items-center">
            <span class="mr1 hoverable-link-text t-bold">
              <span aria-hidden="true"><!---->Harvard University<!----></span><span class="visually-hidden"><!---->Harvard University<!----></span>
            </span>
<!----><!----><!---->        </div>
<!---->          <span class="t-14 t-normal t-black--light">
            <span aria-hidden="true"><!---->1973 - 1975<!----></span><span class="visually-hidden"><!---->1973 - 1975<!----></span>
          </span>
<!---->      </a>


<!---->
      <div class="pvs-entity__action-container">
<!---->      </div>
    </div>

      <div class="pvs-list__outer-container">
<!---->    <ul class="pvs-list
        
        ">
        <li class=" ">
                <div class="pvs-list__outer-container">
<!----><!----><!----></div>

        </li>
    </ul>
<!----></div>
  </div>
</div>

                </li>

I tried the following code:

education = driver.find_element("xpath", '//*[@id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0"]/div/div[2]/div[1]/a/div/span/span[1]/').text
print(education)

I keep getting this error:

Message: no such element: Unable to locate element:

Can anyone help? I would like a script that loops through the education entries and saves the school and the years for each one.

5 Answers

Stack Overflow user

Accepted answer

Posted 2022-08-12 22:48:03

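To extract the text Harvard University, ideally you need to induce WebDriverWait for visibility_of_element_located() (https://stackoverflow.com/a/50474905/7429447), and you can use either of the following locator strategies (https://stackoverflow.com/a/48056120/7429447):

  • Using CSS_SELECTOR: print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.pvs-list > li span.hoverable-link-text span"))).text)
  • Using XPATH: print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span"))).text)
  • Note: You have to add the following imports:
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".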

Votes 0

Stack Overflow user

Posted 2022-08-18 10:04:39

Thanks, everyone!

In the end, I arrived at the code below.

get_education_school = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span[1]")))]

get_education_years = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 't-14 t-normal t-black--light')]//span[1]")))]

results_education_school = []
results_education_years = []
for i,j in zip(get_education_school, get_education_years):
    results_education_school.append(i)
    results_education_years.append(j)

print(results_education_school)
print(results_education_years)
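
The question also asks to save the results, while the code above only prints them. A minimal sketch of one way to persist the two lists with the standard csv module follows. The helper name write_education_csv and the file name education.csv are assumptions for illustration, not part of the original answer. It uses zip_longest so that a missing year range leaves an empty cell instead of silently dropping a row, which plain zip would do.

```python
import csv
from itertools import zip_longest


def write_education_csv(schools, years, fileobj):
    """Write one "school,years" row per entry to an open text file object.

    Hypothetical helper: zip_longest pads the shorter list with "" so no
    school is dropped when the two lists differ in length.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["school", "years"])
    for school, year_range in zip_longest(schools, years, fillvalue=""):
        writer.writerow([school, year_range])


# Possible usage with the lists built above (file name is an assumption):
# with open("education.csv", "w", newline="", encoding="utf-8") as f:
#     write_education_csv(results_education_school, results_education_years, f)
```

Padding keeps every row, but it cannot tell which entry was missing its years, so the columns can still drift apart if an entry in the middle of the list has no date range.
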
Votes 1

Stack Overflow user

Posted 2022-08-12 15:08:13

I would get the education list first.

education_list = driver.find_element(By.CSS_SELECTOR, 'ul.pvs-list')
# loop through education_list for place and years
# would recommend relative locators for this task.
# find the image and get the first and second span with text inside of them.

I am adding more detail to the code now. Please bear with me.
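
The loop described in the code comments could look roughly like the sketch below. It assumes that each education entry is an li.pvs-list__paged-list-item and that the visible text sits in a span with aria-hidden="true", as in the HTML from the question. The names SCHOOL_CSS, YEARS_CSS, first_text, extract_education and scrape_education are hypothetical, and LinkedIn's markup may have changed since. Looking up school and years relative to each list item keeps the two values paired even when an entry has no date range. Locator strategies are passed as plain strings ("css selector"), the same style the question uses with "xpath".

```python
SCHOOL_CSS = "span.hoverable-link-text span[aria-hidden='true']"
YEARS_CSS = "span.t-black--light span[aria-hidden='true']"


def first_text(element, css):
    """Return the stripped text of the first match under element, or None."""
    matches = element.find_elements("css selector", css)
    return matches[0].text.strip() if matches else None


def extract_education(items):
    """Build one {"school", "years"} dict per education list item."""
    results = []
    for item in items:
        school = first_text(item, SCHOOL_CSS)
        if school is None:
            continue  # skip list items that carry no school name
        results.append({"school": school, "years": first_text(item, YEARS_CSS)})
    return results


def scrape_education(driver, timeout=20):
    """Wait for the education entries on the current page and extract them."""
    # Imported here so the helpers above can be exercised without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    items = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "li.pvs-list__paged-list-item")
        )
    )
    return extract_education(items)
```

With a logged-in driver already on the education page, scrape_education(driver) would return a list of dicts, one per entry, with "years" set to None where no date range is shown.
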

Votes 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link: https://stackoverflow.com/questions/73336045
