Unable to find the correct href using Selenium / BS4

Stack Overflow user

Asked 2021-09-17 11:39:28

Answers: 1 | Views: 35 | Followers: 0 | Votes: 1

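I'm trying to use the code below to streamline my financial data collection, but it has a few problems. I want to scrape one specific href from this page: 'https://www.witan.com/investor-information/factsheets/#currentPage=1'

The href I'm trying to parse: href="/media/1767/witan-investment-trust_factsheet_310821.pdf"

I'm currently using Selenium for this, but it is a bit slow, so I'm open to suggestions if the link can be scraped with BS4 instead. My attempts so far have failed.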

import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set options for selenium
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--window-size=1920,1200")

# Requests website using Selenium & ChromeDriver
driver = webdriver.Chrome('C:/AnaConda/chromedriver.exe', options=options)
driver.get('https://www.witan.com/investor-information/factsheets/#currentPage=1') # Requests website
html = driver.page_source

# Parse the rendered page and take the first link whose href matches the pattern
soup = BeautifulSoup(html, "html.parser")
link_finder = soup.findAll('a', href=re.compile('/witan-investment-trust-factsheet'))[0]

When I run the code above, I get an a tag for a different file: class="... document-view size" href="/media/1750/witan-investment-trust-factsheet-30jun2021.pdf" target="_blank"...
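
One likely reason for this result, judging only from the two filenames quoted in this post: the pattern '/witan-investment-trust-factsheet' requires hyphens, while the wanted file is named witan-investment-trust_factsheet_310821.pdf, with underscores around "factsheet". A minimal sketch of a pattern that accepts either separator follows; the `hrefs` list and the names `strict` and `pattern` are illustrative and not taken from the page.

```python
import re

# Two hrefs quoted in this post; the list itself is only for illustration.
hrefs = [
    "/media/1767/witan-investment-trust_factsheet_310821.pdf",
    "/media/1750/witan-investment-trust-factsheet-30jun2021.pdf",
]

# Original pattern: hyphens only, so the underscore-named file is skipped.
strict = re.compile(r"/witan-investment-trust-factsheet")

# Relaxed pattern (an assumption about the naming scheme): accept "-" or "_"
# immediately before "factsheet".
pattern = re.compile(r"/witan-investment-trust[-_]factsheet")

strict_matches = [h for h in hrefs if strict.search(h)]
relaxed_matches = [h for h in hrefs if pattern.search(h)]

print(strict_matches)
print(relaxed_matches[0])
```

The same compiled `pattern` could be passed as the `href=` filter in the BeautifulSoup call above. It would still miss files that drop the "investment-trust" part of the name.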

Hope someone can help!

web-scraping

beautifulsoup

selenium

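1 Answer

Stack Overflow user

Accepted answer

Posted 2021-09-17 11:48:51

The HTML document containing the PDF links is loaded asynchronously via JavaScript, so BeautifulSoup does not see the links in the initial page. To print all the PDF links, you can do the following: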

import requests
from bs4 import BeautifulSoup

api_url = "https://www.witan.com/umbraco/surface/listing/DocumentListing"

params = {
    "currentPage": "1",
    "year": "2021",
    "isArchive": "false",
    "pagination": "true",
}

with requests.session() as s:
    # load cookies:
    s.get("https://www.witan.com/investor-information/factsheets/")
    # get document page:
    soup = BeautifulSoup(s.get(api_url, params=params).content, "html.parser")
    for a in soup.select(".document-view"):
        print("https://www.witan.com" + a["href"])

Prints:

https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf
https://www.witan.com/media/1763/witan-investment-trust_factsheet_310721.pdf
https://www.witan.com/media/1750/witan-investment-trust-factsheet-30jun2021.pdf
https://www.witan.com/media/1730/witan-investment-trust_factsheet_310521.pdf
https://www.witan.com/media/1718/witan-factsheet-30apr2021.pdf
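
Since the question asks for one specific link rather than the whole list, here is a minimal sketch of one way to pick the most recent factsheet from URLs like those printed above. It assumes the filenames end in either a six-digit day-month-year stamp (310821) or a day, three-letter month and four-digit year (30jun2021); that reading is inferred from the five names shown here and may not hold for other files. `factsheet_date`, `MONTHS` and `latest` are hypothetical names introduced for illustration.

```python
import re
from datetime import date

# URLs copied from the printed output above.
urls = [
    "https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf",
    "https://www.witan.com/media/1763/witan-investment-trust_factsheet_310721.pdf",
    "https://www.witan.com/media/1750/witan-investment-trust-factsheet-30jun2021.pdf",
    "https://www.witan.com/media/1730/witan-investment-trust_factsheet_310521.pdf",
    "https://www.witan.com/media/1718/witan-factsheet-30apr2021.pdf",
]

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}


def factsheet_date(url):
    """Return the date encoded in the filename, or None if it is not recognized."""
    name = url.rsplit("/", 1)[-1].lower()

    # Assumed form 1: six digits read as day, month, two-digit year ("310821").
    m = re.search(r"(\d{2})(\d{2})(\d{2})\.pdf$", name)
    if m:
        day, month, year = (int(g) for g in m.groups())
        return date(2000 + year, month, day)

    # Assumed form 2: day, month abbreviation, four-digit year ("30jun2021").
    m = re.search(r"(\d{1,2})([a-z]{3})(\d{4})\.pdf$", name)
    if m and m.group(2) in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(2)], int(m.group(1)))

    return None


# Keep only URLs whose date could be parsed, then take the newest one.
dated = [u for u in urls if factsheet_date(u) is not None]
latest = max(dated, key=factsheet_date)
print(latest)
```

Sorting by the parsed date avoids relying on the order in which the server returns the documents, which the original answer does not state.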

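Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's dedicated IT-domain engine.

Original link:

https://stackoverflow.com/questions/69222847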
