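I'm trying to use the code below to streamline my financial data collection, but it seems to have a few problems. I want to scrape one specific href from this page: 'https://www.witan.com/investor-information/factsheets/#currentPage=1'
The href I'm trying to parse: href="/media/1767/witan-investment-trust_factsheet_310821.pdf"
I'm currently using Selenium for this, but it is a bit slow, so I'm open to suggestions if it can be scraped with BS4 instead. My attempts so far have failed.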
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set options for selenium
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--window-size=1920,1200")
# Requests website using Selenium & ChromeDriver
driver = webdriver.Chrome('C:/AnaConda/chromedriver.exe', options=options)
driver.get('https://www.witan.com/investor-information/factsheets/#currentPage=1') # Requests website
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
link_finder = soup.findAll('a', href=re.compile('/witan-investment-trust-factsheet'))[0]

When I run the code above, I get an anchor tag whose class includes document-view, with href="/media/1750/witan-investment-trust-factsheet-30jun2021.pdf" target="_blank"...
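Hope someone can help!
Posted on 2021-09-17 11:48:51
The HTML fragment containing the PDF links is loaded asynchronously via JavaScript, so BeautifulSoup does not see the links in the initial page. To print all PDF links, you can request that fragment directly: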
import requests
from bs4 import BeautifulSoup

api_url = "https://www.witan.com/umbraco/surface/listing/DocumentListing"
params = {
    "currentPage": "1",
    "year": "2021",
    "isArchive": "false",
    "pagination": "true",
}

with requests.session() as s:
    # load cookies:
    s.get("https://www.witan.com/investor-information/factsheets/")
    # get document page:
    soup = BeautifulSoup(s.get(api_url, params=params).content, "html.parser")

    for a in soup.select(".document-view"):
        print("https://www.witan.com" + a["href"])

Prints:
https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf
https://www.witan.com/media/1763/witan-investment-trust_factsheet_310721.pdf
https://www.witan.com/media/1750/witan-investment-trust-factsheet-30jun2021.pdf
https://www.witan.com/media/1730/witan-investment-trust_factsheet_310521.pdf
https://www.witan.com/media/1718/witan-factsheet-30apr2021.pdf
https://stackoverflow.com/questions/69222847
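A side note on the regex in the question: in the printed output above, the file names are not consistent (some use `trust_factsheet_`, one uses `trust-factsheet-`, one is just `witan-factsheet-`). The pattern `/witan-investment-trust-factsheet` only matches the hyphenated June file, which may be why `[0]` returned `/media/1750/...` rather than the August factsheet. Below is a minimal sketch that compares the two patterns on the hrefs from that output. The looser pattern, the helper name `factsheet_urls`, and the idea that the first listed link is the newest are assumptions for illustration, not something the site documents.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.witan.com"

# Pattern from the question: requires the hyphenated "trust-factsheet" spelling.
STRICT_RE = re.compile(r"/witan-investment-trust-factsheet")

# Looser pattern (an assumption, not a documented naming scheme):
# "factsheet" followed by "-" or "_", with the file name ending in ".pdf".
LOOSE_RE = re.compile(r"factsheet[-_][^/]*\.pdf$", re.IGNORECASE)


def factsheet_urls(hrefs, pattern=LOOSE_RE):
    """Return absolute URLs for every href matching `pattern`, in input order."""
    return [urljoin(BASE_URL, href) for href in hrefs if pattern.search(href)]


# hrefs copied from the printed output above, in the order they were listed
hrefs = [
    "/media/1767/witan-investment-trust_factsheet_310821.pdf",
    "/media/1763/witan-investment-trust_factsheet_310721.pdf",
    "/media/1750/witan-investment-trust-factsheet-30jun2021.pdf",
    "/media/1730/witan-investment-trust_factsheet_310521.pdf",
    "/media/1718/witan-factsheet-30apr2021.pdf",
]

strict_matches = factsheet_urls(hrefs, STRICT_RE)
loose_matches = factsheet_urls(hrefs)

print(strict_matches[0])  # the only strict match: the 30jun2021 file
print(loose_matches[0])   # first listed link: the 310821 file
```

With the answer's code, the same hrefs could be collected as `[a["href"] for a in soup.select(".document-view")]` and passed to `factsheet_urls`; taking element `[0]` then relies on the listing being returned newest first, as it is in the output shown.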