
Downloading multiple PDFs from a list of URLs using Selenium

Stack Overflow user
Asked on 2020-01-24 14:22:01
Answers: 1 · Views: 243 · Followers: 0 · Votes: 0

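Hi, I have the code below, but it does not seem to download anything to my folder. I am able to extract a list of URLs that contain the PDFs, but I cannot get the part of my code that downloads the PDFs to work.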

driver = webdriver.Chrome(
executable_path=os.path.join(GenericMethods.get_full_path_to_folder('drivers'), "chromedriver.exe"),
chrome_options=chrome_options)
download_dir = "D:\VIX\FOMC_Minutes"
months = ['January', 'February', 'March', 'April', 'May', 'June',
      'July', 'August', 'September', 'October', 'November', 'December']

years = ['1983', '1982', '1981', '1980', '1979', '1978', '1977', '1976', '1975', '1974',
     '1973', '1972', '1971', '1970', '1969', '1968', '1967', '1966', '1965', '1964',
     '1963', '1962', '1961', '1960']

driver.get(f'https://fraser.stlouisfed.org/title/677')
driver.set_page_load_timeout(15)
pdf_links=[]
search = driver.find_element_by_class_name('list-search.form-control.input-sm')
for y in years:
    for m in months:
        print(m + "-" + y)
        search.clear()
        search.send_keys(y)
        WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, "//span[contains(., '" + y + "')]/parent::a")))
        xpath = "//span[contains(., '" + m + "') and contains(., '" + y + "')]/parent::a"
        elinks = driver.find_elements_by_xpath(xpath)
        if len(elinks)>0:
            print(elinks[0].get_attribute('href'))
            pdf_links.append(elinks[0].get_attribute('href'))

driver.quit()
print(pdf_links)

for url in pdf_links:
    page = driver.get(url)
    elinks = page.find_elements_by_css_selector("a[href*='.pdf']")
    for elink in elinks:
        download = driver.find_element_by_link_text(elink)
        download.click()

1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-01-24 14:27:08

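You need to modify your script slightly and use wget, which is a fairly standard way to download files. That makes the download both simpler and faster.

Make sure to add this import.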

import wget
driver.set_page_load_timeout(15)
pdf_links=[]

for y in years:

    for m in months:
        driver.get(f'https://fraser.stlouisfed.org/title/677')
        search = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "list-search.form-control.input-sm")))
        search.clear()
        search.send_keys(y)
        WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, "//span[contains(., '" + y + "')]/parent::a")))
        xpath = "//span[contains(., '" + m + "') and contains(., '" + y + "')]/parent::a"
        elinks = driver.find_elements_by_xpath(xpath)
        if len(elinks)>0:
            driver.get(elinks[0].get_attribute('href'))
            # linkEle = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, "//a[@class='btn btn-default btn-sm btn-block']")))
            linkEle = driver.find_elements_by_xpath("//a[@class='btn btn-default btn-sm btn-block']")
            if len(linkEle) >0:
                # Download into the current working directory:
                wget.download(linkEle[0].get_attribute('href'), m + "_" + y + ".pdf")
                # To download into a specific directory instead, define dir_path and use this call in place of the one above:
                # wget.download(linkEle[0].get_attribute('href'), dir_path + "/" + m + "_" + y + ".pdf")
driver.quit()
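
If you would rather collect the links first and download them in one pass after the browser is closed, something along these lines might work. This is only a sketch: it swaps wget for the standard library's urllib.request, and the names download_pdfs and target_path, as well as the (month, year, url) tuple layout, are my own assumptions and not part of the code above.

```python
import os
import shutil
import urllib.request


def target_path(dir_path, month, year):
    """Build the '<month>_<year>.pdf' file name inside dir_path (hypothetical helper)."""
    return os.path.join(dir_path, f"{month}_{year}.pdf")


def download_pdfs(links, dir_path):
    """Download each (month, year, url) entry into dir_path.

    Returns the list of paths that were written. Entries that cannot be
    fetched are reported and skipped, so one bad link does not stop the run.
    """
    os.makedirs(dir_path, exist_ok=True)
    saved = []
    for month, year, url in links:
        path = target_path(dir_path, month, year)
        try:
            # Stream the response straight to disk instead of holding it in memory.
            with urllib.request.urlopen(url, timeout=30) as response, open(path, "wb") as out:
                shutil.copyfileobj(response, out)
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            print(f"skipped {url}: {exc}")
            continue
        saved.append(path)
    return saved
```

To use it, you would append (m, y, linkEle[0].get_attribute('href')) to a list inside the loop instead of downloading right away, then call download_pdfs(that_list, r"D:\VIX\FOMC_Minutes") once the loop has finished. Note that a download interrupted halfway may leave a partial file behind.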
Votes: 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/59898313
