
Python: download/scrape SSRN papers from a list of URLs

Stack Overflow user
Asked 2022-05-23 22:55:37

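I have a set of links that are identical except for the id at the end. All I want to do is loop through each link and download the paper as a PDF using the "download as PDF" button. Ideally, the filename would be the title of the paper, but if that isn't possible I can rename the files later. Getting them all downloaded is more important. I have about 200 links, but I'll provide 5 here as examples.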

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134

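Is what I want to do possible? I have some familiarity with looping through URLs to scrape tables, but I have never tried to do anything with a download button.

I don't have example code because I don't know where to start. But something like: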

for url in urls:
    # go to each link
    # download as PDF via the "Download This Paper" button
    # save the file under the title of the paper
    ...

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-05-23 23:12:14

Try:

import requests
from bs4 import BeautifulSoup

urls = [
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134",
]

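# Send a browser-like User-Agent with every request.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}


for url in urls:
    # Fetch the abstract page and parse its HTML.
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    # The download link's href is relative to /sol3/, so prepend that prefix.
    pdf_url = (
        "https://papers.ssrn.com/sol3/"
        + soup.select_one("a[data-abstract-id]")["href"]
    )
    # Name the file after the abstract id at the end of the URL.
    filename = url.split("=")[-1] + ".pdf"

    print(f"Downloading {pdf_url} as {filename}")

    # Request the PDF with the abstract page as Referer and write the bytes to disk.
    with open(filename, "wb") as f_out:
        f_out.write(
            requests.get(pdf_url, headers={**headers, "Referer": url}).content
        )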

Prints:

Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3860262_code1719241.pdf?abstractid=3860262&mirid=1 as 3860262.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2521007_code576529.pdf?abstractid=2521007&mirid=1 as 2521007.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID4066577_code104690.pdf?abstractid=3146924&mirid=1 as 3146924.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2505208_code16198.pdf?abstractid=2488552&mirid=1 as 2488552.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3506882_code16198.pdf?abstractid=3330134&mirid=1 as 3330134.pdf

and saves the PDFs as:

andrej@PC:~$ ls -alF *pdf
-rw-r--r-- 1 root root  993466 máj 24 01:10 2488552.pdf
-rw-r--r-- 1 root root 3583616 máj 24 01:10 2521007.pdf
-rw-r--r-- 1 root root 1938284 máj 24 01:10 3146924.pdf
-rw-r--r-- 1 root root  685777 máj 24 01:10 3330134.pdf
-rw-r--r-- 1 root root  939157 máj 24 01:10 3860262.pdf
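
The question also asked for files named after the paper title. A minimal sketch of that, using only the standard library: `title_to_filename` and `looks_like_pdf` are hypothetical helpers, not part of the answer above, and the sketch assumes (unverified) that the abstract page's `<title>` tag holds the paper title.

```python
import re
import unicodedata


def title_to_filename(title, fallback, max_len=150):
    """Turn a paper title into a filesystem-safe PDF filename (hypothetical helper)."""
    # Normalize Unicode and collapse whitespace runs into single spaces.
    text = unicodedata.normalize("NFKC", title or "")
    text = re.sub(r"\s+", " ", text).strip()
    # Replace characters that are not allowed in Windows or POSIX filenames.
    text = re.sub(r'[\\/:*?"<>|]', "_", text)
    # Cap the length, then drop trailing dots and spaces.
    text = text[:max_len].rstrip(". ")
    # Fall back to the given name (e.g. the abstract id) if nothing is left.
    return (text or fallback) + ".pdf"


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF magic number."""
    return data.startswith(b"%PDF-")


# Possible use inside the loop above (assumption: soup.title is the paper title):
#   page_title = soup.title.get_text() if soup.title else ""
#   filename = title_to_filename(page_title, url.split("=")[-1])
#   if not looks_like_pdf(pdf_bytes): skip the file or retry later
```

Checking the `%PDF-` prefix before writing helps catch the case where the server returns an HTML page instead of the file, which would otherwise be saved silently with a `.pdf` extension.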
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/72355572
