I want to read two PDF files from a URL without downloading them, and then extract the text with pdftotext.

import pdftotext

with open("pdf_path1", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("b.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))

How can I fix this error? Or is there another technique for reading a PDF from a URL?

Posted on 2022-05-30 21:23:32

You can open the file directly from the URL with urllib.request and then work with it as a PDF:

import pdftotext
from urllib.request import urlopen

target_url = "https://arxiv.org/pdf/2012.05439.pdf"  # Change this to your own URL.

# urlopen returns a file-like object with a read() method,
# which is what pdftotext.PDF expects.
with urlopen(target_url) as response:
    pdf = pdftotext.PDF(response)  # Add the password as a second argument if the PDF is protected.

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))

Posted on 2022-05-30 23:05:44

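You cannot read someone else's PDF "online". It always has to be your own copy, because every PDF must be downloaded before it can be processed. Your computer can only work on pages and their contents locally; that was the case in the past and it still is.

How the web works, in one line (more graphical approaches are available):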
<A HyperRef=HTextTransferProtocol://www.website.html>download to view our BBS pages</a>

curl -o temp.pdf https://arxiv.org/pdf/2012.05439.pdf && pdftotext -layout -f 1 -l 1 temp.pdf -

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1318k  100 1318k    0     0   488k      0  0:00:02  0:00:02 --:--:--  488k
Scheduling Beyond CPUs for HPC....
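
For anyone who prefers to stay in Python, the same download-then-extract flow can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the helper names (`download_to_temp`, `pdftotext_command`, `extract_text_from_url`) and the 30-second timeout are made up for this example, and it assumes the `pdftotext` command-line tool is installed and on the PATH. The flags are the ones from the one-liner above.

```python
import subprocess
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_to_temp(url, timeout=30.0):
    """Save the bytes behind `url` to a temporary .pdf file and return its path."""
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    # delete=False keeps the file on disk after the handle is closed,
    # so an external program can open it by name.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
    return Path(tmp.name)


def pdftotext_command(pdf_path, first_page=1, last_page=1):
    """Build the pdftotext call from the one-liner; the final "-" sends text to stdout."""
    return [
        "pdftotext", "-layout",
        "-f", str(first_page),
        "-l", str(last_page),
        str(pdf_path), "-",
    ]


def extract_text_from_url(url, first_page=1, last_page=1):
    """Download the PDF, run the pdftotext CLI on it, and remove the temporary copy."""
    pdf_path = download_to_temp(url)
    try:
        result = subprocess.run(
            pdftotext_command(pdf_path, first_page, last_page),
            capture_output=True,
            encoding="utf-8",  # assumes pdftotext's default UTF-8 output
            check=True,        # raise CalledProcessError if pdftotext fails
        )
        return result.stdout
    finally:
        pdf_path.unlink(missing_ok=True)
```

Calling `extract_text_from_url("https://arxiv.org/pdf/2012.05439.pdf")` should return the text of the first page as a string. To handle two PDFs, call it once per URL. The temporary file is deleted in the `finally` block, so nothing is left on disk afterwards, but the bytes still have to be transferred to your machine first.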

https://stackoverflow.com/questions/72438825