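I am trying to process an online PDF file with pdfplumber. However, pdfplumber does not seem to provide a urlopen; it only offers pdfplumber.open, which handles local PDF files.
Is there a solution?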
fp = pdfplumber.open("t1.PDF")
#fp = pdfplumber.urlopen('https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0124/2020012400942_c.pdf')

Posted on 2021-05-13 21:36:28
Example:
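from io import BytesIO
import requests
import pdfplumber

req = requests.get("https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0124/2020012400942_c.pdf")
# pdfplumber.open accepts a file-like object as well as a path;
# the older pdfplumber.load is not available in current releases.
with pdfplumber.open(BytesIO(req.content)) as pdf:
    print(pdf.pages[0].extract_text())

Posted on 2021-06-09 19:21:47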
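The pdfplumber repository is here. If you only need to process the PDF, you do not have to save it to disk first. You can read it straight from the link into memory, like this: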
import io
import urllib3
import pdfplumber

def extract_text_from_pdf_by_url(url):
    all_text = ''
    http = urllib3.PoolManager()
    # Fetch the PDF into an in-memory buffer instead of a file on disk.
    temp = io.BytesIO(http.request("GET", url).data)
    try:  # fails if the URL does not return a valid PDF file
        with pdfplumber.open(temp) as pdf:
            for pdf_page in pdf.pages:
                single_page_text = pdf_page.extract_text()
                # extract_text() may return None for a page with no text; skip it to avoid
                # TypeError: can only concatenate str (not "NoneType") to str
                if single_page_text is not None:
                    all_text += '\n' + single_page_text
    except Exception:
        # Not a readable PDF: return whatever text was collected so far.
        pass
    return all_text

url = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0124/2020012400942_c.pdf'
text = extract_text_from_pdf_by_url(url)
print(text)

https://stackoverflow.com/questions/59905947
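
Both answers assume the URL really returns a PDF. As a rough sketch, assuming the whole file fits in memory, you could check the payload for the `%PDF-` header before handing it to pdfplumber, so that an HTML error page produces a clear error instead of a parser failure. The helper names `looks_like_pdf` and `open_remote_pdf` are hypothetical and not part of pdfplumber; the download uses the standard library's urlopen rather than requests or urllib3.

```python
from io import BytesIO
from urllib.request import urlopen


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    # A well-formed PDF begins with "%PDF-" followed by the version number.
    return data.startswith(b"%PDF-")


def open_remote_pdf(url: str, timeout: float = 30.0):
    """Download url into memory and open it with pdfplumber (hypothetical helper)."""
    # Third-party import kept local so looks_like_pdf works without pdfplumber installed.
    import pdfplumber

    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        raise ValueError(f"{url} did not return a PDF")
    # pdfplumber.open accepts a file-like object, so no temporary file is needed.
    return pdfplumber.open(BytesIO(data))


# Example use (requires network access and pdfplumber):
# with open_remote_pdf(url) as pdf:
#     print(pdf.pages[0].extract_text())
```

The header check is deliberately strict; a file with stray bytes before `%PDF-` would be rejected here even if some readers accept it.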