
Pdfminer python 3.5

Stack Overflow user
Asked 2016-10-04 14:24:27
5 answers, 37.4K views, 0 followers, 21 votes

I have worked through a few tutorials, but I cannot get this code block to run. I made the necessary switch from StringIO to BytesIO (I believe?).

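I am not sure why 'banana' prints nothing, and I think the error may be a red herring. Is this because I am following a Python 2.7 tutorial and trying to translate it to Python 3?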

errors: File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
    banana = convert("A1.pdf")
  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
    infile = file(fname, 'rb')
NameError: name 'file' is not defined

Script:

from io import BytesIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = BytesIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text

banana = convert("A1.pdf")
print(banana)

The same thing happens with this variant:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Banana = convert_pdf_to_txt("A1.pdf")
print(Banana)

I have tried searching for this (most of the pdfminer code comes from …), but with no luck.

Any insight is appreciated.

Cheers


5 Answers

Stack Overflow user

Accepted answer

Posted 2016-11-29 22:43:00

Solution for Python 3.5: you need pdfminer.six. On Windows 10 I could install it easily with

pip install pdfminer.six

You can check the installed version with

import pdfminer
print(pdfminer.__version__)
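
To check the installed version from inside a script without risking an ImportError, the following is a minimal sketch that asks the standard library for the distribution's version. It assumes Python 3.8 or newer (for `importlib.metadata`), and `installed_version` is a hypothetical helper name, not part of pdfminer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


# "pdfminer.six" is the distribution name passed to pip.
print(installed_version("pdfminer.six"))
```

A result of `None` means pip has not installed the package into the interpreter that is running the script.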

I have not tested pdfminer.six in depth yet, but I was able to run the following code to convert PDF → text and PDF → HTML.

36 votes

Stack Overflow user

Posted 2016-12-06 10:02:19

Improved solution (Dec 2016)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter,TextConverter,XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

def convert(case,fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    manager = PDFResourceManager() 
    codec = 'utf-8'
    caching = True

    if case == 'text' :
        output = io.StringIO()
        converter = TextConverter(manager, output, codec=codec, laparams=LAParams())     
    if case == 'HTML' :
        output = io.BytesIO()
        converter = HTMLConverter(manager, output, codec=codec, laparams=LAParams())

    interpreter = PDFPageInterpreter(manager, converter)   
    infile = open(fname, 'rb')

    for page in PDFPage.get_pages(infile, pagenums,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    convertedPDF = output.getvalue()  

    infile.close(); converter.close(); output.close()
    return convertedPDF

#//////////// main ///////////////////////
filePDF  = 'myDir//myPDF.pdf'     # input
fileHTML = 'myDir//myHTML.html'   # output
fileTXT  = 'myDir//myTXT.txt'     # output

case = "HTML"

if case == 'HTML' :
    convertedPDF = convert('HTML', filePDF, pages=[0,1])
    fileConverted = open(fileHTML, "wb")  # binary mode: the HTML result is bytes, and "wb" accepts no encoding
if case == 'text' :
    convertedPDF = convert('text', filePDF, pages=[0,1])
    fileConverted = open(fileTXT, "w", encoding="utf-8")

fileConverted.write(convertedPDF)
fileConverted.close()
#print(convertedPDF) 
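
The choice of `io.StringIO` for text and `io.BytesIO` for HTML in the code above comes down to whether the writer produces `str` or `bytes`. The sketch below shows that distinction using the standard library only; no pdfminer call is involved. That `TextConverter` writes `str` while `HTMLConverter` with a `codec` writes encoded bytes is an assumption about the installed pdfminer.six version, so verify it against your own install. The file names are illustrative.

```python
import io
import os
import tempfile

sample = "caf\u00e9"

text_buf = io.StringIO()
text_buf.write(sample)                     # str goes into a text buffer

bytes_buf = io.BytesIO()
bytes_buf.write(sample.encode("utf-8"))    # bytes go into a binary buffer

# Mixing the two fails: a binary buffer rejects str.
try:
    io.BytesIO().write(sample)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The same split applies when saving the result to disk.
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "out.txt")
html_path = os.path.join(out_dir, "out.html")

with open(txt_path, "w", encoding="utf-8") as f:   # text mode takes an encoding
    f.write(text_buf.getvalue())

with open(html_path, "wb") as f:                   # binary mode takes no encoding
    f.write(bytes_buf.getvalue())
```

Passing `encoding=` together with a binary mode such as `"wb"` raises `ValueError`, so the output mode has to match the buffer type that produced the data.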
13 votes

Stack Overflow user

Posted 2018-10-14 13:56:08

I tried this on Python 3.7 and it worked like a charm for me!

Here is the code I used:

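from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path_to_file):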
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path_to_file, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
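
Compared with the code in the question, this version calls `open` where the question called `file`. Python 3 has no `file` builtin, which is what the `NameError` in the traceback reports. Below is a minimal sketch of the replacement, using a throwaway temp file in place of a real PDF; the file contents and variable names are illustrative only.

```python
import builtins
import os
import tempfile

# Python 3 removed the `file` builtin; only `open` remains.
has_file_builtin = hasattr(builtins, "file")

# Create a small stand-in file so the example is self-contained.
fd, sample_path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(b"%PDF-1.4 placeholder")

# `with` closes the handle even if reading raises.
with open(sample_path, "rb") as infile:
    header = infile.read(8)

os.remove(sample_path)
print(has_file_builtin, header)
```

Opening the PDF in a `with` block also removes the need for the explicit `fp.close()` call.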
2 votes
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/39854841
