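I am trying to use pdfminer.six to convert multiple PDFs in a directory into multiple .txt files, using Python 3.6.3.
When I run the code below, I get the following error: ModuleNotFoundError: No module named 'pdfminer'. Alternatively, when I run pdf2txt.py filename.pdf, it fails with env: \r: No such file or directory
I have done some research on this problem. I removed the original pdfminer from pip, so only pdfminer.six is installed there now. I am also running Python 3.6.3 inside a virtual environment.
Here is the code I am running: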
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

# converts pdf, returns its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(fname, 'rb')  # file() does not exist in Python 3; use open()
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()  # the parentheses are required to actually close the buffer
    return text

# converts all pdfs in directory pdfDir, saves all resulting txt files to txtDir
def convertMultiple(pdfDir, txtDir):
    if pdfDir == "":
        pdfDir = os.path.join(os.getcwd(), "")  # if no pdfDir passed in; adds the trailing separator
    for pdf in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf
            text = convert(pdfFilename)  # get string of text content of pdf
            textFilename = txtDir + pdf + ".txt"
            textFile = open(textFilename, "w")  # make text file
            textFile.write(text)  # write text to text file
            textFile.close()  # flush the text to disk

pdfDir = "../../data/raw/"
txtDir = "../../data/interim/"
convertMultiple(pdfDir, txtDir)

Posted on 2018-02-05 14:05:17
Install pdfminer3k, the package for Python 3.x.
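Before reinstalling anything, it may help to check which interpreter is actually running and whether it can see a top-level pdfminer module at all. The following is only a rough sketch: the helper name is_importable is made up for illustration, and it assumes that the installed package exposes its code under the module name pdfminer.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is actually running (virtualenv or system install).
    print("Interpreter:", sys.executable)
    if is_importable("pdfminer"):
        print("pdfminer is importable from this interpreter")
    else:
        # Installing through this exact interpreter avoids pip/Python mismatches.
        print("Not found; try:", sys.executable, "-m pip install pdfminer3k")
```

If the printed interpreter path is not the one inside the virtual environment, the package was probably installed for a different Python than the one running the script.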
Posted on 2018-04-14 05:36:18
Download the pdfminer3k tar.gz, extract it, and run python setup.py install from the extracted directory.
https://stackoverflow.com/questions/48056684