
Converting PDF to Excel/csv/xlsx

Stack Overflow user
Asked on 2019-02-18 00:36:01
Answers: 1, Views: 5.9K, Followers: 0, Votes: 1

My intention is to convert the text of a PDF into an Excel/CSV file, as shown below:

PDF file: (source file)

#_________________________________________________________________________
appliance
n. 1. See server appliance. 2. See information appliance. 3. A device with a single or limited ......

appliance server
n. 1. An inexpensive computing .....2. See server appliance. 

application
n. A program designed ......
#________________________________________________________________________

Excel File : (Target File)
#________________________________________________________________________
appliance              , n. ,          1. See server appliance    ,
appliance server       , n. ,          1. An inexpensive co       ,
application            , n. ,          A program designed ......  ,  
#________________________________________________________________________

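I have converted the PDF to text and tried splitting it on "," so that I could then turn the text file into a CSV file. But I am stuck after converting the PDF to a text file.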

import os
from time import strftime

import PyPDF2


def check_path(prompt):
    ''' (str) -> str
    Prompt until the user enters a path that exists, then return it.
    '''
    abs_path = input(prompt)
    while not os.path.exists(abs_path):
        print("\nThe specified path does not exist.\n")
        abs_path = input(prompt)
    return abs_path


print("\n")
folder = check_path("Provide absolute path for the folder: ")

# Collect every PDF under the folder, including subfolders.
pdf_paths = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            # Join with root (not the top folder) so nested files resolve.
            pdf_paths.append(os.path.join(root, filename))

for pdf_path in pdf_paths:
    # Write the text file next to the PDF, swapping the extension.
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    content = ""
    with open(pdf_path, "rb") as pdf_file:
        # PyPDF2 >= 2.0 API; older releases use PdfFileReader/getPage/extractText.
        pdf = PyPDF2.PdfReader(pdf_file)
        for page in pdf.pages:
            # Extract text from page and add to content
            content += (page.extract_text() or "") + "\n"
    print(strftime("%H:%M:%S"), " pdf  -> txt ")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(content)

1 Answer

Stack Overflow user

Answered on 2019-02-22 01:11:01

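You may need to convert the PDF to CSV first, and then reshape that CSV into the layout you want.

This API can be used from Python to convert one or more PDF files to CSV: https://pdftables.com/pdf-to-excel-api

To convert a single PDF: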

import pdftables_api

c = pdftables_api.Client('my-api-key')
c.csv('input.pdf', 'output.csv')  # csv() rather than xlsx(), since the target is a CSV file

Or to convert multiple PDFs:

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.csv(os.path.join(file_path, file), file + '.csv')  # csv() so the output matches its .csv name
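
For the second step this answer mentions, reshaping the extracted text into the term / part of speech / definition layout from the question, here is a minimal sketch using only the standard library. It assumes that entries are separated by blank lines, that the first line of each entry is the term, and that the definition starts with a short abbreviation such as "n.". The names `SAMPLE_TEXT`, `BODY_RE`, `parse_entries` and `entries_to_csv` are made up for illustration, and text extracted from a real PDF may need extra cleanup before it fits these assumptions.

```python
import csv
import io
import re

# Sample copied from the question; real extracted text may be messier.
SAMPLE_TEXT = """\
appliance
n. 1. See server appliance. 2. See information appliance. 3. A device with a single or limited ......

appliance server
n. 1. An inexpensive computing .....2. See server appliance.

application
n. A program designed ......
"""

# Assumed pattern: a part-of-speech abbreviation ("n.", "adj.", ...)
# followed by the definition text.
BODY_RE = re.compile(r"^(?P<pos>[a-z]{1,4}\.)\s+(?P<definition>.+)$")


def parse_entries(text):
    """Return (term, part_of_speech, definition) tuples from glossary text."""
    rows = []
    # Entries are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # skip fragments that lack a term or a definition
        term, body = lines[0], " ".join(lines[1:])
        match = BODY_RE.match(body)
        if match:
            rows.append((term, match.group("pos"), match.group("definition")))
        else:
            # No recognisable abbreviation: keep the whole body as the definition.
            rows.append((term, "", body))
    return rows


def entries_to_csv(rows):
    """Serialise rows as CSV text; the csv module quotes embedded commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["term", "part_of_speech", "definition"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_entries(SAMPLE_TEXT)
print(entries_to_csv(rows))
```

Writing through `csv.writer` instead of joining fields with "," by hand keeps definitions that contain commas in a single column, which is the main risk with the split-on-comma approach described in the question.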
Votes: 0
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/54735284
