Q: Convert PDF to XLS

Stack Overflow user

Asked 2021-10-20 11:41:01

Answers: 1 · Views: 210 · Followers: 0 · Votes: 0

I want to convert a PDF file to CSV or XLS. I tried doing this with Python and tabula:

#!/usr/bin/env python3
import tabula

# Read pdf into list of DataFrame
df = tabula.read_pdf("File1.pdf", pages='all')

# convert PDF into CSV file
tabula.convert_into("File1.pdf", "File1.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
#tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

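The Python script does convert the PDF to CSV, but the decimals come out wrong.

For example, 1.25 shows up as just 1.2.

So I would like to increase the number of decimal places to 2, so that the converted CSV file contains the correct numbers.

Can anyone help me with this?

Thanks.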

tabula

python

pdf

python-3.7

pdftotext

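1 Answer

Stack Overflow user

Answered 2021-10-24 14:54:04

Depending on the requirement, we need to adjust the tabula parameters so that the imported data makes sense. The parameters I suggested in the comments were only an example. To find the x-coordinates at which each column starts, we either need the paid version of Acrobat or some trial and error.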
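To see why the column x-coordinates matter for a value such as 1.25, here is a minimal sketch of splitting text at fixed x-positions. It is a simplified model and not tabula's actual algorithm; the function `split_by_columns` and the glyph positions are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def split_by_columns(chars, boundaries):
    """Group (x, char) pairs into cells using fixed x-coordinate boundaries.

    A character at position x lands in cell i, where i is the number of
    boundaries that are <= x. `boundaries` must be sorted in ascending order.
    """
    cells = [""] * (len(boundaries) + 1)
    for x, ch in sorted(chars):
        cells[bisect_right(boundaries, x)] += ch
    return cells


# Hypothetical glyph positions for the value "1.25" (not taken from a real PDF)
glyphs = [(300, "1"), (305, "."), (310, "2"), (315, "5")]

# A boundary that lands on the last digit pushes that digit into the next cell
print(split_by_columns(glyphs, [250, 315]))  # ['', '1.2', '5']

# A boundary placed past the number keeps the value in one cell
print(split_by_columns(glyphs, [250, 330]))  # ['', '1.25', '']
```

If a boundary cuts through a number like this, adjusting the coordinates is the fix; changing the output precision afterwards would not bring the missing digit back.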

So the code should look something like this.

Imports and setup

import tabula
import pandas as pd
pdf_file='file1.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']
df_results=[] # store results in a list

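Because the pages are formatted differently, we need to process them separately. Some cleanup is also required, such as dropping unwanted columns or discarding the data that follows a certain value (see the page 2 processing).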
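The cleanup described here can be tried in isolation on a small frame before running it against the PDF. This is a sketch that assumes pandas is installed; the helper name `rows_before_marker` and the sample rows are hypothetical, and unlike a plain index lookup it returns the frame unchanged when the marker row is absent.

```python
import pandas as pd


def rows_before_marker(df, marker, column=0):
    """Return the rows above the first row whose `column` equals `marker`.

    If the marker never appears, the frame is returned unchanged.
    """
    mask = (df[column] == marker).to_numpy()
    if not mask.any():
        return df
    # argmax on a boolean array gives the position of the first True
    return df.iloc[: mask.argmax()]


# Hypothetical stand-in for one page returned by tabula.read_pdf(...)[0]
raw = pd.DataFrame([
    ["A1", "x", 1.25],
    ["A2", "y", 2.50],
    ["Sta", "z", 0.0],    # marker row: everything from here on is discarded
    ["tail", "w", 9.9],
])

clean = rows_before_marker(raw, "Sta").drop(columns=1)  # drop an unwanted column
clean.columns = ["Product", "Net Wt.kg"]
print(clean)
```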

# Page 1 processing
try:
    # area is (top, left, bottom, right); columns holds the x-coordinates of the column boundaries
    df1 = tabula.read_pdf(pdf_file, pages=1, area=(95, 20, 800, 840),
                          columns=[93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520],
                          pandas_options={'header': None})
    df1[0] = df1[0].drop(columns=5)  # drop the unwanted column
    df1[0].columns = column_names
    df_results.append(df1[0])
except Exception as e:
    print(f"Page 1 processing failed: {e}")

# Page 2 processing
try:
    # area is (top, left, bottom, right); columns holds the x-coordinates of the column boundaries
    df2 = tabula.read_pdf(pdf_file, pages=3, area=(10, 20, 800, 840),
                          columns=[93, 180, 220, 252, 310, 315, 330, 370, 410, 450, 480, 520],
                          pandas_options={'header': None})

    # Keep only the rows above the first row whose first column is 'Sta'
    row_with_Sta = df2[0][df2[0][0] == 'Sta'].index.tolist()[0]
    df2[0] = df2[0].iloc[:row_with_Sta]
    df2[0] = df2[0].drop(columns=5)  # drop the unwanted column
    df2[0].columns = column_names
    df_results.append(df2[0])
except Exception as e:
    print(f"Page 2 processing failed: {e}")

# result = pd.concat([df1[0], df2[0]])  # alternative: concatenate the two pages directly
result = pd.concat(df_results)  # concatenate the list of pages
result.to_csv("result.csv")  # write the combined table to CSV
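If the goal is to always write two decimal places in the CSV, pandas can format floats on output. The sketch below uses a hypothetical sample frame in place of the real `result`. Note that `float_format` only affects columns that hold floats, and it cannot restore digits that were already lost when the PDF was read.

```python
import pandas as pd

# Hypothetical sample standing in for the concatenated result
sample = pd.DataFrame({"Product": ["A", "B"], "Net Wt.kg": [1.25, 3.5]})

# Without a file path, to_csv returns the CSV text instead of writing a file
csv_text = sample.to_csv(index=False, float_format="%.2f")
print(csv_text)
```

With a path as the first argument, for example `result.to_csv("result.csv", float_format="%.2f")`, the same formatting is applied when writing the file.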

Please test the code, as I have only verified it to a limited extent :)

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/69645338
