I need to extract data from multiple filled-in PDFs. Can anyone help me fix my code?
from fillpdf import fillpdfs
import os
import glob
import pdfplumber
import PyPDF2

# Loop over every PDF in the folder and print the text of its first page
for file in glob.glob("C:/Users/Sam/OneDrive/Desktop/Pro/*.pdf"):
    with pdfplumber.open(file) as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        # Collapse line breaks and repeated whitespace into single spaces
        text = " ".join(text.split())
        print(text)

Posted on 2022-04-19 21:10:18
Use pdfplumber; see the documentation on GitHub under "Extracting form values".
import glob
import pdfplumber


def extract_text(pdf):
    # Text of the first page, with whitespace collapsed to single spaces
    page = pdf.pages[0]
    text = page.extract_text() or ""  # guard against a page with no extractable text
    text = " ".join(text.split())
    return text


def extract_form(pdf):
    # A PDF without an AcroForm entry has no fillable form fields
    if "AcroForm" not in pdf.doc.catalog.keys():
        return None
    form = pdf.doc.catalog["AcroForm"]
    form_data = {}
    for field in form["Fields"]:
        f = field.resolve()
        field_name = f["T"]
        # "V" holds the field value; it is absent for fields never filled in
        field_value = f["V"] if "V" in f.keys() else None
        form_data[field_name] = field_value
    return form_data


for file in glob.glob("output*.pdf"):
    print(f"- File: {file}")
    with pdfplumber.open(file) as pdf:
        text = extract_text(pdf)
        print(f"--- Text:\n{text}")
        form = extract_form(pdf)
        print(f"--- Form Data:\n{form}")

Prints:
- File: output_filled.pdf
--- Text:
First Name: Last Name: Street Address: Apt #: Zipcode: City: Spirit: Date: Evil: Language: Français Esperanto Latin Signature:
--- Form Data:
{b'Given Name Text Box': b'Hello', b'Family Name Text Box': b'World', b'Address 1 Text Box': b'\xfe\xff', b'City Text Box': b'\xfe\xff', b'Gender List Box': b'Panda', b'Height Formatted Field': None, b'Driving License Check Box': /'Off', b'Apt Number Text Box': b'\xfe\xff', b'Zipcode Text Box': b'\xfe\xff', b'Latin Check Box': /'Off', b'French Check Box': None, b'Esperanto Check Box': None, b'signature_1': None}
- File: output.pdf
--- Text:
First Name: Faye Last Name: Valentine Street Address: 223B Baker Street Apt #: 6F Zipcode: 94455 City: London Spirit: Panda Date: 12/04/19 Evil: Language: Français Esperanto Latin Signature:
--- Form Data:
None

Tested with the example files from pdfjinja:
sample.pdf (has form data values)
output.pdf (text only)

See also

Possibly of interest: a related blog post about filling in PDF forms.
https://stackoverflow.com/questions/71857894
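
The form data printed above is still raw: keys and values are bytes, an empty text box shows up as b'\xfe\xff' (a bare UTF-16 byte order mark), and check box states are name objects such as /'Off'. Below is a minimal sketch of how those values might be turned into plain strings. The helpers decode_value and decode_form are hypothetical names, not part of pdfplumber. Two assumptions are baked in: name objects expose their text through a `name` attribute, and Latin-1 is used as a rough stand-in for PDFDocEncoding, which is only accurate for ordinary Latin text.

```python
def decode_value(value):
    """Turn a raw AcroForm key or value into a plain string (or None)."""
    if value is None:
        return None
    # Assumption: name objects (check box states such as /'Off') keep
    # their text in a `name` attribute; bytes and str have no such attribute.
    name = getattr(value, "name", None)
    if name is not None:
        value = name
    if isinstance(value, bytes):
        if value.startswith(b"\xfe\xff"):
            # UTF-16BE text with a byte order mark; a bare mark is an empty string.
            return value[2:].decode("utf-16-be", errors="replace")
        # Assumption: Latin-1 approximates PDFDocEncoding for common characters.
        return value.decode("latin-1")
    return str(value)


def decode_form(form_data):
    """Decode every key and value of a dict shaped like extract_form's result."""
    return {decode_value(k): decode_value(v) for k, v in form_data.items()}


# A few entries copied from the output above
raw = {
    b"Given Name Text Box": b"Hello",
    b"Address 1 Text Box": b"\xfe\xff",
    b"Height Formatted Field": None,
}
decoded = decode_form(raw)
print(decoded)
# {'Given Name Text Box': 'Hello', 'Address 1 Text Box': '', 'Height Formatted Field': None}
```

Since extract_form returns None for a PDF without an AcroForm, check for that before calling decode_form on its result.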