
Q: How to split a PDF with Python where each page contains a specific piece of unique text

Stack Overflow user

Asked 2022-01-22 22:10:10

2 answers · 968 views · 0 followers · score 1

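I have a large PDF file that needs to be split every 'X' pages, but 'X' can vary. I need it to split whenever a page contains the text 'Name:' and the text following 'Name:' changes.

So page 1 might have 'Name: Sachin', page 2 might also have 'Name: Sachin', but page 3 has 'Name: Sarah', so it should produce one file with pages 1 to 2 and another with page 3.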
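
The grouping rule described above can be sketched without touching the PDF at all. This is a minimal sketch, assuming the name on each page has already been extracted into a list in page order; `group_consecutive_pages` and `page_names` are hypothetical names used only for illustration.

```python
from itertools import groupby


def group_consecutive_pages(names):
    """Return [(name, [page_index, ...]), ...] for runs of equal adjacent names."""
    groups = []
    # groupby only merges neighbours, so a name that reappears later starts a new group
    for name, run in groupby(enumerate(names), key=lambda pair: pair[1]):
        groups.append((name, [index for index, _ in run]))
    return groups


# Hypothetical per-page names, in page order (0-based page indices)
page_names = ["Sachin", "Sachin", "Sarah"]
print(group_consecutive_pages(page_names))
# [('Sachin', [0, 1]), ('Sarah', [2])]
```

Each resulting group would then correspond to one output file containing the listed pages.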

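Here is a script I found, except that it splits on every page regardless.

https://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/

Thanks in advance,

Sachin

Update

Below is some code that still splits on every page regardless, but it detects the name after the text 'Name:' and names each split file accordingly, so the file name includes that name.

But how can I update the code so that if two consecutive pages have the same name (after the text 'Name:'), it does not split at that page and instead combines the pages with the same name into one PDF file?

Thanks again,

Sachin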

import os
import re
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'Payslips.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
output_folder_path = os.path.join(os.getcwd(), 'Output')

pdf = PdfFileReader(pdf_file_path)

for page_num in range(pdf.numPages):

    # Setup Objects & Classes
    pdfWriter = PdfFileWriter()
    pageObj = pdf.getPage(page_num)
    pdfWriter.addPage(pageObj)

    # Extract Text
    Text = pageObj.extractText() 

    # print(Text)
    MatchedTextArray = re.findall(r"Name:[^0-9]+?\s", Text)
    MatchedText = (MatchedTextArray[0].replace('Name:', '')).replace('\n', '')
   
    # Splitting on UpperCase
    res_pos = [i for i, e in enumerate(MatchedText+'A') if e.isupper()]
    res_list = [MatchedText[res_pos[j]:res_pos[j + 1]]
            for j in range(len(res_pos)-1)]

    # Extracting Firstname
    firstname = res_list[1]

    # Extracting Surname
    del res_list[0:2]
    surname = ''.join(res_list)


    with open(os.path.join(output_folder_path, 
        '{0}, {1} - {2}.pdf'.format(surname.upper(), firstname.upper(), file_base_name.upper())), 
        'wb') as f:
        pdfWriter.write(f)

    print("Split Page " + str(page_num))

pdf

python

2 Answers

Stack Overflow user

Answered 2022-01-22 22:25:23

Something like this should work:

import os
from PyPDF2 import PdfFileReader, PdfFileWriter


def pdf_splitter(path):
    fname = os.path.splitext(os.path.basename(path))[0]
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
        output_filename = '{}_page_{}.pdf'.format(fname, page + 1)
        # Placeholder: replace with your own test for pages that should not be written yet
        your_condition = False
        if not your_condition:  # only write if the condition isn't met (anymore)
            with open(output_filename, 'wb') as out:
                pdf_writer.write(out)
            print('Created: {}'.format(output_filename))


if __name__ == '__main__':
    path = 'w9.pdf'
    pdf_splitter(path)
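
The answer leaves the condition open. One way to fill it in is to compare the name on the current page with the name on the previous page. This is a rough sketch, assuming the name sits on the same line as 'Name:' and contains no digits; `extract_name`, `starts_new_file` and the sample page texts are hypothetical, and the text PyPDF2 actually extracts may be spaced differently.

```python
import re

# Assumption: the name sits on the same line as "Name:" and contains no digits.
NAME_PATTERN = re.compile(r"Name:\s*([^\n\d]+)")


def extract_name(page_text):
    """Return the text after 'Name:' on a page, or None if no name is found."""
    match = NAME_PATTERN.search(page_text)
    if match is None:
        return None
    name = match.group(1).strip()
    return name or None


def starts_new_file(previous_text, current_text):
    """True when the current page names a different person than the previous page."""
    return extract_name(previous_text) != extract_name(current_text)


# Hypothetical page texts standing in for the output of extractText()
page_1 = "Payslip\nName: Sachin\nNet pay: 100"
page_2 = "Payslip\nName: Sachin\nNet pay: 120"
page_3 = "Payslip\nName: Sarah\nNet pay: 90"

print(starts_new_file(page_1, page_2))  # False: same name, keep adding pages
print(starts_new_file(page_2, page_3))  # True: name changed, write the file
```

With a check like this, the writer would be created once per person and written out only when the name changes or the last page is reached.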

Score 0

Stack Overflow user

Answered 2022-01-29 21:44:00

OK, I think I've solved it.

import os
import re
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'Payslips.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
output_folder_path = os.path.join(os.getcwd(), 'Output')
os.makedirs(output_folder_path, exist_ok=True)  # make sure the output folder exists
pdf = PdfFileReader(pdf_file_path)

# Split Files
count = 0
for page_num in range(pdf.numPages):

    # Skip pages that were already merged into the previous file
    if count > 0:
        count -= 1
        continue
         
    # Setup Objects & Classes
    pdfWriter = PdfFileWriter()
    pageObj = pdf.getPage(page_num)
    pdfWriter.addPage(pageObj)

    # Search on Current Page
    Text = pageObj.extractText() 
    MatchedTextArray = re.findall(r"Name:[^0-9]+?\s", Text)
    MatchedText = (MatchedTextArray[0].replace('Name:', '')).replace('\n', '')

    # Search on following Pages
    i = page_num + 1
    while i < pdf.numPages:
        pageObjNext = pdf.getPage(i)
        TextNext = pageObjNext.extractText() 
        MatchedTextArrayNext = re.findall(r"Name:[^0-9]+?\s", TextNext)
        MatchedTextNext = (MatchedTextArrayNext[0].replace('Name:', '')).replace('\n', '')

        if MatchedText == MatchedTextNext:
            i += 1
            count += 1
            pdfWriter.addPage(pageObjNext)
        else:
            break

    # Splitting on UpperCase
    res_pos = [i for i, e in enumerate(MatchedText+'A') if e.isupper()]
    res_list = [MatchedText[res_pos[j]:res_pos[j + 1]] for j in range(len(res_pos)-1)]

    # Extracting Firstname
    firstname = res_list[1]

    # Extracting Surname
    surname = ''
    del res_list[0:2]
    if len(res_list) == 1:
        surname = surname + res_list[0]
    else:
        surname = surname + res_list[0]
        for i in (n+1 for n in range(len(res_list)-1)):
            if res_list[i-1][-1] == "-" or res_list[i-1][-1] == "'" :
                surname = surname + res_list[i]
            else:
                surname = surname + " " + res_list[i]
 
    # Write PDF File
    with open(os.path.join(output_folder_path, 
        '{0}, {1}'.format(surname.upper(), firstname.upper())), 'wb') as f:
        pdfWriter.write(f)

# Rename Files in Output Directory
files = os.listdir(output_folder_path)
for file in files:
    os.rename(os.path.join(output_folder_path, file), 
    os.path.join(output_folder_path, 'WE 25JAN 2022 - ' + file + ' - PAYSLIP' + '.pdf'))
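
The rename loop above touches every file in the output folder, so running the script a second time would rename earlier results again. A possible alternative is to build the final file name before writing. This is only a sketch: `payslip_filename` and its `prefix` and `suffix` parameters are hypothetical, and the sample names are made up.

```python
def payslip_filename(surname, firstname, prefix="WE 25JAN 2022", suffix="PAYSLIP"):
    """Build the final file name in the same shape the rename loop produces."""
    return "{0} - {1}, {2} - {3}.pdf".format(
        prefix, surname.upper(), firstname.upper(), suffix)


# Hypothetical names, for illustration only
print(payslip_filename("Smith", "Sachin"))
# WE 25JAN 2022 - SMITH, SACHIN - PAYSLIP.pdf
```

The returned name could be passed straight to `os.path.join(output_folder_path, ...)` in the write step, which would make the separate rename pass unnecessary.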

Score 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70817546
