Q: Merge PDF pages into one file without generating individual page files

Stack Overflow user

Asked 2022-02-08 18:32:36

Answers: 1, Views: 1.2K, Followers: 0, Votes: 1

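The goal is to take a set of jpg/tif images and convert them into a single text-searchable PDF. I'm using Python's PyPDF2 and pytesseract for this; however, I can't find a way to combine the pages without first saving each page as its own PDF. It turns out that some of these sets can be 1k-10k pages, so saving every page individually is unfortunately no longer viable. Here is what I have so far: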

# Convert each image to a searchable PDF
for fileset in filesets:
    merger = PdfFileMerger()
    page_path = fr".\output\pages"
    for file in fileset:
        # Load image, read with pytesseract
        path = os.path.join(download_location,file)
        img = cv2.imread(path,1)
        result = (pytesseract.image_to_pdf_or_hocr(img, lang="eng",config=tessdata_dir_config))
        # Save result as PDF
        with open(os.path.join(path_out, getfilename.findall(file)[0]) + ".pdf", "wb") as f:
            f.write(result)

This works fine for single pages, and from here I can merge the pages and save them as one document, like so:

# pdfs is a list of all the single-page PDFs
for page in pdfs:
    merger.append(page)

merger.write(fr".\output\{FILE}.pdf")
merger.close()
del merger

# Get rid of single-page files
for page in pdfs:
    os.remove(page)

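This produces a text-searchable PDF, but these individual page files will destroy my memory. I tried appending the result object to the merger directly, which raises AttributeError: 'bytearray' object has no attribute 'seek'. I also tried reading the result object as a PDF with PyPDF2.PdfFileReader() and got a similar result. Any ideas?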

Tags: python, pdf, merge, pypdf2

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-02-14 15:37:32

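You need to use BytesIO (from the io module). PdfFileMerger.append() expects a file path or a file-like object, not raw bytes.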
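The AttributeError in the question comes from passing plain bytes where a seekable stream is expected. A minimal sketch of the difference, using only the standard library; the placeholder bytes are an assumption for illustration and are not a valid PDF:

```python
from io import BytesIO

# Placeholder bytes for illustration only; this is not a valid PDF.
raw = bytearray(b"%PDF-1.4 placeholder")

# A bytearray is plain data and has no file methods such as seek().
has_seek_raw = hasattr(raw, "seek")

# BytesIO wraps the same bytes in a seekable, file-like stream.
stream = BytesIO(raw)
has_seek_stream = hasattr(stream, "seek")

# The stream can be rewound and read like an open binary file.
stream.seek(0)
header = stream.read(5)
print(has_seek_raw, has_seek_stream, header)
```

The wrapped stream behaves like a file opened in binary mode, which is what lets it stand in for the per-page files.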

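from io import BytesIO

for fileset in filesets:
    merger = PdfFileMerger()
    for file in fileset:
        # Load image, read with pytesseract
        path = os.path.join(download_location, file)
        img = cv2.imread(path, 1)
        # image_to_pdf_or_hocr returns the single-page PDF as bytes
        result = pytesseract.image_to_pdf_or_hocr(img, lang="eng", config=tessdata_dir_config)
        # Wrap the bytes in a seekable in-memory stream instead of writing a page file
        merger.append(BytesIO(result))

    # Write one merged PDF per fileset, then release the merger
    merger.write(fr".\output\{FILE}.pdf")
    merger.close()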
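The same idea can be factored into a small helper that takes any iterable of per-page PDF bytes, so the OCR step can be a generator and no page files are written. This is only a sketch: merge_in_memory is a hypothetical name, and the merger argument is assumed to expose append(), write() and close() the way PyPDF2's PdfFileMerger does.

```python
from io import BytesIO
from typing import Iterable


def merge_in_memory(pages: Iterable[bytes], merger, out_path: str) -> int:
    """Append each page's PDF bytes to `merger` and write one output file.

    `merger` is assumed to expose append(fileobj), write(path) and close(),
    as PyPDF2's PdfFileMerger does. Returns the number of pages appended.
    """
    count = 0
    for pdf_bytes in pages:
        # Each page is wrapped in a seekable in-memory stream.
        merger.append(BytesIO(pdf_bytes))
        count += 1
    merger.write(out_path)
    merger.close()
    return count
```

With the names from the question, the pages argument could be a generator that yields pytesseract.image_to_pdf_or_hocr(...) for each image, and the merger a fresh PdfFileMerger() (newer PyPDF2 releases rename this class to PdfMerger). The merger still holds every appended page in memory until write() is called, so for very large sets this trades temporary files for RAM.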

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/71039075
