Cut the pages down the middle before you scan.

It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
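As a rough illustration of the region idea, here is a minimal sketch that computes one bounding box per column from the page size. The `Region` type, the `column_regions` function, and the `margin` and `gutter` parameters are all hypothetical names made up for this example; a real OCR SDK will have its own way of describing zones, so treat this only as the arithmetic you would feed into it.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Pixel bounding box: left/top inclusive, right/bottom exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def column_regions(page_width: int, page_height: int, columns: int = 2,
                   margin: int = 0, gutter: int = 0) -> List[Region]:
    """Return one Region per column, ordered left to right.

    margin is the blank border around the page, gutter the blank strip
    between neighbouring columns (both are assumptions about the layout).
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    usable = page_width - 2 * margin - (columns - 1) * gutter
    if usable <= 0 or page_height <= 2 * margin:
        raise ValueError("margins and gutters leave no room for text")
    col_width = usable // columns
    regions = []
    for i in range(columns):
        left = margin + i * (col_width + gutter)
        regions.append(Region(left, margin, left + col_width, page_height - margin))
    return regions


if __name__ == "__main__":
    for region in column_regions(1000, 1400, columns=2, margin=50, gutter=20):
        print(region)
```

Each region would then be passed to whatever "recognize this zone" call your OCR API offers, left column first, so the text comes out in reading order.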

I use OmniPage 17 for such things. It has a batch mode too, where you put the documents in one folder, it picks them up from there, and it writes the results to another folder.
It recognizes the layout automatically, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo first to see whether it works correctly for you. At the moment I have problems with ligatures in some of my documents: words like "fliegen" come out as "fl iegen", so you have to correct them afterwards.
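If your OCR tool has no batch mode of its own, the same folder-in, folder-out workflow can be approximated with a few lines of script. This is only a sketch and has nothing to do with how OmniPage works internally: `process_folder` is a hypothetical helper, and `recognize` stands in for whatever function or command-line call actually performs the OCR.

```python
from pathlib import Path
from typing import Callable, List


def process_folder(inbox: Path, outbox: Path,
                   recognize: Callable[[bytes], str],
                   pattern: str = "*.pdf") -> List[Path]:
    """Run recognize() on every matching file in inbox, write text to outbox.

    Files that already have a result in outbox are skipped, so the function
    can be re-run safely after an interruption.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(inbox.glob(pattern)):
        target = outbox / (source.stem + ".txt")
        if target.exists():
            continue  # already processed in an earlier run
        target.write_text(recognize(source.read_bytes()), encoding="utf-8")
        written.append(target)
    return written
```

Skipping existing results matters when there are thousands of protocols: a crash halfway through does not force you to redo the finished ones.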

Take a look at <a href="http://www.wisetrend.com/wisetrend_ocr_cloud.shtml" rel="nofollow">http://www.wisetrend.com/wisetrend_ocr_cloud.shtml</a> (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
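The last-resort splitting step could look roughly like the sketch below. It assumes the page has already been binarized into rows of 0 (background) and 1 (ink), which you would get from an image library of your choice; `find_gutter`, `split_columns`, and the `search_band` parameter are hypothetical names for this example. Instead of hard-coding the split position, it looks for the widest run of emptiest pixel columns near the middle of the page, which tolerates small shifts between scans.

```python
from typing import List, Sequence, Tuple

Bitmap = Sequence[Sequence[int]]  # rows of 0 (background) / 1 (ink)


def find_gutter(rows: Bitmap, search_band: float = 0.5) -> int:
    """Return the x position at which to split a two-column page.

    Counts ink pixels per pixel column, then picks the centre of the longest
    run of least-inked columns inside the central search_band of the page.
    """
    width = len(rows[0])
    ink = [sum(row[x] for row in rows) for x in range(width)]
    lo = int(width * (0.5 - search_band / 2))
    hi = int(width * (0.5 + search_band / 2))
    min_ink = min(ink[lo:hi])
    best_start, best_len, run_start = lo, 0, None
    for x in range(lo, hi + 1):
        if x < hi and ink[x] == min_ink:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            if x - run_start > best_len:
                best_start, best_len = run_start, x - run_start
            run_start = None
    return best_start + best_len // 2


def split_columns(rows: Bitmap, search_band: float = 0.5) -> Tuple[List, List]:
    """Split the bitmap into (left, right) halves at the detected gutter."""
    gutter = find_gutter(rows, search_band)
    left = [list(row[:gutter]) for row in rows]
    right = [list(row[gutter:]) for row in rows]
    return left, right
```

On real scans the gutter is rarely perfectly clean, and a vertical rule between the columns would show up as ink, so some noise removal before this step is likely to be needed.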

For a contract job, I need to digitize a lot of old, image-only scanned PDFs of plenary debate protocols from the Federal Parliament of Germany.

The problem is that most of these files have a two-column format:

<a href="http://sert.homedns.org/img/btp12001.png" rel="nofollow noreferrer">Sample Protocol http://sert.homedns.org/img/btp12001.png</a>

I would love to read your answers to the following questions:

<ol>
<li>How can I split the two columns before feeding them into OCR?</li>
<li>Which commercial or open-source OCR software or framework do you recommend, and why?</li>
</ol>

Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!

UPDATE: These documents have already been scanned by the parliament o_O: <a href="http://dip.bundestag.de/btp/12/12001.pdf" rel="nofollow noreferrer">sample</a> (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many of them.

Best Regards, 
Cetin Sert

optical character recognition of PDFs of parliamentary debates
