Q: Result produces unwanted extra line gaps between sentences

Stack Overflow user

Asked on 2016-02-05 18:26:15

Answers: 1, Views: 2.7K, Followers: 0, Votes: 2

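I am doing some OCR work with tesseract, and I wrote a simple Python wrapper for it. The problem is that the final text file contains unwanted blank lines between sentences, which I need to remove programmatically. For example: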

1 tbsp peanut or corn oil, plus a little
extra for Cooking the scallops

2 tbsp bottled mild or medium Thai
green curry paste
2 tbsp water

2 tsp light soy sauce

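Notice the blank lines between some of the lines; those are what I need to remove. If you have run into a similar problem, please share any suggestions. Thanks.

Here is the wrapper: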

from PIL import Image
import subprocess
import os
from wand.image import Image as WandImage  # aliased so it does not shadow PIL's Image
import markdown2
from textblob import TextBlob

import util
import errors

tesseract_exe = "tesseract" # Name of executable to be called at command line
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = True # Temporary files cleaned up after OCR operation
pagesegmode = "-psm 0"


def call_tesseract(input_file, output_file):
    # Run tesseract; it writes its result to output_file + ".txt"
    args = [tesseract_exe, input_file, output_file, pagesegmode]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode != 0:
        errors.check_for_errors()


def retrieve_text(scratch_text_name_root):
    # file() exists only in Python 2; open() works in both Python 2 and 3
    with open(scratch_text_name_root + '.txt') as inf:
        return inf.read()

def write_to_file(filename, string):
    with open(filename, 'w') as out:
        out.write(string)


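def image_to_string(filename):
    try:
        call_tesseract(filename, scratch_text_name_root)
        text = retrieve_text(scratch_text_name_root)
    finally:
        if cleanup_scratch_flag:
            try:
                # tesseract appends .txt to the output name
                os.remove(scratch_text_name_root + '.txt')
            except OSError:
                pass
    # Returning outside finally lets OCR errors propagate instead of being swallowed
    return text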

filename = "book/0001.bin.png"
text = image_to_string(filename)
print("writing to file")
write_to_file("0002.bin.txt", text)

Tags: python, tesseract

1 Answer

Stack Overflow user

Accepted answer

Posted on 2016-02-05 18:37:58

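I don't know why tesseract gives you these blank lines, but a simple workaround may help:

Just remove the blank lines. There are many ways to do this; for example, see here: https://stackoverflow.com/a/3711884/4175009

Or here:

https://stackoverflow.com/a/2369474/4175009

Both of these solutions assume that you read the file line by line.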
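
A minimal sketch of the line-by-line approach, in case it helps. It is not copied from either linked answer, and the function name strip_blank_lines_from_file and both path parameters are hypothetical:

```python
def strip_blank_lines_from_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that hold only whitespace."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # line.strip() is empty for blank or whitespace-only lines
            if line.strip():
                dst.write(line)
```

This could be run on the .txt file that tesseract produces, before the text is read back by the wrapper.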

I like this solution because you can apply it to the finished string, and it handles the operating-system differences in line endings (\n, \r, \r\n).
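
One possible way to do this on a finished string is sketched below. It is an assumption about what such a solution might look like, not necessarily the linked one, and remove_blank_lines is a hypothetical name:

```python
def remove_blank_lines(text):
    """Return text with empty and whitespace-only lines removed."""
    # str.splitlines() splits on \n, \r and \r\n (among other line
    # boundaries), so no OS-specific handling is needed here.
    kept = [line for line in text.splitlines() if line.strip()]
    # Lines are rejoined with "\n" regardless of the original line endings.
    return "\n".join(kept)


sample = "2 tbsp water\r\n\r\n2 tsp light soy sauce\n"
print(remove_blank_lines(sample))
```

In the wrapper from the question, this could be applied to the value returned by image_to_string before it is passed to write_to_file. Note that the result has no trailing newline.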

Votes: 2

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/35231009
