Q: How to replace ambiguous characters in words that follow a specific format

Stack Overflow user

Asked on 2021-02-01 22:11:13

1 answer · 142 views · 0 followers · 1 vote

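I am using tesseract OCR to extract text from different documents, and then I process the extracted text with a regex to see whether it matches a specific pattern. Unfortunately, the OCR makes common mistakes on ambiguous characters, such as 5: S, 1: I, 0: O, 2: Z, 4: A, 8: B, etc. These mistakes are so common that substituting the ambiguous characters would make the text match the pattern perfectly.

Is there a way to post-process the OCR extraction and substitute ambiguous characters (provided in advance) by following a specific pattern?

Expected output (and what I have come up with so far):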

# example: I am extracting car plate numbers that always follow the pattern [A-Z]{2}\d{5}
# patterns might differ for other examples, but will always be some alphanumeric combination
# complex patterns may be ignored with some warning like "unable to parse"

import re

def post_process(pattern, text, ambiguous_dict):
    # get text[0], check it against the pattern
    # in this case it should be a letter; if not, try to replace it from the dict; if yes, pass

    # continue with the next characters until a match is found or the whole text has been looped over

    match = None  # placeholder: the matching logic described above is what this question asks for
    if match:
        return match
    else:
        # some error message
        return None



ambiguous_dict = {'2': 'Z', 'B': '8'}

# My plate photo text: AZ45287
# Noise is fairly easy to filter out by filtering on tesseract confidence level, although not ideal
# so, if a function cannot be made that would find a match through the noise
# the noise can be ignored in favor of a simpler function that can just find a match
ocr_output = "someNoise A2452B7 no1Ze"  


# The 2 in position 1 is replaced by Z and B is replaced by 8,
# while the '2' on pos 5 should remain a 2 as per the pattern.
# It would be acceptable if the function will do this iteratively
# for each element of ocr_output until the pattern is matched, or return None.
# Any other functionally similar (recursive, generator, other) approach is also acceptable.
result = post_process(r"[A-Z]{2}\d{5}", ocr_output, ambiguous_dict)

if result:
    print(result) # AZ45287
else: # result is none
    print("failed to clean output")

I hope I have explained my problem well, but feel free to ask for more information.

python

regex

ocr

1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-02-02 17:27:21

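As with anything involving OCR, it is hard to come up with a solution that is 100% safe and reliable. What you can do in this case is add the "corrupted" characters to the regex, and then "normalize" the matches using a dictionary of replacements.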

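This means that you cannot use [A-Z]{2}\d{5} as it stands, because the first two uppercase letters may contain a 2 (a misread Z) and the five digits may contain a B (a misread 8). Thus, you need to change the pattern to ([A-Z2]{2})([\dB]{5}) here. Note the capturing parentheses, which create two subgroups. To normalize each of them you need two separate replacements, because you do not seem to want digits replaced with letters in the numeric part (\d{5}), nor letters replaced with digits in the letter part ([A-Z]{2}).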
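To see why the strict pattern is not enough, here is a minimal sketch that runs both patterns on the sample OCR output from the question. It uses only the standard re module; the variable names strict and relaxed are just illustrative.

```python
import re

ocr_output = "someNoise A2452B7 no1Ze"

# The strict pattern expects two letters followed by five digits,
# so the misread plate "A2452B7" is not found.
strict = re.search(r"[A-Z]{2}\d{5}", ocr_output)

# The relaxed pattern also accepts the look-alike characters:
# '2' among the letters and 'B' among the digits.
relaxed = re.search(r"([A-Z2]{2})([\dB]{5})", ocr_output)

print(strict)            # None
print(relaxed.groups())  # ('A2', '452B7')
```

The two captured groups are still "dirty" at this point; they only tell you which part of the plate each character belongs to, which is what makes a per-group replacement possible.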

So, here is how it can be implemented in Python:

import re

def post_process(pattern, text, ambiguous_dict_1, ambiguous_dict_2):
    # Collect every match of the relaxed pattern in the OCR text
    matches = list(re.finditer(pattern, text))
    if matches:
        # Normalize each capture group with its own translation table
        return [f"{x.group(1).translate(ambiguous_dict_1)}{x.group(2).translate(ambiguous_dict_2)}" for x in matches]
    else:
        return None

ambiguous_dict_1 = {ord('2'): 'Z'} # For the first group
ambiguous_dict_2 = {ord('B'): '8'} # For the second group

ocr_output = "someNoise A2452B7 no1Ze"
result = post_process(r"([A-Z2]{2})([\dB]{5})", ocr_output, ambiguous_dict_1, ambiguous_dict_2)

if result:
    print(result) # ['AZ45287']
else: # result is None
    print("failed to clean output")

# => ['AZ45287']

See the Python demo.

The ambiguous_dict_1 dictionary contains the digit-to-letter replacements, and ambiguous_dict_2 contains the letter-to-digit replacements.
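The same idea can be extended to the other confusion pairs listed in the question (5/S, 1/I, 0/O, 2/Z, 4/A, 8/B). The sketch below is one possible arrangement and not part of the original answer: the names PAIRS, DIGIT_TO_LETTER, LETTER_TO_DIGIT and normalize_plates are hypothetical, the plate format is assumed to be exactly two letters followed by five digits, and the \b word boundaries are an added assumption that the plate is a separate token. str.maketrans builds the same kind of table as the ord(...) dictionaries above.

```python
import re

# Look-alike pairs taken from the question, as (digit, letter).
PAIRS = [("5", "S"), ("1", "I"), ("0", "O"), ("2", "Z"), ("4", "A"), ("8", "B")]

digits = "".join(d for d, _ in PAIRS)
letters = "".join(l for _, l in PAIRS)

# Translation tables: one for the letter part, one for the digit part.
DIGIT_TO_LETTER = str.maketrans(digits, letters)
LETTER_TO_DIGIT = str.maketrans(letters, digits)

# Relaxed pattern: each part also accepts the look-alikes of the other kind.
PLATE = re.compile(rf"\b([A-Z{digits}]{{2}})([\d{letters}]{{5}})\b")

def normalize_plates(text):
    """Return all plate candidates in text with look-alikes corrected."""
    return [
        m.group(1).translate(DIGIT_TO_LETTER) + m.group(2).translate(LETTER_TO_DIGIT)
        for m in PLATE.finditer(text)
    ]

print(normalize_plates("someNoise A2452B7 no1Ze"))  # ['AZ45287']
```

Keep in mind that the wider the character classes are, the more noise they can match: with all six pairs enabled, a purely numeric token of seven digits such as 5012345 would also be accepted and rewritten to SO12345, so it is probably worth enabling only the confusions you actually observe.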

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/65994204
