Q: tesseract fails to pick up characters on the right side of the page

Stack Overflow user

Asked 2020-06-06 04:43:04

Answers: 3 | Views: 564 | Followers: 0 | Votes: 2

When iterating over the pages of a PDF, tesseract recognizes the characters on one page correctly, for example:

Table 1 Summary Data                    3
Table 2 Unique  Data                    5

But on another page

Table 3  Reservoir Data                 8
Table 4  Surface Data                   9

it drops the last number on each line, so the output looks like

Table 3  Reservoir Data                
Table 4  Surface Data

The digits 8 and 9 are not recognized. I checked the image created by pdf2image

pages = convert_from_path(pdf_path, 500)

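and the rightmost text is present in the page image.

However, the dataframe (df) produced by the code below contains no data for the far right of the page, not even characters that look like a failed recognition attempt. The two PDF pages and their images are of the same quality, and the rightmost characters sit at the same horizontal position on both.

This is the code I am using: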

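    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path
    from pytesseract import Output

    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
    for pdf_path in pdfs:  # pdfs: list of PDF file paths, defined elsewhere
        pages = convert_from_path(pdf_path, 500)  # second argument is the DPI

        for pageNum, imgBlob in enumerate(pages):
            if pageNum == 6:  # inspect only the problem page (0-based index)
                d = pytesseract.image_to_data(imgBlob, config=custom_config, output_type=Output.DICT)
                df = pd.DataFrame(d)

                print(pageNum)
                print(df)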

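I wondered whether there is a horizontal limit or boundary beyond which tesseract cannot read, so I changed the DPI to 400 (I am assuming the 500 is the DPI), without success. Searching Google for terms such as crop, margin, or skip turned up nothing relevant.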
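
One way to test the "horizontal limit" idea is to measure how far to the right the recognized words actually reach. The following is only a sketch: the helper names (rightmost_text_edge, coverage_ratio, report_page) are hypothetical, and it assumes the dictionary returned by pytesseract.image_to_data with output_type=Output.DICT, which contains parallel text, left, width and conf lists.

```python
from typing import Dict, List, Optional


def rightmost_text_edge(data: Dict[str, List], min_conf: float = 0.0) -> Optional[int]:
    """Return the largest right edge (left + width) of any non-empty word.

    Structural rows (page, block, line) have empty text and conf -1,
    so they are skipped. Returns None when no word was recognized.
    """
    edges = []
    for text, left, width, conf in zip(
        data["text"], data["left"], data["width"], data["conf"]
    ):
        # conf may arrive as int, float or str depending on the version
        if str(text).strip() and float(conf) >= min_conf:
            edges.append(int(left) + int(width))
    return max(edges) if edges else None


def coverage_ratio(data: Dict[str, List], image_width: int) -> float:
    """Fraction of the image width reached by recognized text (0.0 if none)."""
    edge = rightmost_text_edge(data)
    return 0.0 if edge is None else edge / image_width


def report_page(image, config: str) -> float:
    """Run Tesseract on a PIL image and return its coverage ratio.

    pytesseract is imported here so the helpers above can be used
    (and tested) on a saved dictionary without Tesseract installed.
    """
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, config=config, output_type=Output.DICT)
    return coverage_ratio(data, image.width)
```

If report_page returns a clearly smaller ratio for the problem page than for the page that works, the right-hand column was probably never segmented as text at all, which would point at page segmentation rather than at the DPI.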

tesseract

python-tesseract

python

ocr

3 Answers

Stack Overflow user

Answered 2020-06-10 16:59:09

Check whether a different page segmentation mode gives better results:

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 6 -l eng+ita'

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
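
To compare several of the modes listed above on the same page, a loop like the following may help. It is a sketch, not part of the original answer: build_config, count_words and compare_psm are hypothetical helper names, the config string simply reuses the flags from the question, and the default mode list is an arbitrary choice. Modes that involve OSD (such as 12) may additionally require the osd traineddata file to be installed.

```python
from typing import Dict, Iterable, List


def build_config(psm: int, oem: int = 1, langs: str = "eng+ita") -> str:
    """Build the Tesseract config string from the question for a given --psm."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"-c preserve_interword_spaces=1 --oem {oem} --psm {psm} -l {langs}"


def count_words(data: Dict[str, List]) -> int:
    """Count non-empty entries in the 'text' list of an image_to_data dict."""
    return sum(1 for text in data["text"] if str(text).strip())


def compare_psm(image, modes: Iterable[int] = (4, 6, 11, 12)) -> Dict[int, int]:
    """Run OCR once per mode and return {psm: number of words found}.

    pytesseract is imported lazily so the helpers above stay usable
    without Tesseract installed.
    """
    import pytesseract
    from pytesseract import Output

    results = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image, config=build_config(psm), output_type=Output.DICT
        )
        results[psm] = count_words(data)
    return results
```

Calling compare_psm(pages[6]) on the problem page would show which modes find more words; a higher count is only a hint, so the recognized text should still be checked for the missing page numbers.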

Votes: 2

Stack Overflow user

Answered 2020-10-12 16:33:22

I ran into the same problem with Tesseract 4, and @K41F4r's solution worked for me with the page segmentation mode set to 12 (sparse text).

Votes: 0

Stack Overflow user

Answered 2021-06-14 21:11:50

This is a page segmentation mode issue. --psm 3 fails to detect sparse characters in the image. Use --psm 6, 11, or 12.
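
Because the sparse-text modes return text "in no particular order", the words of a table-of-contents row may need to be put back together by position. The sketch below groups words from an image_to_data dictionary into rows by their vertical centre and sorts each row left to right. The function name rebuild_lines and the 15-pixel tolerance are assumptions chosen for illustration; the tolerance would need tuning to the font size and DPI.

```python
from typing import Dict, List


def rebuild_lines(data: Dict[str, List], y_tolerance: float = 15) -> List[str]:
    """Group recognized words into text rows using their bounding boxes.

    Words whose vertical centres lie within y_tolerance pixels of the
    first word of a row are placed in that row; rows are returned top
    to bottom, with words ordered left to right.
    """
    words = []
    for text, left, top, height in zip(
        data["text"], data["left"], data["top"], data["height"]
    ):
        cleaned = str(text).strip()
        if cleaned:  # skip structural rows with empty text
            centre = int(top) + int(height) / 2
            words.append((centre, int(left), cleaned))

    words.sort()  # top to bottom, then left to right

    rows = []  # each entry: [row_centre, [(left, text), ...]]
    for centre, left, text in words:
        if rows and abs(centre - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append([centre, [(left, text)]])

    return [" ".join(text for _, text in sorted(row[1])) for row in rows]
```

With a dictionary obtained using --psm 11 or 12, rebuild_lines(d) would return one string per visual row, so the page number on the right ends up on the same line as its table title.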

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/62223815
