Tesseract training: fine-tuning for a new character

Asked by a Stack Overflow user on 2019-09-27 07:08:23
Answers: 1 · Views: 4.7K · Followers: 0 · Votes: 3

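I want to train my existing model for a new character. I have already tried the approach described at

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line

(the "Fine Tuning for ± a few characters" section; I am on a Mac)

but it does not work. When I evaluate the model, even on the training data, it does not recognize the ± character.

I have installed: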

    tesseract 5.0.0-alpha-447-g52cf
     leptonica-1.78.0
      libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
     Found AVX2
     Found AVX
     Found FMA
     Found SSE
     Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6

I got there by cloning the following GitHub repositories to my desktop and installing tesseract:

https://github.com/tesseract-ocr/tesseract.git

https://github.com/tesseract-ocr/langdata_lstm

https://github.com/tesseract-ocr/tessdata_best

My installation steps were as follows.

Install:

    brew install automake autoconf autoconf-archive libtool
    brew install pkgconfig
    brew install icu4c
    brew install leptonica
    brew install gcc

    ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c

Change into the cloned tesseract folder:

    cd ~/Desktop/tesseract

Run autogen.sh:

    ./autogen.sh

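Install the dependencies:

    brew install cairo pango icu4c autoconf libffi libarchive libpng
    # Point pkg-config at the Homebrew-installed libraries.
    BREW_PREFIX="$(brew --prefix)"
    export PKG_CONFIG_PATH="$BREW_PREFIX/lib/pkgconfig"
    PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$BREW_PREFIX/opt/libarchive/lib/pkgconfig"
    PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$BREW_PREFIX/opt/icu4c/lib/pkgconfig"
    PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$BREW_PREFIX/opt/libffi/lib/pkgconfig"
    PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$BREW_PREFIX/opt/libpng/lib/pkgconfig"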

(If some of these are already installed, use reinstall instead of install.)
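
A wrong entry in PKG_CONFIG_PATH tends to fail silently until configure cannot find a library, so it can help to confirm that every entry is an existing directory. The following is a minimal sketch of such a check; missing_path_entries is a hypothetical helper name, not part of Homebrew or pkg-config.

```shell
# Print every entry of a colon-separated path list that is not an existing directory.
# missing_path_entries is a hypothetical helper for this sketch.
missing_path_entries() {
  # Split the list on ':' into one entry per line, then test each entry.
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r dir; do
    [ -n "$dir" ] || continue              # skip empty entries
    [ -d "$dir" ] || printf '%s\n' "$dir"  # report entries that are not directories
  done
}
```

Calling missing_path_entries "$PKG_CONFIG_PATH" before ./configure prints nothing when every entry exists.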

Run configure:

    ./configure

Install tesseract:

    make
    sudo make install

Install the training tools:

    make training
    sudo make training-install

After that, I copied eng.traineddata from tessdata_best into tesseract/tessdata.

My training code is as follows:

    # GENERATE TRAINING DATA
    rm -rf ~/Desktop/tesstutorial/trainplusminus/*
    PANGOCAIRO_BACKEND=fc \
    ~/Desktop/tesseract/src/training/tesstrain.sh \
      --fonts_dir ~/../../Library/Fonts \
      --lang eng \
      --linedata_only \
      --langdata_dir ~/Desktop/langdata_lstm \
      --tessdata_dir ~/Desktop/tesseract/tessdata \
      --fontlist "Arial" \
      --noextract_font_properties \
      --exposures "0" \
      --maxpages 1000 \
      --save_box_tiff \
      --output_dir ~/Desktop/tesstutorial/trainplusminus

    # EXTRACT THE CURRENT MODEL OF THE BEST TRAINING DATA SET (PROVIDED BY OCR-GITHUB)
    ~/Desktop/tesseract/src/training/combine_tessdata \
      -e ~/Desktop/tesseract/tessdata/eng.traineddata \
      ~/Desktop/tesstutorial/trainplusminus/eng.lstm

    # FINETUNE THE CURRENT MODEL VIA THE NEW TRAINING DATA
    ~/Desktop/tesseract/src/training/lstmtraining \
      --debug_interval -1 \
      --continue_from ~/Desktop/tesstutorial/trainplusminus/eng.lstm \
      --model_output ~/Desktop/tesstutorial/trainplusminus/plusminus \
      --traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
      --old_traineddata ~/Desktop/tesseract/tessdata/eng.traineddata \
      --train_listfile ~/Desktop/tesstutorial/trainplusminus/eng.training_files.txt \
      --max_iterations 5000

    # COMBINE THE NEW BEST TRAINING DATA
    lstmtraining --stop_training \
      --continue_from ~/Desktop/tesstutorial/trainplusminus/plusminus_checkpoint \
      --traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
      --old_traineddata ~/Desktop/tesseract/tessdata/eng.traineddata \
      --model_output ~/Desktop/tesstutorial/trainplusminus/eng.traineddata

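I do not know why this code does not produce the result I expect. I tried training a new font, and the code above worked. To fine-tune for the new character, the only thing I changed was adding text to langdata_lstm/eng/eng.training_text: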

    alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
    TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
    Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
    VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
    PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
    Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
    Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
    Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
    Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
    United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
    Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
    Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
    netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED

Thanks for your help!

Dustin

1 Answer

Answered by a Stack Overflow user on 2020-03-20 14:58:31

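If the eng.traineddata file you get after training works fine for all other characters and digits, and the only problem is that it does not recognize the "±" symbol you just tried to add, then try the following:

  • Increase --max_iterations.

  • Make sure "±" is present in the eng.charset_size=xx and eng.unicharset files.

  • In the langdata_lstm/eng/eng.training_text file, use about 2000 lines containing "±"; the symbol should occur at least 3000 times in total for fine-tuning to pick up the new character.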
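
The unicharset check can be scripted. The sketch below assumes the usual unicharset layout, in which the first line holds the entry count and every following line starts with the character itself followed by its properties; unicharset_has_symbol is a hypothetical helper name, and the path in the usage note is an assumption about where tesstrain.sh wrote its output.

```shell
# Succeed (exit status 0) if a symbol has its own entry in a unicharset file.
# unicharset_has_symbol is a hypothetical helper for this sketch.
unicharset_has_symbol() {
  # Skip the first line (the entry count) and compare the first field of each entry.
  awk -v sym="$1" 'NR > 1 && $1 == sym { found = 1 } END { exit !found }' "$2"
}
```

For example:

    unicharset_has_symbol '±' ~/Desktop/tesstutorial/trainplusminus/eng/eng.unicharset && echo present

If nothing is printed, the symbol did not make it into the character set, and further training iterations alone are unlikely to help.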

Hope this helps. Thanks for your question, it helped me too. :)

Votes: 1
Original page content provided by Stack Overflow. Original link:

https://stackoverflow.com/questions/58129505
