
tesseract is an old commercial OCR system released as open source and revived by google

tesseract 4 adds a long short-term memory (LSTM) neural network, which removes the ceiling on text recognition accuracy that the old recognition method had

google has private internal tools and training sets that they don't release to the public

they probably use a lot of commercial material for it, and the training sets needed for neural network training are so large that releasing that much from any single source would not be fair use

how to extend existing tesseract 4 traineddata files to recognize new characters or fonts

overview of files

download a binary release of tesseract

copy tesseract/tessdata/tessconfigs/lstm.train to tesseract/lstm.train so that it exists next to tesseract/tesseract.exe (or without the .exe if you're on a unix-like OS)
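a minimal sketch of that copy step as a shell function. the function name install_lstm_train and its argument are made up for illustration; the paths are the ones given above, relative to wherever the release was unpacked.

```shell
# hypothetical helper: copy lstm.train so it sits next to the tesseract binary.
# $1 is the directory the binary release was unpacked into
# (assumed to be "tesseract" when no argument is given).
install_lstm_train() {
    cp "${1:-tesseract}/tessdata/tessconfigs/lstm.train" \
       "${1:-tesseract}/lstm.train"
}
```

call it as install_lstm_train path/to/tesseract, or with no argument from the directory that contains tesseract/.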

you need some scripts that are typically excluded from binary releases of tesseract. as of right now, these are just the .sh files from https://github.com/tesseract-ocr/tesseract/tree/master/training

tesstrain_utils.sh has an inherent flaw: you can only create a single page of text per font, not multiple pages. this is one of the things that suggests google has training tools that they don't release.

modified tesstrain_utils.sh: https://pastebin.com/raw/7M2GftvK

modified tesstrain.sh: https://pastebin.com/raw/uGTwyf9Q

this modification adds a --textlist flag that you can use to feed it multiple text files at once. each text file should only contain around 40 lines of text, after line wrapping, or else it probably won't all fit in the page. if only a little is chopped off it's not a big deal though.
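one way to produce those roughly 40-line files from a single big text file is sketched below. this assumes the lines in the source text are already short enough that they won't wrap on the page, and it relies on GNU split (the -d and --additional-suffix options are GNU coreutils extensions). the function name and the tex prefix are just illustrative choices.

```shell
# hypothetical helper: cut one big text file into 40-line chunks
# named tex00.txt, tex01.txt, ... inside the given directory.
# $1: source text file, $2: directory to write the chunks into
split_training_text() {
    mkdir -p "$2"
    # -l 40: 40 lines per chunk; -d: numeric suffixes (GNU split only)
    split -l 40 -d --additional-suffix=.txt "$1" "$2/tex"
}
```

the resulting files can then be listed after --textlist.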

making training data

invoke tesstrain similar to the following from a bash shell (I use a bastardized version of git bash with parts of vanilla mingw-w64 thrown in):

training/tesstrain.sh --fonts_dir /c/windows/fonts/ --linedata_only --noextract_font_properties --lang jpn --langdata_dir train/langdata/ --tessdata_dir ./tessdata --output_dir train --fontlist "Meiryo UI" "Meiryo UI Bold" --exposures "0" --textlist "train/tex1.txt" "train/tex2.txt"

if it starts up very slowly, override the fontconfig temp directory to a non-temp directory using the appropriate command line argument. yes, it uses fontconfig, and it has the same "generating font cache" problem that gimp and vlc suffer from on windows.

tesstrain.sh has a flag to list all fonts on the system using the names under which it accepts them. if a given font prints out a unicode name in this view, then that font cannot be used in this step on windows, because the commands tesstrain.sh calls are not aware of _wfopen().
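if the font listing is long, a filter like the following can pick out the names that contain non-ASCII bytes. this is only a sketch: it assumes you pipe or redirect the listing into it as plain text, and the function name is made up here.

```shell
# hypothetical helper: read a font listing on stdin and print only
# the lines containing bytes outside printable ASCII (space through ~).
# LC_ALL=C makes grep compare raw bytes; -a stops it from treating
# the input as a binary file.
non_ascii_font_names() {
    LC_ALL=C grep -a '[^ -~]'
}
```

any font whose line shows up in the output is one to avoid for this step on windows. note that lines containing tab characters will also be printed.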

--langdata_dir should point to a directory that contains these files

Han.unicharset Han.xheights Hiragana.unicharset Hiragana.xheights Katakana.unicharset Katakana.xheights Latin.unicharset Latin.xheights radical-stroke.txt

and any others you figure out tesseract wants to be there for some reason
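a quick sanity check for that directory can be sketched like this. the file list is exactly the one above; the function name is hypothetical, and it won't know about any extra files tesseract turns out to want.

```shell
# hypothetical helper: report which of the files listed above are
# missing from a langdata directory. returns non-zero if any are.
# $1: the directory you plan to pass as --langdata_dir
check_langdata() {
    missing=0
    for f in Han.unicharset Han.xheights \
             Hiragana.unicharset Hiragana.xheights \
             Katakana.unicharset Katakana.xheights \
             Latin.unicharset Latin.xheights \
             radical-stroke.txt; do
        if [ ! -f "$1/$f" ]; then
            echo "missing: $1/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

for example: check_langdata train/langdata && echo ok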

--tessdata_dir needs a dot at the start if it's a relative path on my machine for some reason, not sure why

--output_dir is where a fuckton of .lstmf files go, as well as the important output_dir/jpn.training_files.txt and output_dir/jpn/jpn.traineddata.

you need to create output_dir and any other custom directories first or they can't be written to by any commands on this page
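a sketch of creating them up front. the directory names are the ones used in the example commands on this page (train as the output directory, zvfe for checkpoints, out for the final file); swap in your own. the function name is made up.

```shell
# hypothetical helper: create the working directories used by the
# example commands on this page. mkdir -p does nothing for
# directories that already exist, so re-running it is harmless.
make_training_dirs() {
    mkdir -p train zvfe out
}
```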

after this, extract jpn.lstm from the existing jpn.traineddata with combine_tessdata.exe, using either -u (unpack every component) or -e (extract only the component you name)
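the -e form can be sketched as below. with -e, combine_tessdata picks the component to extract from the extension of the output file name, so the output has to end in .lstm. the function name is hypothetical, and the binary is called as combine_tessdata on the assumption that your shell resolves it to combine_tessdata.exe on windows.

```shell
# hypothetical helper: pull just the LSTM model out of a traineddata file.
# $1: existing traineddata (e.g. path/to/old/jpn.traineddata)
# $2: output file; must end in .lstm (e.g. path/to/old/jpn.lstm)
extract_lstm() {
    combine_tessdata -e "$1" "$2"
}
```

for example: extract_lstm path/to/old/jpn.traineddata path/to/old/jpn.lstm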

you're finally ready to fine tune the existing jpn.traineddata to your new training data

training

create a directory like zvfe to store training checkpoints in

lstmtraining.exe --model_output zvfe/working --continue_from path/to/old/jpn.lstm --old_traineddata path/to/old/jpn.traineddata --traineddata output_dir/jpn/jpn.traineddata --train_listfile output_dir/jpn.training_files.txt --eval_listfile output_dir/jpn.training_files.txt

output_dir/jpn/jpn.traineddata is the jpn/jpn.traineddata that was previously created by tesstrain.sh

to make sure things work, after it gives the first message saying that it wrote a checkpoint, kill it with ctrl+c and run a command like this:

lstmtraining.exe --stop_training --continue_from zvfe/working_checkpoint --traineddata output_dir/jpn/jpn.traineddata --model_output out/jpn.traineddata

make out/ or your chosen output directory first
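the two steps above (make the directory, then convert the checkpoint) can be wrapped together roughly like this. it only uses the flags and paths from the command above; the function name is made up, and lstmtraining is called without .exe on the assumption that your shell resolves it.

```shell
# hypothetical helper: turn the latest checkpoint into a traineddata file.
# paths are the ones used in the example commands on this page.
finalize_checkpoint() {
    mkdir -p out    # the output directory has to exist first
    lstmtraining --stop_training \
        --continue_from zvfe/working_checkpoint \
        --traineddata output_dir/jpn/jpn.traineddata \
        --model_output out/jpn.traineddata
}
```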

if this works, you're ready to do a long training session: execute the original lstmtraining command again and let it run for an hour or two, then repeat the --stop_training command to turn the latest checkpoint into a traineddata file

if you experience problems with lstmtraining getting stuck in bad local minima, shuffle output_dir/jpn.training_files.txt into a random order. most unix-like shell environments provide the "shuf" command for this.
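a sketch of shuffling the list in place. it writes to a temporary file next to the original and then renames it, so the list is never read and overwritten at the same time. the function name and the .tmp suffix are illustrative.

```shell
# hypothetical helper: shuffle the lines of a file in place.
# $1: the list file, e.g. output_dir/jpn.training_files.txt
shuffle_list() {
    shuf "$1" > "$1.tmp.$$" && mv "$1.tmp.$$" "$1"
}
```

for example: shuffle_list output_dir/jpn.training_files.txt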

if you want to train for vertical text, use jpn_vert.traineddata and pass the appropriate flags to tesstrain.sh when making training data. i have not tested this. good luck.

want to train off of existing annotated images? don't bother. they need to be annotated per-character and have character sequences associated into lines. this was fine for the small amount of data needed to train tesseract 3, but tesseract 4 needs way more. google supposedly has a trainer that can handle pure lines of text without defined character boundaries, but the existing public patch that does that is apparently unstable. you can find it somewhere hidden among tesseract-ocr's github issues.

Tesseract (last edited 2018-01-17 07:55:36 by weh)