tesseract is an old commercial OCR system that was released as open source and revived by google
tesseract 4 adds a long short-term memory (LSTM) neural network engine, removing the accuracy ceiling that the old character-by-character recognition method had
google has private internal training tools and training sets that they don't release to the public
they probably pull a bunch of commercial material into those sets, and the training sets required for neural network training are so large that releasing the amount taken from any individual source would not be fair use
how to extend existing tesseract 4 traineddata files to recognize new characters or fonts
overview of files
download a binary release of tesseract
copy tesseract/tessdata/tessconfigs/lstm.train to tesseract/lstm.train so that it exists next to tesseract/tesseract.exe (or without the .exe if you're on a unix-like OS)
you need some scripts that are typically excluded from binary releases of tesseract. as of right now, that means only the .sh scripts from https://github.com/tesseract-ocr/tesseract/tree/master/training
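the setup above can be sketched like this. the directory names are the illustrative ones used later in these notes (train/, output_dir/, zvfe/, out/); adjust the lstm.train source path to wherever your binary release unpacked:

```shell
# create the working directories up front; none of the later commands
# will create them for you
mkdir -p train/langdata output_dir zvfe out
# lstm.train has to sit next to the tesseract binary; if the copy fails,
# the source path doesn't match your install and you'll need to do it by hand
cp tessdata/tessconfigs/lstm.train . 2>/dev/null || echo "copy lstm.train by hand"
```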
making training data
the patched tesstrain.sh used below adds a new flag, --textlist, and raises the page limit from 3 to 100. you might need to raise it further, to 1000 or so, if your input texts are extremely long.
invoke tesstrain similar to the following from a bash shell (I use a bastardized version of git bash with parts of vanilla mingw-w64 thrown in):
training/tesstrain.sh --fonts_dir /c/windows/fonts/ --linedata_only --noextract_font_properties --lang jpn --langdata_dir train/langdata/ --tessdata_dir ./tessdata --output_dir train --fontlist "Meiryo UI" "Meiryo UI Bold" --exposures "0" --textlist "train/tex1.txt" "train/tex2.txt"
if it starts up very slowly, point the fontconfig cache at a non-temp directory by modifying the FONT_CONFIG_CACHE="/c/users/wareya/bogus_fcfg/" line. yes, it uses fontconfig, and it has the same "generating font cache" problem that gimp and vlc suffer from on windows.
tesstrain.sh has a flag to list all fonts on the system under the names it accepts them by. if a given font prints a name containing non-ASCII characters in this view, that font cannot be used in this step on windows, because the commands tesstrain.sh calls are not aware of _wfopen().
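as a hypothetical helper, if you save that font listing to a file (fonts.txt here is an assumed name), you can filter it down to just the names that won't work on windows:

```shell
# fonts.txt: the saved font listing, one name per line (assumed filename).
# in the C locale, [^ -~] matches any byte outside printable ASCII, so this
# prints only the font names containing non-ASCII characters
if [ -f fonts.txt ]; then
    LC_ALL=C grep '[^ -~]' fonts.txt
fi
```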
--langdata_dir should point to a directory that contains these files
Han.unicharset Han.xheights Hiragana.unicharset Hiragana.xheights Katakana.unicharset Katakana.xheights Latin.unicharset Latin.xheights radical-stroke.txt
and any others you figure out tesseract wants to be there for some reason
--tessdata_dir needs a dot at the start if it's a relative path, at least on my machine; not sure why
--output_dir is where a fuckton of .lstmf files go, as well as the important output_dir/jpn.training_files.txt and output_dir/jpn/jpn.traineddata.
you need to create output_dir and any other custom directories beforehand; none of the commands on this page will create them for you
after this, extract jpn.lstm from the existing jpn.traineddata with combine_tessdata.exe: -u unpacks every component, while -e extracts just the ones you name, e.g. combine_tessdata.exe -e path/to/old/jpn.traineddata path/to/old/jpn.lstm
you're finally ready to fine tune the existing jpn.traineddata to your new training data
create a directory like zvfe to store training checkpoints in
lstmtraining.exe --model_output zvfe/working --continue_from path/to/old/jpn.lstm --old_traineddata path/to/old/jpn.traineddata --traineddata output_dir/jpn/jpn.traineddata --train_listfile output_dir/jpn.training_files.txt --eval_listfile output_dir/jpn.training_files.txt
output_dir/jpn/jpn.traineddata is the jpn/jpn.traineddata that was previously created by tesstrain.sh
to make sure things work, after it gives the first message saying that it wrote a checkpoint, kill it with ctrl+c and run a command like this:
lstmtraining.exe --stop_training --continue_from zvfe/working_checkpoint --traineddata output_dir/jpn/jpn.traineddata --model_output out/jpn.traineddata
make out/ or your chosen output directory first
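as a quick sanity check (assuming the tesseract binary is on your PATH — adjust if not), you can confirm the finalized file is loadable before committing to a long run:

```shell
# list the languages tesseract can see in out/; jpn should appear if
# out/jpn.traineddata was written correctly. guarded so this is a no-op
# when tesseract isn't on the PATH
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs --tessdata-dir out
fi
```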
if this works, you're ready for a long training session: execute the original lstmtraining command again and let it run for an hour or two
if lstmtraining keeps getting stuck in bad local minima, shuffle output_dir/jpn.training_files.txt into a random order. most unix-like shell environments provide the "shuf" command for this.
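for example, with GNU shuf (it reads the whole input before writing anything, so -o can safely point back at the same file):

```shell
# shuffle the line order of the training file list in place
listfile=output_dir/jpn.training_files.txt
if [ -f "$listfile" ]; then
    shuf "$listfile" -o "$listfile"
fi
```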
it might also be a good idea to include some pages of every kanji, kana, and punctuation mark you care about, to make it less likely that fine-tuning on your training data renders them unrecognizable
if you want to train for vertical text, use jpn_vert.traineddata and pass the appropriate flags to tesstrain.sh when making training data. i have not tested this. good luck.
want to train off of existing annotated images? don't bother. they need to be annotated per-character, with the character sequences grouped into lines. that was fine for the small amount of data tesseract 3 needed, but tesseract 4 needs way more. google supposedly has a trainer that can handle plain lines of text without defined character boundaries, but the existing public patch that does the same is apparently unstable; you can find it buried somewhere among tesseract-ocr's github issues.