Tesseract Training – My Projects & Live

OK. This is really hard, and I don't yet have enough knowledge to do the training process. I have a lot to learn before I understand the whole process, and I need to spend more time on it. I have spent two days on this so far.

The test files are in the ~/Documents/cpp/tesseract/testing/ directory. First I need to convert the PDF file (export1.pdf) to a TIFF image at a density of 500 dpi, with quality set to 100:

teddy@teddy-K43SJ:~/Documents/cpp/tesseract/testing$ convert -density 500 export1.pdf -quality 100 export1.tif


I got the TIFF image export1.tif, which is more than 190MB! I don't know whether this is correct. I sliced export1.tif into 14 smaller TIFF images to use as samples for training Tesseract to recognize the text in them. Then I used LIOS (Application -> Graphics -> Tesseract-Trainer) to train on the 14 images and make the boxes for them. I read in another article that a good result needs more than 10 samples. I have 14, so I think that should be enough.

The training process is really tedious. I have to check each character and make sure every one is correct. Many times it showed the wrong characters and the wrong box selection. I NEED TO WRITE A TUTORIAL ON HOW TO TRAIN IMAGES WITH LIOS!

In the end it failed miserably when I tested with the trained data. The trained data is in the /usr/local/share/tessdata/ directory, and the latest one is ‘train4’. I ran two tests. First, I used the train4 data:

teddy@teddy-K43SJ:~/Documents/cpp/tesseract/testing$ tesseract export1.tif export1_tif_out -l train4
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine vbd45b3a with Leptonica
Warning in pixReadFromTiffStream: bpp = 64; stripping 16 bit rgb samples down to 8
Page 1
Detected 731 diacritics
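
The warning above (bpp = 64; stripping 16 bit rgb samples down to 8) suggests the TIFF stores 16-bit samples per channel, which probably explains a good part of the 190MB. Below is a rough sketch of a lighter conversion with ImageMagick. The 300 dpi density, the LZW compression and the output name export1_gray.tif are my assumptions, not something I have tested on this PDF, so the numbers may need tuning.

```shell
#!/bin/sh
# Sketch: rasterize the PDF as 8-bit grayscale instead of 16-bit color.
# Assumptions (not from the original run): 300 dpi, LZW compression,
# and the output file name export1_gray.tif.
input="export1.pdf"
output="export1_gray.tif"
density=300

if command -v convert >/dev/null 2>&1 && [ -f "$input" ]; then
    # -background white -alpha remove flattens any transparency onto white;
    # -colorspace Gray -depth 8 drops color and the 16-bit samples.
    convert -density "$density" "$input" \
        -background white -alpha remove \
        -colorspace Gray -depth 8 -compress lzw \
        "$output"
    ls -lh "$output"
else
    echo "skipped: convert or $input not found"
fi
```

If the text comes out too small for Tesseract, raising the density is the first thing to try.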


The result is a SHAME! Then I ran the default ‘eng’ trained data for comparison:

teddy@teddy-K43SJ:~/Documents/cpp/tesseract/testing$ tesseract export1.tif export1_tif_eng_out -l eng
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine vbd45b3a with Leptonica
Warning in pixReadFromTiffStream: bpp = 64; stripping 16 bit rgb samples down to 8
Page 1
Detected 731 diacritics
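
Judging the two results by eye is subjective, so here is a small sketch for comparing them in numbers. Tesseract writes its text to the output base name plus .txt, so the two runs above should have produced export1_tif_out.txt and export1_tif_eng_out.txt. The helper compare_ocr is a hypothetical name of mine, not a Tesseract tool.

```shell
#!/bin/sh
# Sketch: compare two OCR text outputs.
# compare_ocr is a hypothetical helper, not part of Tesseract.
compare_ocr() {
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "missing input file" >&2
        return 2
    fi
    # Word counts give a quick feel for how much text each run found.
    echo "words in $1: $(wc -w < "$1" | tr -d ' ')"
    echo "words in $2: $(wc -w < "$2" | tr -d ' ')"
    # In plain diff output, lines starting with < or > are the differing lines.
    changed=$(diff "$1" "$2" | grep -c '^[<>]' || true)
    echo "differing lines: $changed"
}

compare_ocr export1_tif_out.txt export1_tif_eng_out.txt \
    || echo "run both tesseract commands first"
```

This only says how much the two outputs differ from each other. To know which one is actually closer to the truth, I would still need a hand-typed reference text to diff against.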


The result is better BUT STILL CRAP!
In the /usr/local/share/tessdata/ directory I see many files related to ‘eng’:

/usr/local/share/tessdata/eng.cube.bigrams
/usr/local/share/tessdata/eng.cube.fold
/usr/local/share/tessdata/eng.cube.lm
/usr/local/share/tessdata/eng.cube.nn
/usr/local/share/tessdata/eng.cube.params
/usr/local/share/tessdata/eng.cube.size
/usr/local/share/tessdata/eng.cube.word-freq
/usr/local/share/tessdata/eng.tesseract_cube.nn
/usr/local/share/tessdata/eng.traineddata
/usr/local/share/tessdata/eng.user-patterns
/usr/local/share/tessdata/eng.user-words


BUT I ONLY HAVE ONE FILE:

/usr/local/share/tessdata/train4.traineddata
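
A note on the file count: a .traineddata file is itself a bundle of several components, and as far as I understand the eng.cube.* files belong to the separate Cube engine, so having a single train4.traineddata may not be the problem by itself. A sketch for looking inside it with combine_tessdata -u follows. It assumes combine_tessdata was installed together with tesseract, and the output directory name is my own choice.

```shell
#!/bin/sh
# Sketch: unpack train4.traineddata to inspect its components.
# Assumptions: combine_tessdata is on the PATH; the output directory
# (train4_parts under the temp dir) is a name picked for illustration.
tessdata="/usr/local/share/tessdata"
lang="train4"
outdir="${TMPDIR:-/tmp}/${lang}_parts"

if command -v combine_tessdata >/dev/null 2>&1 \
        && [ -f "$tessdata/$lang.traineddata" ]; then
    mkdir -p "$outdir"
    # -u writes each component as <prefix><component>,
    # for example train4.unicharset.
    combine_tessdata -u "$tessdata/$lang.traineddata" "$outdir/$lang."
    ls -l "$outdir"
else
    echo "skipped: combine_tessdata or $lang.traineddata not found"
fi
```

Running the same thing with lang="eng" and comparing the two listings should show whether train4 is missing components that eng has.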


I THINK I DIDN’T PREPARE THE TRAINING CORRECTLY, OR MAYBE I MISSED SOMETHING?
SOME TUTORIALS TO READ:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00
http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
