OK. This is really hard, and I don't have enough knowledge yet to do the training process. I have a lot to learn before I understand the whole thing, and I need to spend more time on it. I have already spent two days on this.
The testing files are in the ~/Documents/cpp/tesseract/testing/ directory. First I need to convert the PDF file (export1.pdf) to a TIFF image at a density of 500 dpi, keeping the quality at 100%:
teddy@teddy-K43SJ:~/Documents/cpp/tesseract/testing$ convert -density 500 export1.pdf -quality 100 export1.tif
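As a sanity check on what this command produces, the size of an uncompressed TIFF can be estimated from the page size and the density. The sketch below is only a rough estimate under assumptions that are not stated in the post: a single A4 page (8.27 x 11.69 in), no compression, and 64 bits per pixel (the figure that the tesseract warning further down reports for this file).

```shell
# Assumptions (not from the post): one A4 page, no compression,
# 64 bits (8 bytes) per pixel as reported by the later tesseract warning.
dpi=500

# Page size in hundredths of an inch, so the shell can stay in integer math.
width_px=$(( 827 * dpi / 100 ))
height_px=$(( 1169 * dpi / 100 ))

bytes_per_pixel=8
size_bytes=$(( width_px * height_px * bytes_per_pixel ))
size_mb=$(( size_bytes / 1000000 ))

echo "${width_px}x${height_px} px, about ${size_mb} MB uncompressed"
```

Under those assumptions the estimate comes out at about 193 MB, so a file of that order is consistent with an uncompressed, 16-bit-per-channel page at 500 dpi rather than with a broken conversion.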
I got the TIFF image export1.tif, and it is more than 190 MB!!! I don't know if this is correct or not. I sliced export1.tif into 14 smaller TIFF images to use as samples for training tesseract to recognize the text in them. Then I used LIOS (Application -> Graphics -> Tesseract-Trainer) to train on the 14 images and make the boxes for them. I read in another article that I need more than 10 samples to get a good result. I have 14, so I thought that would be enough.
This training process is really tedious. I have to check each character and make sure they are all correct. Sometimes (many times) it showed the wrong characters and the wrong box selection. I NEED TO CREATE A TUTORIAL ON HOW TO TRAIN ON IMAGES WITH LIOS!!!
In the end it failed miserably when I tested it with the trained data. The trained data is in the /usr/local/share/tessdata/ directory. The latest one is ‘train4’. I ran two tests. First, I used the train4 data:
teddy@teddy-K43SJ:~/Documents/cpp/tesseract/testing$ tesseract export1.tif export1_tif_out -l train4
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine vbd45b3a with Leptonica
Warning in pixReadFromTiffStream: bpp = 64; stripping 16 bit rgb samples down to 8
Page 1
Detected 731 diacritics
The result is a SHAME! Then I used the default ‘eng’ trained data to compare:
teddy@teddy-K43SJ:~/Documents/cpp/tesseract/testing$ tesseract export1.tif export1_tif_eng_out -l eng
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine vbd45b3a with Leptonica
Warning in pixReadFromTiffStream: bpp = 64; stripping 16 bit rgb samples down to 8
Page 1
Detected 731 diacritics
The result is better BUT STILL CRAP!
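Both runs print the same pixReadFromTiffStream warning about a 64 bpp image, so one thing that might be worth trying is a lighter input image. The following is a minimal sketch, not a verified fix: the settings (300 dpi, 8-bit grayscale, transparency flattened onto white, LZW compression) and the output name export1_gray.tif are assumptions chosen for illustration, and the function name is hypothetical.

```shell
# Assumed settings, not taken from the post: 300 dpi, 8 bits per channel,
# grayscale, transparency flattened onto white, LZW compression.
pdf_to_gray_tiff() {
    # $1 = input PDF, $2 = output TIFF
    # -density comes before the input so it applies while the PDF is rasterized.
    convert -density 300 "$1" \
        -background white -alpha remove \
        -colorspace Gray -depth 8 \
        -compress lzw "$2"
}

# Only run the conversion when ImageMagick and the input file are present.
if command -v convert >/dev/null 2>&1 && [ -f export1.pdf ]; then
    pdf_to_gray_tiff export1.pdf export1_gray.tif
else
    echo "skipped: needs ImageMagick and export1.pdf in the current directory"
fi
```

Running tesseract on export1_gray.tif with both ‘train4’ and ‘eng’ would then show whether the image format plays any part in the poor results.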
I see that in the /usr/local/share/tessdata/ directory there are many files related to ‘eng’:
/usr/local/share/tessdata/eng.cube.bigrams
/usr/local/share/tessdata/eng.cube.fold
/usr/local/share/tessdata/eng.cube.lm
/usr/local/share/tessdata/eng.cube.nn
/usr/local/share/tessdata/eng.cube.params
/usr/local/share/tessdata/eng.cube.size
/usr/local/share/tessdata/eng.cube.word-freq
/usr/local/share/tessdata/eng.tesseract_cube.nn
/usr/local/share/tessdata/eng.traineddata
/usr/local/share/tessdata/eng.user-patterns
/usr/local/share/tessdata/eng.user-words
BUT FOR ‘train4’ I ONLY HAVE ONE FILE:
/usr/local/share/tessdata/train4.traineddata
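To compare the two languages side by side and look inside the single train4 file, a sketch along these lines may help. The helper list_lang_files and the scratch directory are hypothetical names introduced here; combine_tessdata -u is the unpacking mode of the tool that ships with tesseract. My understanding, which may be incomplete, is that the eng.cube.* files belong to the optional Cube engine, while the classic engine reads only the .traineddata file.

```shell
# List every file in a tessdata directory that belongs to one language code.
list_lang_files() {
    # $1 = tessdata directory, $2 = language code such as eng or train4
    for f in "$1/$2".*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && basename "$f"
    done
    return 0
}

tessdata=/usr/local/share/tessdata
list_lang_files "$tessdata" eng
list_lang_files "$tessdata" train4

# Unpack train4.traineddata into a scratch directory to see its components.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$tessdata/train4.traineddata" ]; then
    scratch=$(mktemp -d)
    combine_tessdata -u "$tessdata/train4.traineddata" "$scratch/train4."
    ls "$scratch"
fi
```

Unpacking eng.traineddata the same way and comparing the two component lists would show whether train4 is actually missing a piece that the training steps should have produced.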
I THINK I DIDN’T PREPARE THE TRAINING CORRECTLY, OR MAYBE I MISSED SOMETHING????
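One part of the preparation that can be checked mechanically is the box files. Each line of a tesseract box file has the form "symbol left bottom right top page", so a short awk pass can flag lines with the wrong number of fields or an inverted box. The function name check_box_file and the sample data below are made up for illustration; they are not the boxes LIOS produced.

```shell
# Report lines that do not look like "<symbol> <left> <bottom> <right> <top> <page>".
check_box_file() {
    awk '
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad++
            next
        }
        $2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0 {
            printf "line %d: empty or inverted box for \"%s\"\n", NR, $1
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}

# Hypothetical sample: the third box has left (70) greater than right (60).
sample=$(mktemp)
cat > "$sample" <<'EOF'
T 36 92 53 116 0
e 55 92 66 109 0
s 70 92 60 109 0
EOF

check_box_file "$sample" || echo "box file needs another look"
```

A check like this only catches malformed coordinates. A box that is well-formed but labelled with the wrong character still has to be caught by eye.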
READ SOME TUTORIALS:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00
http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/