Tesseract Training

OK. This is really hard, and I don't yet have enough knowledge to do the training process. I have a lot to learn before I understand the whole process, and I need to spend more time on it. I have already spent two days on this.

The testing files are in the /Documents/cpp/tesseract/testing/ directory. First I need to convert the PDF file (export1.pdf) to a TIFF image at a density of 500 dpi, keeping the quality at 100%.
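
A minimal sketch of how this conversion could be scripted. I am assuming ImageMagick's `convert` is the tool here (the "density" and "quality" wording matches its `-density` and `-quality` options); on ImageMagick 7 the command is usually `magick` instead. The function names and the A4 page size in the size estimate are my own assumptions, not something taken from the test files.

```python
import subprocess


def build_convert_command(pdf_path, tiff_path, dpi=500, quality=100):
    """Return the ImageMagick argument list for rasterizing a PDF to TIFF."""
    # -density goes before the input so it applies while the PDF is rasterized.
    return ["convert", "-density", str(dpi), pdf_path,
            "-quality", str(quality), tiff_path]


def convert_pdf(pdf_path, tiff_path, dpi=500, quality=100):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(pdf_path, tiff_path, dpi, quality),
                   check=True)


def estimate_uncompressed_bytes(width_in, height_in, dpi=500, channels=3, pages=1):
    """Rough uncompressed size for 8-bit channels: pixels * channels * pages."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * pages


if __name__ == "__main__":
    print(" ".join(build_convert_command("export1.pdf", "export1.tif")))
    # Assumption: one A4 page (8.27 x 11.69 inches), 8-bit RGB.
    size = estimate_uncompressed_bytes(8.27, 11.69)
    print(f"about {size / 1e6:.1f} MB per page before compression")
```

Under those assumptions a single page at 500 dpi is already tens of megabytes uncompressed, so a few pages add up to hundreds of megabytes unless the TIFF is compressed.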

I got the TIFF image export1.tif, and it is more than 190 MB! I don't know if this is correct or not. I sliced export1.tif into 14 smaller TIFF images to use as samples for training Tesseract to recognize the text in the images.

Then I used LIOS (Application -> Graphics -> Tesseract-Trainer) to train on the 14 images and make the boxes for them. I read in another article that I need more than 10 samples to get a good result. I have 14, so I think that should be enough.

This training process is really tedious. I have to check each character and make sure they are all correct. Sometimes (many times) it showed the wrong characters and the wrong box selection. I need to write a tutorial on how to train on images with LIOS!

In the end it failed miserably when I tested with the trained data. The trained data is in the /usr/local/share/tessdata/ directory. The latest trained data is ‘train4’. I ran two tests with the trained data. First, I used the train4 data.
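
A sketch of how such a test run could be scripted, so the same image can be tried against different trained data. It uses the standard `tesseract <image> <outputbase> -l <lang>` form; the image name `sample1.tif` and the function names are hypothetical, and the optional `--tessdata-dir` argument is only needed if Tesseract does not already look in the right directory.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang, tessdata_dir=None):
    """Return the argument list for: tesseract image output_base -l lang."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at the directory holding <lang>.traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image, output_base, lang, tessdata_dir=None):
    """Run Tesseract and return the recognized text from output_base.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang, tessdata_dir),
                   check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample image; only the commands are printed here.
    for lang in ("train4", "eng"):
        print(" ".join(build_tesseract_command("sample1.tif", f"out_{lang}", lang)))
```

Writing each language's output to its own file (`out_train4.txt`, `out_eng.txt`) keeps the two results side by side for comparison.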

The result is a shame! Then I wanted to use the default ‘eng’ trained data to see the comparison.

The result is better, but still crap!
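
To put a number on "better" instead of judging by eye, one option is to type out the expected text for a sample by hand and score each OCR output against it. This is only a rough sketch using Python's `difflib`; the sample strings and model names below are invented placeholders, not my actual results.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace so line breaks do not affect the score."""
    return " ".join(text.split())


def similarity(expected, actual):
    """Return a 0..1 similarity ratio between expected text and OCR output."""
    return SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()


def rank_models(expected, outputs):
    """Sort (model name, score) pairs from best to worst match."""
    scores = {name: similarity(expected, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    expected = "The quick brown fox"
    outputs = {"model_a": "Tke qu1ck brovvn f0x", "model_b": "The quick brown fax"}
    for name, score in rank_models(expected, outputs):
        print(f"{name}: {score:.2f}")
```

A ratio of 1.0 means the output matches the expected text exactly after whitespace normalization; anything lower gives a rough way to compare two sets of trained data on the same sample.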
I see that in the /usr/local/share/tessdata/ directory there are many files related to ‘eng’.

But for my own training I only have one file.

I think I didn't prepare the training correctly, or maybe I missed something?
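
A small sketch for comparing what is actually installed for each language in the tessdata directory. The helper name is my own, and it simply matches files whose names start with `<lang>.`, which I am assuming is how the files are named.

```python
from pathlib import Path


def files_for_language(tessdata_dir, lang):
    """Return the sorted names of files in tessdata_dir that start with '<lang>.'."""
    return sorted(p.name for p in Path(tessdata_dir).glob(f"{lang}.*"))


if __name__ == "__main__":
    tessdata = "/usr/local/share/tessdata"
    for lang in ("eng", "train4"):
        names = files_for_language(tessdata, lang)
        print(f"{lang}: {len(names)} file(s)")
        for name in names:
            print("  " + name)
```

As far as I understand, a single `.traineddata` file bundles several components, and `combine_tessdata -u` can unpack one for inspection, so having only one file may not be the problem by itself.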
Some tutorials to read:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00
http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/

 
