Tesseract Training

OK. This is really hard, and I don't yet know enough to do the training process. I have a lot to learn before I understand the whole process, and I need to spend more time on it. So far I have spent two days on this.

The test files are in the /Documents/cpp/tesseract/testing/ directory. First I need to convert the PDF file (export1.pdf) to a TIFF image at a density of 500 dpi, keeping the quality at 100%.
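The conversion above can be sketched as a small command builder. This is only a sketch under an assumption: the notes do not say which tool did the conversion, so ImageMagick's `convert` is assumed here, and the function name `pdf_to_tiff_command` is made up for illustration. The script only prints the command; it does not run it.

```python
import shlex


def pdf_to_tiff_command(pdf_path, tiff_path, density_dpi=500, quality=100):
    """Build an ImageMagick `convert` command that rasterises a PDF to TIFF.

    Assumption: ImageMagick is the conversion tool (not stated in the notes).
    """
    return [
        "convert",
        "-density", str(density_dpi),  # given before the input so the PDF is rendered at this resolution
        pdf_path,
        "-quality", str(quality),
        tiff_path,
    ]


cmd = pdf_to_tiff_command("export1.pdf", "export1.tif")
print(shlex.join(cmd))  # shlex.join needs Python 3.8 or newer
```

At 500 dpi an A4 page is roughly 4100 x 5800 pixels, so an uncompressed multi-page TIFF of well over 100 MB is not surprising.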

I got the TIFF image export1.tif, and it is more than 190 MB! I don't know whether that is correct. I sliced export1.tif into 14 smaller TIFF images to use as samples for training Tesseract to recognize the text in them. Then I used LIOS (Application -> Graphics -> Tesseract-Trainer) to train on the 14 images and make the boxes for them. Another article I read says that more than 10 samples are needed for a good result; I have 14, so I think that should be enough.

This training process is really tedious. I have to check each character and make sure every one is correct. Sometimes (many times) it showed the wrong characters and the wrong box selection. I need to create a tutorial on how to train on images with LIOS!

In the end it failed miserably when I tested with the trained data. The trained data is in the /usr/local/share/tessdata/ directory, and the latest one is ‘train4’. I ran two tests with the trained data. First, I used the train4 data:

The result is a shame! Then I used the default ‘eng’ trained data to see the comparison:

The result is better but still crap!
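To make the comparison between two language models less subjective, the OCR output can be scored against a hand-typed reference. The sketch below is not part of the original tests: the function name `ocr_similarity` and the sample strings are invented placeholders, and the score is just the similarity ratio from Python's standard `difflib` module.

```python
import difflib


def ocr_similarity(expected_text, ocr_text):
    """Return a similarity ratio between 0.0 and 1.0 for two texts."""
    # Collapse runs of whitespace so differing line breaks do not dominate the score
    expected = " ".join(expected_text.split())
    actual = " ".join(ocr_text.split())
    return difflib.SequenceMatcher(None, expected, actual).ratio()


# Placeholder strings only; real input would be the text files written by e.g.
#   tesseract sample.tif out_train4 -l train4
#   tesseract sample.tif out_eng -l eng
reference = "hello world"
for label, text in [("sample A", "hcllo w0rld"), ("sample B", "hello world")]:
    print(label, round(ocr_similarity(reference, text), 2))
```

A higher ratio means the output is closer to the reference, which gives a rough number for how much better one model is than the other.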
In the /usr/local/share/tessdata/ directory I see many files related to ‘eng’:

But I have only one file:

I think I didn't prepare the training correctly, or maybe I missed something?
Some tutorials to read:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00
http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/

 

LIOS On Ubuntu 14.04

Linux-Intelligent-Ocr-Solution (LIOS)
Easy-OCR solution and Tesseract trainer for GNU/Linux
Download link: https://sourceforge.net/projects/lios
The package (.deb) I currently have is version 2.5, but I ran into some bugs.

So I tried to install the newest version (version 3) from the source at https://github.com/Nalin-x-Linux/lios-3:

  1. Clone the repository

     
  2. Before installing this Python package, it is much better to set up a virtualenv, because we will install some packages. I already have virtualenv installed and use Python 3 in it. Read the tutorial http://myprojects.advchaweb.com/index.php/2016/08/24/my-ubuntu-14-04-and-apps-installation-error-and-solution/. Create a new virtualenv ‘lios’

    NOTE: if the virtualenv ‘lios’ already exists, we can enter it with the ‘workon lios’ command!
    Check the Python version

     
  3. Go to the local ‘lios-3’ directory
  4. Install

     
  5. Run the program ‘lios’. But I got an error:

    Solution: Install the ‘gi’ module via pip

    Run the program again, but there is another error:

    Solution: This error occurred because in Python 3 ‘print’ is a function, so its arguments must be in parentheses: print(‘Hello’), not print ‘Hello’. So modify the file ‘/home/teddy/.virtualenvs/lios/lib/python3.4/site-packages/gi/__init__.py’ like this

    Run the program again, and there is another error:

    Solution: This problem gave me a real headache. I first installed pacman, but no, that is for Arch Linux, so remove it. There was no package for ‘gi.repository’, but this link http://stackoverflow.com/questions/31324430/installing-pygobject-via-pip-in-virtualenv?rq=1 gave me a way to solve the problem. In the ‘/usr/lib/python3/dist-packages/gi/’ directory there is a ‘_gobject’ directory that I think I can use, so I just symlinked the directory into the ‘lios’ virtualenv like this:

    NOTE: I need to rename the old ‘gi’ directory I installed before so the symlink can work.
    Do the same for the other modules, ‘enchant’ and ‘speechd’, like this:

    Run it again and the GUI appears, but it is followed by many errors and the version doesn't change (still version 2.0)! -> Don't use it, because it can't be closed!

    I have to kill the terminal manually!
    OK. This problem only happens if I run ‘lios’ from the virtualenv. If I run it as usual, there is no problem!
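The rename-then-symlink workaround from step 5 can be sketched in Python. This is an illustration, not the commands that were actually run: the function name `link_system_modules` and the `.bak` suffix are made up, and the demonstration uses throwaway temporary directories in place of the real ‘/usr/lib/python3/dist-packages’ and the virtualenv's site-packages. It assumes a Linux system where creating symlinks needs no special privileges.

```python
import os
import tempfile


def link_system_modules(system_dir, venv_site_packages, modules):
    """Symlink each named module from system_dir into venv_site_packages.

    An entry that already exists in the virtualenv is first renamed to
    '<name>.bak' (this fails if a non-empty '<name>.bak' is already there).
    Modules missing from system_dir are skipped.
    Returns the names that were linked.
    """
    linked = []
    for name in modules:
        source = os.path.join(system_dir, name)
        target = os.path.join(venv_site_packages, name)
        if not os.path.exists(source):
            continue
        if os.path.lexists(target):
            os.rename(target, target + ".bak")
        os.symlink(source, target)
        linked.append(name)
    return linked


# Demonstration on temporary directories standing in for the real paths
with tempfile.TemporaryDirectory() as root:
    system_dir = os.path.join(root, "dist-packages")
    venv_dir = os.path.join(root, "site-packages")
    for name in ("gi", "enchant", "speechd"):
        os.makedirs(os.path.join(system_dir, name))
    os.makedirs(os.path.join(venv_dir, "gi"))  # stands in for the pip-installed 'gi'
    result = link_system_modules(system_dir, venv_dir, ["gi", "enchant", "speechd"])
    print(result)
```

Given how badly LIOS behaved inside the virtualenv, this is more useful as a record of what was tried than as a recommendation.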

PDF TO TEXT CONVERTER

Reference: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

I have already installed and tested a few good open-source tools: pdfparser, pdfbox, and pdfminer.

PDFPARSER:

 

  1. clone from https://github.com/smalot/pdfparser
  2. composer install
  3. test by creating a new file index.php in /works/pdfparser/index.php

    Not good!

 

PDFBOX:

  1. clone from https://github.com/schmengler/PdfBox
  2. composer install
  3. Download the jar file from http://pdfbox.apache.org/index.html (pdfbox-app-2.4.0.jar),
    then move it to /usr/bin
  4. Test by creating a new file index.php in /works/PdfBox/index.php

    But it never works, because it always complains with ‘RUNTIME ERROR CANNOT ACCESS JARFILE’. I already set the permissions, but the error still persists. So I ran the jar from the terminal (read: http://pdfbox.apache.org/2.0/commandline.html)

    But not good!
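Running the PDFBox jar from the terminal, as in step 4, can be sketched like this. One possible cause of the jar-access error is a jar path that does not resolve from the process's working directory; that is only a guess, since the notes do not confirm the cause. The sketch therefore resolves and checks the path before building the command. The function name `pdfbox_extract_text_command` is made up, `ExtractText` is the PDFBox command-line tool for text extraction, and the demonstration uses an empty temporary file in place of a real jar, so it only prints the command.

```python
import os
import shlex
import tempfile


def pdfbox_extract_text_command(jar_path, pdf_path, txt_path):
    """Build a `java -jar ... ExtractText` command after checking the jar path."""
    jar_path = os.path.abspath(jar_path)  # avoid depending on the current directory
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("jar file not found: " + jar_path)
    if not os.access(jar_path, os.R_OK):
        raise PermissionError("jar file is not readable: " + jar_path)
    return ["java", "-jar", jar_path, "ExtractText", pdf_path, txt_path]


# An empty temporary file stands in for the real pdfbox-app jar
with tempfile.NamedTemporaryFile(suffix=".jar") as fake_jar:
    cmd = pdfbox_extract_text_command(fake_jar.name, "export1.pdf", "export1.txt")
    print(shlex.join(cmd))
```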

PDFMINER INSTALLATION:
It uses Python 2.7

  1. clone from the source (https://github.com/euske/pdfminer/)
  2. Run setup.py

    Here I had to use ‘sudo’ to install, because I got a permission problem without it
  3. Test

    The best of the three, but still not good! Selecting all the text and copy-pasting is still better!
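The pdfminer test in step 3 can be sketched with the `pdf2txt.py` script that ships with pdfminer. The exact command used in the original test is not recorded, so the file names here are placeholders and the helper name `pdf2txt_command` is made up; `-o` (output file) and `-t` (output type) are pdf2txt.py options. The script only prints the command.

```python
import shlex

OUTPUT_TYPES = ("text", "html", "xml", "tag")


def pdf2txt_command(pdf_path, txt_path, output_type="text"):
    """Build a pdf2txt.py command line for the given input and output files."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError("unsupported output type: " + output_type)
    return ["pdf2txt.py", "-t", output_type, "-o", txt_path, pdf_path]


cmd = pdf2txt_command("export1.pdf", "export1.txt")
print(shlex.join(cmd))
```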

I tried installing Tesseract, but it was even worse than the above!

Installing Tesseract OCR on Ubuntu 14.04

Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
http://hanzratech.in/2015/01/16/ocr-using-tesseract-on-ubuntu-14-04.html

Compilation:

  1. clone the package from github
  2. Go to the new dir
  3. autogen
  4. configure

    Here I couldn't configure it correctly, because it always complained about Leptonica 1.74

    I already did

    But the error still persisted!
    Solution: I had to compile and install Leptonica 1.74 manually (read: http://myprojects.advchaweb.com/index.php/2017/02/02/installing-leptonica-1-74-1-on-ubuntu-14-04/)
    Now it succeeds!

     
  5. Since we had to compile Leptonica ourselves to get version 1.74, we should run LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make instead of plain make for Tesseract.

     
  6. make install

     
  7. sudo ldconfig

     
  8. Make training

     
  9. Install the training

    For the training tutorial, please read: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 (or https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract for older Tesseract versions). It says it uses a neural network-based recognition engine. Also: “Tesseract 4.00 takes a few days to a couple of weeks. Even with all this new training data, you might find it inadequate for your particular problem, and therefore you are here wanting to retrain it.” It would take a few days to weeks??? That seems the same as training OpenCV. If I have plenty of time I can do that. Actually it's interesting!
  10. For visual debugging, build ScrollView.jar

    Export ‘SCROLLVIEW_PATH’:

     

     

  11. Install Language
    For example, to install English and many other language files, please see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. I downloaded the English language data from https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.eng.tar.gz/download.
    <OLD>Then extract the archive ‘tesseract-ocr-3.02.eng.tar.gz’, then move all files in /tesseract-ocr/tessdata/ to /tesseract/tessdata/
    Don't forget to set ‘TESSDATA_PREFIX’!

    </OLD>
    <NEW>It is much better to copy/move the ‘tessdata’ directory into /usr/local/share/tessdata/ than into /cpp/tesseract/tessdata/ as above (the OLD way), because then we don't have to type ‘export TESSDATA_PREFIX…’ every time we need to scan an image. I did this after doing the old one

    </NEW>
    At first I forgot about this. When I did a test, here is the message
  12. Check the Tesseract version (with -v or --version)

     
  13. Test!

    Here is the image source (‘phototest.tif’). And here is the result in /tesseract/output.txt

    NOTE: It is better not to use a filename that contains spaces, because Tesseract can't find it! (The shell splits an unquoted name at the spaces, so the name would need quoting.)
    Also use a high-resolution image! Or we can do the training stuff (if we have plenty of time!)
    Other tests

    But I found the result still not good! Many weird characters, spelling mistakes, etc.
    Another test from the terminal:

    Display the result in the terminal:

    Open the original image so we can compare it

     
  14. Question: can we use Tesseract to read/scan a PDF file?
    Please read: http://kiirani.com/2013/03/22/tesseract-pdf.html
    http://www.barryhubbard.com/linux/converting-pdf-to-text-using-tesseract/
    http://stackoverflow.com/questions/30925218/converting-a-pdf-to-text-using-tesseract-ocr
    Tesseract can't do this directly. The articles say to convert the PDF to a TIFF image first! Also, a multi-page PDF needs to be converted to multiple TIFF files! Read http://www.barryhubbard.com/linux/converting-pdf-to-text-using-tesseract/ to see the script that does that!
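The per-page idea from step 14 (one TIFF per PDF page, then one Tesseract run per TIFF) can be sketched as a command plan. This is not the script from the linked article. It assumes ImageMagick's `convert` for the rasterising step, a page count that is already known, and a made-up helper name `pdf_ocr_plan`; the 300 dpi default is also just a placeholder. It only prints the commands.

```python
import os
import shlex


def pdf_ocr_plan(pdf_path, page_count, lang="eng", density_dpi=300):
    """Return one (convert_cmd, tesseract_cmd) pair per PDF page."""
    stem = os.path.splitext(pdf_path)[0]
    plan = []
    for page in range(page_count):
        base = "%s-%03d" % (stem, page)
        tiff = base + ".tif"
        # ImageMagick selects a single page with the file.pdf[N] syntax (zero-based)
        convert_cmd = ["convert", "-density", str(density_dpi),
                       "%s[%d]" % (pdf_path, page), tiff]
        # tesseract writes its text to <outputbase>.txt
        tesseract_cmd = ["tesseract", tiff, base, "-l", lang]
        plan.append((convert_cmd, tesseract_cmd))
    return plan


plan = pdf_ocr_plan("export1.pdf", page_count=3)
for convert_cmd, tesseract_cmd in plan:
    print(shlex.join(convert_cmd))
    print(shlex.join(tesseract_cmd))
```

Each page then ends up in its own text file, which can be concatenated in page order afterwards.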

Installing Leptonica 1.74.1 on Ubuntu 14.04

Reference: http://hanzratech.in/2015/01/16/ocr-using-tesseract-on-ubuntu-14-04.html
http://www.leptonica.org/download.html

  1. Download the newest Leptonica version (1.74.1) here: http://www.leptonica.org/source/leptonica-1.74.1.tar.gz
    I used wget:


     
  2. Extract the archive
  3. Go to the newly created directory (leptonica-1.74.1), then configure

     
  4. make

     
  5. Because my machine didn't have ‘checkinstall’, I installed it first

     
  6. Create the package with ‘checkinstall’

     
  7. Then execute ‘sudo ldconfig’
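The seven steps above can be written down as an ordered command list. The exact commands used originally are not recorded, so this sketch fills them in with the usual forms (`tar -xzf`, `./configure`, `apt-get install`); treat those as assumptions. The helper name `leptonica_build_steps` is made up, and the script only prints the steps rather than running anything.

```python
import shlex

LEPTONICA_VERSION = "1.74.1"
ARCHIVE = "leptonica-%s.tar.gz" % LEPTONICA_VERSION
SOURCE_DIR = "leptonica-%s" % LEPTONICA_VERSION


def leptonica_build_steps():
    """Return (working_directory, command) pairs in the order of the steps."""
    return [
        (".", ["wget", "http://www.leptonica.org/source/" + ARCHIVE]),
        (".", ["tar", "-xzf", ARCHIVE]),
        (SOURCE_DIR, ["./configure"]),
        (SOURCE_DIR, ["make"]),
        (".", ["sudo", "apt-get", "install", "checkinstall"]),
        (SOURCE_DIR, ["sudo", "checkinstall"]),  # builds and installs a .deb package
        (".", ["sudo", "ldconfig"]),
    ]


for cwd, cmd in leptonica_build_steps():
    print("(in %s) %s" % (cwd, shlex.join(cmd)))
```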