Installing Tesseract OCR on Ubuntu 14.04

Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
http://hanzratech.in/2015/01/16/ocr-using-tesseract-on-ubuntu-14-04.html

Compilation:

  1. Clone the package from GitHub
  2. Go to the new directory
  3. Run autogen
  4. Run configure

    Here I couldn’t get configure to succeed: it kept complaining about Leptonica 1.74.

    I had already installed Leptonica, but the error still persisted!

    SOLUTION: I had to compile and install Leptonica 1.74 manually (read: http://myprojects.advchaweb.com/index.php/2017/02/02/installing-leptonica-1-74-1-on-ubuntu-14-04/).
    Now it succeeds!
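
Steps 1 to 4 can be sketched as a small shell function. This is only a sketch under a few assumptions: the clone URL is the repository from the reference at the top, the helper names `version_at_least` and `configure_tesseract` are made up for illustration, and the Leptonica check assumes that the manually installed Leptonica registered a `lept` module with pkg-config.

```shell
# Succeeds when dotted version $1 is at least $2 (relies on GNU `sort -V`).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Steps 1-4. Stops before cloning if pkg-config does not
# report Leptonica 1.74 or newer.
configure_tesseract() {
    found="$(pkg-config --modversion lept 2>/dev/null)" || found=""
    if [ -z "$found" ] || ! version_at_least "$found" 1.74; then
        echo "Leptonica >= 1.74 not found (pkg-config reports: ${found:-nothing})" >&2
        return 1
    fi
    git clone https://github.com/tesseract-ocr/tesseract.git &&
    cd tesseract &&
    ./autogen.sh &&
    ./configure
}
```

Calling `configure_tesseract` from an empty working directory runs the four steps in order. The version check is only a convenience so the Leptonica problem shows up before the long autogen and configure run.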

     
  5. Since we had to compile Leptonica ourselves to get version 1.74, we should build Tesseract with LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make instead of plain make.

     
  6. make install

     
  7. sudo ldconfig

     
  8. make training

     
  9. Install the training tools

    For the training tutorial, please read https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 (or https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract for older Tesseract versions). It says Tesseract 4.00 uses a neural network-based recognition engine. It also says: “Tesseract 4.00 takes a few days to a couple of weeks. Even with all this new training data, you might find it inadequate for your particular problem, and therefore you are here wanting to retrain it.” A few days to a couple of weeks? That seems about the same as training OpenCV. If I have plenty of time I can do that. Actually, it’s interesting!
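
Steps 5 to 9 can be sketched as two shell functions, to be run from the top of the tesseract source tree after configure has succeeded. The function names are made up for illustration. The `training` and `training-install` make targets are the ones I know from the Compiling wiki page referenced at the top, and running the install steps through `sudo` is an assumption (needed when installing under /usr/local as a normal user).

```shell
# Steps 5-7: build against the Leptonica installed under /usr/local,
# install, and refresh the shared-library cache.
build_tesseract() {
    LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make &&
    sudo make install &&
    sudo ldconfig
}

# Steps 8-9: build and install the training tools.
build_training_tools() {
    make training &&
    sudo make training-install
}
```

Each function stops at the first failing command, so `build_tesseract && build_training_tools` only reaches the training tools when the main build and install went through.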
  10. For visual debugging, build ScrollView.jar

    Export ‘SCROLLVIEW_PATH’:
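
A minimal sketch of this step, assuming it is run from the top of the tesseract source tree (so that the jar ends up under ./java). The function name `build_scrollview` is made up for illustration.

```shell
# Build the viewer jar, then tell Tesseract where to find it.
# Run this from the top of the tesseract source tree.
build_scrollview() {
    make ScrollView.jar &&
    export SCROLLVIEW_PATH="$PWD/java"
}
```

The export only lasts for the current shell session. To keep it, the same `export` line could go into ~/.bashrc with the full path written out.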

     

     

  11. Install Language
    For English and many other language files, please see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. I downloaded the English language data from https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.eng.tar.gz/download.
    <OLD>Then extract the archive ‘tesseract-ocr-3.02.eng.tar.gz’ and move all the files in /tesseract-ocr/tessdata/ to /tesseract/tessdata/.
    Don’t forget to set ‘TESSDATA_PREFIX’!

    </OLD>
    <NEW>It’s much better to copy or move the ‘tessdata’ directory into /usr/local/share/tessdata/ than into /cpp/tesseract/tessdata/ as above (the OLD approach), because then we don’t have to type ‘export TESSDATA_PREFIX…’ every time we need to scan an image. I did this after trying the old approach.

    </NEW>
    At first I forgot this step. When I ran a test, Tesseract printed an error message.
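
Before running a test, it helps to check that the language file really is where Tesseract will look. The helper below is a hypothetical sketch (the name `find_traineddata` is mine): it prints the first directory in a list that contains `<lang>.traineddata`.

```shell
# Print the first directory that contains <lang>.traineddata.
# Usage: find_traineddata LANG DIR [DIR...]
find_traineddata() {
    lang="$1"
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            echo "$dir"
            return 0
        fi
    done
    echo "no $lang.traineddata found" >&2
    return 1
}
```

For example, `find_traineddata eng /usr/local/share/tessdata /cpp/tesseract/tessdata` shows which of the two locations from this step holds the English data. Running `tesseract --list-langs` afterwards should list `eng` if Tesseract can see it. If TESSDATA_PREFIX is still needed, note that whether it should name the tessdata directory itself or its parent differs between Tesseract versions.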
  12. Check the Tesseract version (with -v or --version)

     
  13. Test!

    Here is the source image (‘phototest.tif’), and here is the result in /tesseract/output.txt

    NOTE: It’s better not to use a filename that contains spaces, because Tesseract can’t find it! (This is probably the shell splitting an unquoted name into separate arguments, so quoting the path should also work.)
    Also use a high-resolution image! Or we can do the training (if we have plenty of time!).
    Other tests:

    But I found the results still not good! Many weird characters, spelling mistakes, etc.
    Another test from the terminal:

    Display the result in the terminal:

    Open the original image so we can compare it:
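
The test run can be sketched as a small wrapper. It assumes the English data from step 11 is installed; the function name `ocr_image` and the default output name are made up for illustration. Tesseract writes its text to `<outputbase>.txt`, and quoting the image path keeps a filename with spaces in one piece.

```shell
# OCR one image into <outbase>.txt, then print the recognized text.
# Usage: ocr_image IMAGE [OUTBASE]
ocr_image() {
    image="$1"
    outbase="${2:-output}"
    tesseract "$image" "$outbase" -l eng && cat "$outbase.txt"
}
```

For example, `ocr_image phototest.tif output` writes output.txt in the current directory and prints it. To skip the file and print straight to the terminal, `tesseract phototest.tif stdout` uses the special output name `stdout`.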

     
  14. Question: Can we use Tesseract to read/scan a PDF file?
    Please read: http://kiirani.com/2013/03/22/tesseract-pdf.html
    http://www.barryhubbard.com/linux/converting-pdf-to-text-using-tesseract/
    http://stackoverflow.com/questions/30925218/converting-a-pdf-to-text-using-tesseract-ocr
    Tesseract can’t do this directly. These articles say to convert the PDF to a TIFF image first, and a multi-page PDF needs to be converted to multiple TIFF files. Read http://www.barryhubbard.com/linux/converting-pdf-to-text-using-tesseract/ to see a script that does that.
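
The convert-then-OCR idea can be sketched as follows. This is not the script from the linked article: it is a minimal sketch that assumes `pdftoppm` (from poppler-utils) is installed and uses it to write one 300 dpi TIFF per page, then runs Tesseract on each page in order. The function name `pdf_to_text` is made up for illustration.

```shell
# Convert each page of a PDF to a TIFF, OCR the pages in order,
# and collect the text in <name>.txt next to the PDF.
# Usage: pdf_to_text FILE.pdf
pdf_to_text() {
    pdf="$1"
    base="${pdf%.pdf}"
    # Writes <base>-page-<n>.tif, one file per page, at 300 dpi.
    pdftoppm -r 300 -tiff "$pdf" "${base}-page" || return 1
    for page in "${base}-page"*.tif; do
        [ -f "$page" ] || continue
        tesseract "$page" stdout
    done > "$base.txt"
}
```

For example, `pdf_to_text scan.pdf` leaves the page images beside the PDF and the combined text in scan.txt. The 300 dpi setting follows the earlier note about using high-resolution images; it is a starting point, not a tuned value.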
