PDF TO TEXT CONVERTER

Reference: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

I have already installed and tested a few open source tools: pdfparser, PdfBox and pdfminer.

PDFPARSER:

  1. clone from https://github.com/smalot/pdfparser
  2. composer install
  3. test by creating a new file /works/pdfparser/index.php

    NOT GOOD!

PDFBOX:

  1. clone from https://github.com/schmengler/PdfBox
  2. composer install
  3. download the jar file (pdfbox-app-2.4.0.jar) from http://pdfbox.apache.org/index.html,
    then move it to /usr/bin
  4. test by creating a new file /works/PdfBox/index.php

    BUT IT NEVER WORKS BECAUSE IT ALWAYS COMPLAINS ABOUT ‘RUNTIME ERROR CANNOT ACCESS JARFILE’. I ALREADY SET THE PERMISSIONS BUT THE ERROR STILL PERSISTS, SO I RAN THE JAR VIA TERMINAL (READ: http://pdfbox.apache.org/2.0/commandline.html)

    BUT NOT GOOD!
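
A rough sketch of running the jar from a script instead of typing the command by hand. It is written for Python 3 and assumes `java` is on the PATH. The function names are made up for this note; `ExtractText` is the tool name from the PDFBox 2.0 command line page linked above. The check on the jar path is there because Java usually gives the "cannot access jarfile" kind of error when the path passed to `-jar` is wrong or the file is not readable, so passing an absolute path may help.

```python
import os
import subprocess


def build_extract_command(jar_path, pdf_path, txt_path):
    """Return the argument list for PDFBox's ExtractText tool (hypothetical helper)."""
    # Use an absolute path so the command does not depend on the working directory.
    jar_path = os.path.abspath(jar_path)
    # Fail early with a clear message instead of letting java complain later.
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("jar file not found: " + jar_path)
    if not os.access(jar_path, os.R_OK):
        raise PermissionError("jar file is not readable: " + jar_path)
    return ["java", "-jar", jar_path, "ExtractText", pdf_path, txt_path]


def extract_text(jar_path, pdf_path, txt_path):
    """Run PDFBox and write the extracted text to txt_path."""
    subprocess.run(build_extract_command(jar_path, pdf_path, txt_path), check=True)
```

For example, `extract_text("/usr/bin/pdfbox-app-2.4.0.jar", "input.pdf", "output.txt")` would run the jar from step 3. The input and output file names here are placeholders.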

PDFMINER INSTALLATION:
It uses Python 2.7.

  1. clone from the source (https://github.com/euske/pdfminer/)
  2. Run setup.py

    Here I had to use ‘sudo’ to install because I got a permission error without it.
  3. Test

    THE BEST OF THE THREE BUT STILL NOT GOOD! SELECTING ALL THE TEXT AND THEN COPY-PASTING IT IS STILL BETTER!
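
A minimal sketch of the test step, assuming the install put pdfminer's `pdf2txt.py` script on the PATH. The wrapper itself is Python 3 and only shells out to the script, so it does not care which Python pdfminer runs under. The function names and file names are made up for this note; `-o` (output file) and `-t text` (plain-text output) are options of `pdf2txt.py`.

```python
import subprocess


def build_pdf2txt_command(pdf_path, txt_path, script="pdf2txt.py"):
    """Return the argument list that runs pdf2txt.py on one PDF (hypothetical helper)."""
    # -o names the output file; -t text asks for plain text rather than html/xml.
    return [script, "-o", txt_path, "-t", "text", pdf_path]


def pdf_to_text(pdf_path, txt_path):
    """Convert one PDF and return the extracted text."""
    subprocess.run(build_pdf2txt_command(pdf_path, txt_path), check=True)
    # Assumes the script wrote UTF-8, which should be its default codec.
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()
```

Calling `pdf_to_text("input.pdf", "output.txt")` should leave the text in `output.txt`, which makes it easy to compare against the copy-paste result.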

I ALSO INSTALLED AND TRIED TESSERACT, BUT IT WAS EVEN WORSE THAN THE ABOVE!!!