Reference: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html
I have already installed and tested a few good open source tools, such as pdfparser, pdfbox and pdfminer.
PDFPARSER:
- clone from https://github.com/smalot/pdfparser
- composer install
- test by creating a new file index.php in /works/pdfparser/index.php
    <?php
    ini_set('display_errors', 1);
    ini_set('display_startup_errors', 1);
    error_reporting(E_ALL);

    // Include Composer autoloader if not already done.
    //include 'vendor/autoload.php';
    require_once('vendor/autoload.php');

    // Parse pdf file and build necessary objects.
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile('export1.pdf');
    //$pdf = $parser->parseFile('samples/Document1_foxitreader.pdf');
    $pages = $pdf->getPages();

    // Loop over each page to extract text.
    foreach ($pages as $page) {
        echo nl2br($page->getText());
        //echo $page->getText();
    }
    ?>
NOT GOOD!
PDFBOX:
- clone from https://github.com/schmengler/PdfBox
- composer install
- download the jar file from http://pdfbox.apache.org/index.html (pdfbox-app-2.4.0.jar), then move it to /usr/bin
- test by creating a new file index.php in /works/PdfBox/index.php
    <?php
    ini_set('display_errors', 1);
    ini_set('display_startup_errors', 1);
    error_reporting(E_ALL);

    // Include Composer autoloader if not already done.
    //include 'vendor/autoload.php';
    require_once('vendor/autoload.php');

    //use SGH\PdfBox;
    //$pdf = GENERATED_PDF;
    $converter = new \SGH\PdfBox\PdfBox();
    $converter->setPathToPdfBox('/usr/bin/pdfbox-app-2.4.0.jar');
    //$text = $converter->textFromPdfStream($pdf);
    $text = $converter->textFromPdfFile('export1.pdf');
    echo $text;
    ?>
BUT IT NEVER WORKS BECAUSE IT ALWAYS COMPLAINS ABOUT ‘RUNTIME ERROR CANNOT ACCESS JARFILE’. I ALREADY SET THE PERMISSIONS BUT THE ERROR STILL PERSISTS. SO I RAN THE JAR VIA THE TERMINAL (READ: http://pdfbox.apache.org/2.0/commandline.html)
    java -jar pdfbox-app-2.4.0.jar ExtractText export1.pdf
BUT NOT GOOD!
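A possible way to narrow down the jarfile error: Java usually gives that kind of complaint when the path passed to -jar does not resolve to a readable file for the user running the process. The sketch below is only an illustration in Python, not part of the PdfBox wrapper. The function names build_extract_command and extract_text are hypothetical. It checks the jar path first, then builds the same ExtractText command used in the terminal above, with an explicit output file added.

```python
import os
import subprocess


def build_extract_command(jar_path, pdf_path, out_path):
    """Return the java command line for PDFBox's ExtractText tool.

    Hypothetical helper: it fails early with a Python exception when the
    jar is missing or unreadable, instead of waiting for Java to complain.
    """
    # Resolve to an absolute path so the result does not depend on the
    # working directory of whatever process launches java.
    jar_path = os.path.abspath(jar_path)
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("jar not found: " + jar_path)
    if not os.access(jar_path, os.R_OK):
        raise PermissionError("jar not readable: " + jar_path)
    return ["java", "-jar", jar_path, "ExtractText", pdf_path, out_path]


def extract_text(jar_path, pdf_path, out_path):
    """Run ExtractText; assumes a Java runtime is available on PATH."""
    subprocess.run(build_extract_command(jar_path, pdf_path, out_path), check=True)
```

For example, extract_text('/usr/bin/pdfbox-app-2.4.0.jar', 'export1.pdf', 'export1.txt') would run the same extraction as the terminal command, assuming the jar really is at that path.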
PDFMINER INSTALLATION:
It uses Python 2.7
- clone from the source (https://github.com/euske/pdfminer/)
    teddy@teddy-K43SJ:~/Documents/python$ git clone https://github.com/euske/pdfminer.git
- Run setup.py
    teddy@teddy-K43SJ:~/Documents/python$ sudo python setup.py install
  Here I have to use ‘sudo’ to install because I got a permission problem without it
- Test
    teddy@teddy-K43SJ:~/Documents/python$ pdf2txt.py -o export1.txt export1.pdf
BEST OF THE THREE, BUT STILL NOT GOOD! SELECTING ALL THE TEXT AND COPY-PASTING IS STILL BETTER!
I TRIED TO INSTALL TESSERACT BUT IT WAS EVEN WORSE THAN THE ABOVE!!!
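The verdicts above are judged by eye. One rough way to put a number on "good" or "not good" is to compare each tool's output against the copy-pasted text as a reference. This is only a sketch using Python's standard difflib module. The helper names normalize and similarity are my own, and the score is a crude character-level ratio, not a proper extraction benchmark.

```python
import difflib


def normalize(text):
    """Collapse every run of whitespace to one space, so that pure
    layout differences (line breaks, indentation) are not counted."""
    return " ".join(text.split())


def similarity(extracted, reference):
    """Return a ratio between 0 and 1 for how closely the extracted
    text matches the reference text (1 means identical after normalizing)."""
    matcher = difflib.SequenceMatcher(None, normalize(extracted), normalize(reference))
    return matcher.ratio()
```

To use it, read export1.txt (the pdf2txt.py output) and a file holding the copy-pasted text (a hypothetical reference.txt) into two strings and pass them to similarity. A higher ratio means the tool's output is closer to the copy-paste result.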