
Reference: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

I have already installed and tested a few good open-source tools: pdfparser, PdfBox and pdfminer.

PDFPARSER:

clone from https://github.com/smalot/pdfparser
composer install

test by creating a new file index.php in /works/pdfparser/index.php

<?php
    ini_set('display_errors', 1);
    ini_set('display_startup_errors', 1);
    error_reporting(E_ALL);

    // Include Composer autoloader if not already done.
    //include 'vendor/autoload.php';
    require_once('vendor/autoload.php');
    // Parse pdf file and build necessary objects.
    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile('export1.pdf');
    //$pdf    = $parser->parseFile('samples/Document1_foxitreader.pdf');
    $pages  = $pdf->getPages(); 
    // Loop over each page to extract text.
    foreach ($pages as $page) {
        echo nl2br($page->getText());
        //echo $page->getText();
    }
?>

NOT GOOD!

PDFBOX:

clone from https://github.com/schmengler/PdfBox
composer install
download jarfile from http://pdfbox.apache.org/index.html (pdfbox-app-2.4.0.jar)
then move it to /usr/bin

test by creating a new file index.php in /works/PdfBox/index.php

<?php
    ini_set('display_errors', 1);
    ini_set('display_startup_errors', 1);
    error_reporting(E_ALL);

    // Include Composer autoloader if not already done.
    //include 'vendor/autoload.php';
    require_once('vendor/autoload.php');
    //use SGH\PdfBox;

    //$pdf = GENERATED_PDF;
    $converter = new \SGH\PdfBox\PdfBox();
    $converter->setPathToPdfBox('/usr/bin/pdfbox-app-2.4.0.jar');
    //$text = $converter->textFromPdfStream($pdf);
    $text = $converter->textFromPdfFile('export1.pdf');
    echo $text;
?>

BUT IT NEVER WORKS, BECAUSE IT ALWAYS COMPLAINS ‘RUNTIME ERROR CANNOT ACCESS JARFILE’. I ALREADY SET THE PERMISSIONS, BUT THE ERROR STILL PERSISTS. SO I RAN THE JAR FROM THE TERMINAL INSTEAD (SEE: http://pdfbox.apache.org/2.0/commandline.html)

java -jar pdfbox-app-2.4.0.jar ExtractText export1.pdf
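A ‘cannot access jarfile’ complaint usually means Java could not open the path passed to -jar, so it can help to check that path before launching anything. Below is a minimal sketch in Python 3, assuming the jar sits at /usr/bin/pdfbox-app-2.4.0.jar as above. The function names extract_text_command and extract_text are made up for illustration, and -console is the PDFBox 2.0 command-line option that prints the text to the terminal instead of writing a file.

```python
import os
import subprocess


def extract_text_command(jar_path, pdf_path):
    """Build the PDFBox ExtractText command, failing early on a bad jar path."""
    jar_path = os.path.abspath(jar_path)
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("jar not found: " + jar_path)
    if not os.access(jar_path, os.R_OK):
        raise PermissionError("jar not readable by this user: " + jar_path)
    # -console sends the extracted text to stdout instead of a .txt file.
    return ["java", "-jar", jar_path, "ExtractText", "-console", pdf_path]


def extract_text(jar_path, pdf_path):
    """Run PDFBox and return the extracted text (Java must be installed)."""
    cmd = extract_text_command(jar_path, pdf_path)
    return subprocess.check_output(cmd).decode("utf-8", "replace")
```

Calling extract_text("/usr/bin/pdfbox-app-2.4.0.jar", "export1.pdf") would then return the same text the terminal command prints. Using an absolute path means the result does not depend on the directory the script is started from.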


BUT NOT GOOD!

PDFMINER INSTALLATION:
It uses Python 2.7

clone from the source (https://github.com/euske/pdfminer/)

teddy@teddy-K43SJ:~/Documents/python$ git clone https://github.com/euske/pdfminer.git

Run setup.py from inside the cloned pdfminer directory

teddy@teddy-K43SJ:~/Documents/python$ sudo python setup.py install

I had to use ‘sudo’ here because the install failed with a permission error without it
Test

teddy@teddy-K43SJ:~/Documents/python$ pdf2txt.py -o export1.txt export1.pdf
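To run the same test on more than one file, the command can be wrapped in a small script. This is only a sketch in Python 3, assuming pdf2txt.py is on the PATH after the install above. The helper names pdf2txt_command and convert_all are hypothetical, and -o is the only pdf2txt.py option used, exactly as in the command above.

```python
import os
import subprocess


def pdf2txt_command(pdf_path):
    """Return the pdf2txt.py command that writes <name>.txt next to <name>.pdf."""
    base, _ext = os.path.splitext(pdf_path)
    return ["pdf2txt.py", "-o", base + ".txt", pdf_path]


def convert_all(folder):
    """Run pdf2txt.py on every .pdf file in folder (pdfminer must be installed)."""
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf"):
            subprocess.check_call(pdf2txt_command(os.path.join(folder, name)))
```

For example, convert_all(".") would convert every PDF in the current directory, one .txt file per PDF.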

THE BEST OF THE THREE, BUT STILL NOT GOOD! SELECTING ALL THE TEXT AND COPY-PASTING IT IS STILL BETTER!
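One way to put a number on "not good" is to score each tool's output against the text copied by hand from the PDF. This is not something I measured for this post, just a rough sketch in Python 3 using difflib from the standard library. The function names normalize and similarity, and the file name reference.txt in the note below, are assumptions for illustration.

```python
import difflib


def normalize(text):
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(extracted, reference):
    """Return a ratio from 0.0 to 1.0 of how closely extracted matches reference."""
    matcher = difflib.SequenceMatcher(
        None, normalize(extracted), normalize(reference), autojunk=False
    )
    return matcher.ratio()
```

Reading export1.txt and a hand-copied reference.txt into two strings and passing them to similarity() gives one score per tool, where 1.0 means the texts are identical apart from whitespace. SequenceMatcher gets slow on very long texts, so this is better suited to a page or two at a time.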

I ALSO TRIED INSTALLING TESSERACT, BUT THE RESULT WAS EVEN WORSE THAN THE ABOVE!!!


PDF TO TEXT CONVERTER