{"id":1702,"date":"2017-02-02T21:55:07","date_gmt":"2017-02-02T21:55:07","guid":{"rendered":"http:\/\/myprojects.advchaweb.com\/?p=1702"},"modified":"2017-02-18T14:04:19","modified_gmt":"2017-02-18T14:04:19","slug":"pdf-to-text-converter","status":"publish","type":"post","link":"https:\/\/myprojects.advchaweb.com\/index.php\/2017\/02\/02\/pdf-to-text-converter\/","title":{"rendered":"PDF TO TEXT CONVERTER"},"content":{"rendered":"<p>Reference: <a href=\"http:\/\/okfnlabs.org\/blog\/2016\/04\/19\/pdf-tools-extract-text-and-data-from-pdfs.html\">http:\/\/okfnlabs.org\/blog\/2016\/04\/19\/pdf-tools-extract-text-and-data-from-pdfs.html<\/a><\/p>\n<p>I have already installed and tested a few good open source tools, such as pdfparser, pdfbox and <a href=\"http:\/\/www.unixuser.org\/~euske\/python\/pdfminer\/\">pdfminer<\/a>.<\/p>\n<p>PDFPARSER:<\/p>\n<p>&nbsp;<\/p>\n<ol>\n<li>clone from <a href=\"https:\/\/github.com\/smalot\/pdfparser\">https:\/\/github.com\/smalot\/pdfparser<\/a><\/li>\n<li>run composer install<\/li>\n<li>test by creating a new file \/works\/pdfparser\/index.php\n<pre class=\"lang:default decode:true \">&lt;?php\r\n    ini_set('display_errors', 1);\r\n    ini_set('display_startup_errors', 1);\r\n    error_reporting(E_ALL);\r\n\r\n    \/\/ Include Composer autoloader if not already done.\r\n    \/\/include 'vendor\/autoload.php';\r\n    require_once('vendor\/autoload.php');\r\n    \/\/ Parse pdf file and build necessary objects.\r\n    $parser = new \\Smalot\\PdfParser\\Parser();\r\n    $pdf    = $parser-&gt;parseFile('export1.pdf');\r\n    \/\/$pdf    = $parser-&gt;parseFile('samples\/Document1_foxitreader.pdf');\r\n    $pages  = $pdf-&gt;getPages(); \r\n    \/\/ Loop over each page to extract text.\r\n    foreach ($pages as $page) {\r\n        echo nl2br($page-&gt;getText());\r\n        \/\/echo $page-&gt;getText();\r\n    }\r\n?&gt;<\/pre>\n<p>NOT GOOD!<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>PDFBOX:<\/p>\n<ol>\n<li>clone from 
https:\/\/github.com\/schmengler\/PdfBox<\/li>\n<li>run composer install<\/li>\n<li>download the jar file from http:\/\/pdfbox.apache.org\/index.html (pdfbox-app-2.4.0.jar),<br \/>\nthen move it to \/usr\/bin<\/li>\n<li>test by creating a new file \/works\/PdfBox\/index.php\n<pre class=\"lang:default decode:true\">&lt;?php\r\n    ini_set('display_errors', 1);\r\n    ini_set('display_startup_errors', 1);\r\n    error_reporting(E_ALL);\r\n\r\n    \/\/ Include Composer autoloader if not already done.\r\n    \/\/include 'vendor\/autoload.php';\r\n    require_once('vendor\/autoload.php');\r\n    \/\/use SGH\\PdfBox;\r\n\r\n    \/\/$pdf = GENERATED_PDF;\r\n    $converter = new \\SGH\\PdfBox\\PdfBox();\r\n    $converter-&gt;setPathToPdfBox('\/usr\/bin\/pdfbox-app-2.4.0.jar');\r\n    \/\/$text = $converter-&gt;textFromPdfStream($pdf);\r\n    $text = $converter-&gt;textFromPdfFile('export1.pdf');\r\n    echo $text;\r\n?&gt;<\/pre>\n<p>BUT IT NEVER WORKS BECAUSE IT ALWAYS COMPLAINS ABOUT &#8216;RUNTIME ERROR CANNOT ACCESS JARFILE&#8217;. I ALREADY SET THE PERMISSIONS BUT THE ERROR STILL PERSISTS. 
SO I RAN THE JAR FROM THE TERMINAL (SEE: http:\/\/pdfbox.apache.org\/2.0\/commandline.html)<\/p>\n<pre class=\"lang:default decode:true \">java -jar pdfbox-app-2.4.0.jar ExtractText export1.pdf<\/pre>\n<p>BUT STILL NOT GOOD!<\/li>\n<\/ol>\n<p>PDFMINER INSTALLATION:<br \/>\nIt uses Python 2.7<\/p>\n<ol>\n<li>clone from the source (https:\/\/github.com\/euske\/pdfminer\/)\n<pre class=\"lang:default decode:true\">teddy@teddy-K43SJ:~\/Documents\/python$ git clone https:\/\/github.com\/euske\/pdfminer.git<\/pre>\n<\/li>\n<li>Run setup.py (from inside the cloned pdfminer directory)\n<pre class=\"lang:default decode:true\">teddy@teddy-K43SJ:~\/Documents\/python$ sudo python setup.py install<\/pre>\n<p>Here I had to use &#8216;sudo&#8217; to install because I got a permission error without it<\/li>\n<li>Test\n<pre class=\"lang:default decode:true \">teddy@teddy-K43SJ:~\/Documents\/python$ pdf2txt.py -o export1.txt export1.pdf<\/pre>\n<p>BETTER THAN THE OTHERS BUT STILL NOT GOOD! SELECTING ALL THE TEXT AND COPY-PASTING IT IS STILL BETTER!<\/li>\n<\/ol>\n<p>I TRIED TO INSTALL <a href=\"http:\/\/myprojects.advchaweb.com\/index.php\/2017\/02\/02\/installing-tesseract-ocr-on-ubuntu-14-04\/\">TESSERACT<\/a> BUT IT WAS EVEN WORSE THAN THE ABOVE!!!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reference: http:\/\/okfnlabs.org\/blog\/2016\/04\/19\/pdf-tools-extract-text-and-data-from-pdfs.html I have already installed and tested a few good open source tools, such as pdfparser, pdfbox and pdfminer. PDFPARSER: &nbsp; clone from https:\/\/github.com\/smalot\/pdfparser composer install test by creating a new file index.php in \/works\/pdfparser\/index.php &lt;?php ini_set(&#8216;display_errors&#8217;, 1); ini_set(&#8216;display_startup_errors&#8217;, 1); error_reporting(E_ALL); \/\/ Include Composer autoloader if not already done. 
\/\/include &#8216;vendor\/autoload.php&#8217;; require_once(&#8216;vendor\/autoload.php&#8217;); \/\/ Parse pdf file &hellip; <a href=\"https:\/\/myprojects.advchaweb.com\/index.php\/2017\/02\/02\/pdf-to-text-converter\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;PDF TO TEXT CONVERTER&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[48,19,58],"tags":[],"class_list":["post-1702","post","type-post","status-publish","format-standard","hentry","category-php-2","category-python","category-tesseract"],"_links":{"self":[{"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/posts\/1702","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/comments?post=1702"}],"version-history":[{"count":5,"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/posts\/1702\/revisions"}],"predecessor-version":[{"id":1707,"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/posts\/1702\/revisions\/1707"}],"wp:attachment":[{"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/media?parent=1702"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/categories?post=1702"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/myprojects.advchaweb.com\/index.php\/wp-json\/wp\/v2\/tags?post=1702"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]
}}