How to Extract PDF tables using Perl

by perlPsycho (Initiate)
on May 11, 2016 at 04:53 UTC

perlPsycho has asked for the wisdom of the Perl Monks concerning the following question:

Perl Monks, I seek your wisdom on a question I have been unable to answer for days. I am using the CAM::PDF module to extract the text content of a PDF in Perl.

My Code:
use CAM::PDF;
my $pdfFile = CAM::PDF->new('save.pdf');
my $text = $pdfFile->getPageText(1);
My PDF Table Format:
Date        Value1    Value2
03/31/3016  24,17523  0.00015960
02/29/3016  27,69368  0.000177510
01/31/3016  31,64637  0.00020850
12/31/3015  39,89056  0.00025700
11/30/3015  49,58176  0.000317820
10/31/3016  61,14936  0.00091970
My Result:
Date Value1 Value2 03/31/3016 24,17523 0.000154960 02/29/3016 27,69368 0.000177510 01/31/3016 31,64637 0.000202850 01/31/3016 31,64637 0.000202850 ...$

What my output should be:

Given the run-time input Date, it should print all the dates listed under Date.
Given the run-time input Value1, it should print all the values listed under Value1.

My questions:

How do I read a table in a PDF using Perl?
How can I use it to display my results as described above?
Or is it even possible to read a table in a PDF with Perl?



Thank you so much in advance.
It would be great if you could help.

Replies are listed 'Best First'.
Re: How to Extract PDF tables using Perl
by LanX (Saint) on May 11, 2016 at 10:57 UTC
    The best advice I can give you is to run pdftohtml -xml and parse the coordinates given in the XML output.

    See also Parsing PDFs by text position?

    The hard work - the heuristic to identify rows and columns - is yours.

    Can't be done by us because we don't know the exact requirements and a Perl module can't be more intelligent than you are. ;-)

    Good luck!

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      See my post here for an example that uses the pdftohtml.exe program LanX is referring to.

      One caveat though: as LanX mentioned in his link, pdftohtml may, under certain circumstances, fail to break a tabular line up into its individual columns. Unfortunately this sort of thing is really dependent on the internal structure, version, content, and layout of the PDF. The perils of using a display format as data...

        Another point is that lines for borders will not be represented by pdftohtml; you have to go by text position only.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!
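
A minimal sketch of the "go by text position" idea, in case it helps. It assumes the XML produced by pdftohtml -xml contains <text> elements carrying top and left attributes, and it groups them into rows by top and orders each row by left. The sample XML, its coordinates, the function name rows_from_pdftohtml_xml and the $tolerance parameter are all made up for illustration. A real document will need its own heuristic, and entities such as &amp; are not decoded here.

```perl
use strict;
use warnings;

# Group <text> elements from `pdftohtml -xml` output into rows by their
# "top" coordinate, then order each row left to right.
# $tolerance (a hypothetical knob) is how far two "top" values may differ
# and still count as the same row.
sub rows_from_pdftohtml_xml {
    my ($xml, $tolerance) = @_;
    $tolerance //= 2;

    my @cells;
    while ($xml =~ m{<text\b([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $body) = ($1, $2);
        my ($top)  = $attrs =~ /\btop="(-?\d+)"/;
        my ($left) = $attrs =~ /\bleft="(-?\d+)"/;
        next unless defined $top && defined $left;
        $body =~ s/<[^>]+>//g;    # drop inline markup such as <b>...</b>
        push @cells, { top => $top, left => $left, text => $body };
    }

    # Walk the cells top to bottom and open a new row whenever the
    # vertical gap to the current row exceeds the tolerance.
    my @rows;
    for my $cell (sort { $a->{top} <=> $b->{top} } @cells) {
        if (@rows && abs($rows[-1]{top} - $cell->{top}) <= $tolerance) {
            push @{ $rows[-1]{cells} }, $cell;
        }
        else {
            push @rows, { top => $cell->{top}, cells => [$cell] };
        }
    }

    my @table;
    for my $row (@rows) {
        my @sorted = sort { $a->{left} <=> $b->{left} } @{ $row->{cells} };
        push @table, [ map { $_->{text} } @sorted ];
    }
    return @table;
}

# Invented sample in the shape of pdftohtml's XML output.
my $sample = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="30" height="12" font="0">Date</text>
<text top="100" left="150" width="40" height="12" font="0">Value1</text>
<text top="100" left="250" width="40" height="12" font="0">Value2</text>
<text top="121" left="150" width="50" height="12" font="0">24,17523</text>
<text top="120" left="50" width="60" height="12" font="0">03/31/3016</text>
<text top="120" left="250" width="60" height="12" font="0">0.00015960</text>
</page>
XML

print join("\t", @$_), "\n" for rows_from_pdftohtml_xml($sample);
```

Clustering on left in the same way would give column boundaries, which is what you need once some cells are empty.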

Re: How to Extract PDF tables using Perl
by morgon (Priest) on May 11, 2016 at 09:46 UTC
    In the general case (maybe your case is simpler), extracting tables from a PDF is a non-trivial task.

    I am not aware of any Perl module that can do it.

    The only software that ever worked for me was http://tabula.technology/ which is a Java program but uses pretty good heuristics to identify and extract tables in PDFs.

      Thank you for your precious time, Morgon.
      I will look into it,
      but my people say it's possible,
      and they have done it.
        But my people say it's possible, and they have done it.

        Ask them how they did it and then do it that way. Problem solved.

        Does anyone know whether there is a Perl module that can be used for extracting a table from a PDF, and how to do it?
Re: How to Extract PDF tables using Perl
by LanX (Saint) on May 11, 2016 at 08:29 UTC
    Wow 3016 .... have to adjust my clock again.

    The quick and dirty way to do this is to

    split /\s+/, $text

    Now loop over the resulting array, like:

    while ( my $col = shift @array ) {
        $date{$col} = [ shift @array, shift @array ];
    }

    Untested!
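
A small sketch of the split approach above, extended to answer the "print everything under a given label" part of the question. It assumes a fixed number of columns, no empty cells, and no whitespace inside a cell. The function name column_values and the shortened sample text are made up for illustration; in practice $text would come from getPageText.

```perl
use strict;
use warnings;

# Split flat page text into a header row plus data rows of equal width,
# then return every value found under the requested column label.
# Assumes no cell is empty and no cell contains whitespace.
sub column_values {
    my ($text, $width, $label) = @_;
    my @tokens = split ' ', $text;    # ' ' also skips leading whitespace
    my @header = splice @tokens, 0, $width;

    my ($index) = grep { $header[$_] eq $label } 0 .. $#header;
    return unless defined $index;     # unknown label: nothing to return

    my @values;
    while (my @row = splice @tokens, 0, $width) {
        push @values, $row[$index];
    }
    return @values;
}

# Shortened sample in the shape of the text shown in the question.
my $text = 'Date Value1 Value2 03/31/3016 24,17523 0.00015960 '
         . '02/29/3016 27,69368 0.000177510';

# Take the label from the command line if one is given.
my $label = @ARGV ? $ARGV[0] : 'Date';
print "$_\n" for column_values($text, 3, $label);
```

This breaks as soon as a cell is missing, because every later token shifts one column to the left.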

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      :D :D yeah 3016

      Thanks for the reply.
      The problem here is that the table is dynamic.

      There may be 3 labels like Date, Value1 and Value2, or 30 labels, or even more.

      Some of the values might be undefined.

      Are there any modules that might help me parse a PDF table?

      So far, CAM::PDF and PDF::API2 do not seem to have a feature for reading a table inside a PDF, only for creating a new one.

      Main problem: the values get mixed together and printed on a single line.
      Some of these values might not be defined (just empty cells),
      and the labels keep changing, so they are not static at all.


      Any advice or ideas on modules, or on how to do it, please?
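
A rough sketch for the dynamic case, working only on the flat text. It relies on one assumption about the data: every data row starts with an MM/DD/YYYY date, so everything before the first date is a label. The function name table_from_flat_text and the sample text are invented for illustration. A short row leaves its trailing labels undefined, but the flat text alone cannot say which cell was really empty, so for that you need coordinates.

```perl
use strict;
use warnings;

# Rebuild rows from flat text when the number of columns is not known.
# Heuristic (an assumption about the data): each data row starts with an
# MM/DD/YYYY date, so every token before the first date is a header label.
sub table_from_flat_text {
    my ($text) = @_;
    my $date_re = qr/^\d\d\/\d\d\/\d\d\d\d\z/;

    my (@header, @rows);
    for my $token (split ' ', $text) {
        if ($token =~ $date_re) {
            push @rows, [$token];           # a date opens a new row
        }
        elsif (@rows) {
            push @{ $rows[-1] }, $token;    # later tokens extend that row
        }
        else {
            push @header, $token;           # no row yet: still in the header
        }
    }

    # Turn each row into a hash keyed by label. A short row leaves its
    # trailing labels undefined.
    my @records;
    for my $row (@rows) {
        my %record;
        @record{@header} = @$row;
        push @records, \%record;
    }
    return (\@header, \@records);
}

# Invented sample: the second row is missing its last value.
my $text = 'Date Value1 Value2 03/31/3016 24,17523 0.00015960 '
         . '02/29/3016 27,69368';

my ($header, $records) = table_from_flat_text($text);
for my $record (@$records) {
    print join("\t", map { defined $_ ? $_ : '(undef)' } @{$record}{@$header}), "\n";
}
```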

        I use Perl, but when trying to do something similar I found that python3 + pdfquery was easier to work with and did the column parsing...

        http://www.markhneedham.com/blog/2015/01/22/pythonpdfquery-scraping-the-fifa-world-player-of-the-year-votes-pdf-into-shape/

        In a nutshell: loop over each page in the PDF and search for a matching string. If it is found, get its x,y coordinates and use them in in_bbox(x,y,x2,y2) to scrape whatever other text is inside that bounding box. Because I wanted a "row", my bbox was x,y,x+500,y+10 (grid origin at bottom left?).

        I don't know how it really works, but I was able to copy/paste enough bits to get what I needed.

        Maybe pdf::api or something has a similar feature to in_bbox? Is it maybe like collision detection logic where, given a bounding box, you find all the text items that collide with it and return an array of those? I'm only guessing here.

        Sorry if this doesn't help.
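
The bounding-box idea described above can be sketched in plain Perl once you have text items with coordinates from some extraction step. The item structure (hashes with x, y and text keys), the coordinates, and the Perl function in_bbox are all hypothetical; this is not an existing module API.

```perl
use strict;
use warnings;

# Return the text items whose anchor point lies inside the box
# ($x1,$y1)-($x2,$y2), ordered left to right.
# Each item is a hash with x, y and text keys (a made-up structure).
sub in_bbox {
    my ($items, $x1, $y1, $x2, $y2) = @_;
    my @inside = grep {
           $_->{x} >= $x1 && $_->{x} <= $x2
        && $_->{y} >= $y1 && $_->{y} <= $y2
    } @$items;
    return sort { $a->{x} <=> $b->{x} } @inside;
}

# Invented items, with the origin assumed to be at the bottom left.
my @items = (
    { x => 50,  y => 700, text => 'Date' },
    { x => 150, y => 700, text => 'Value1' },
    { x => 250, y => 700, text => 'Value2' },
    { x => 250, y => 680, text => '0.00015960' },
    { x => 50,  y => 680, text => '03/31/3016' },
    { x => 150, y => 680, text => '24,17523' },
);

# Find the matching string, then collect the "row" to its right
# using the same x+500, y+10 box as in the post.
my ($hit) = grep { $_->{text} eq '03/31/3016' } @items;
my @row = in_bbox(\@items, $hit->{x}, $hit->{y}, $hit->{x} + 500, $hit->{y} + 10);
print join(' ', map { $_->{text} } @row), "\n";
```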

Re: How to Extract PDF tables using Perl
by ablanke (Monsignor) on May 27, 2016 at 12:58 UTC
    Hi, the solution I've seen is to use:
    $doc->getPageContent($pagenum);
    instead of:
    $doc->getPageText($pagenum);

    But even if the solution sounds simple, there is work for you to do.

    You will have to parse the return value of getPageContent.

    Here is a possible example of page content:

    9.9213 0 Td Content Tj

    The two numbers before the Td tell you the position of the content (in PDF terms, an x/y offset from the start of the current text line).

    UPDATE: This gives you a HashRef of your page:
    $doc->getPageContentTree($pagenum)
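
A rough sketch of parsing such a content string, under several assumptions. In a PDF content stream, Td is the operator that moves the text position and Tj shows a string, which is normally written in parentheses. The sample stream below is invented, and the function name text_positions is made up. The sketch ignores other positioning operators (Tm, TD, T*), TJ arrays, hex strings, and font encodings, so it will only work on simple PDFs.

```perl
use strict;
use warnings;

# Walk a simplified content stream and record where each Tj string starts.
# Td operands are offsets from the start of the current text line, so the
# position is accumulated rather than read directly.
sub text_positions {
    my ($content) = @_;
    my $num   = qr/-?(?:\d+\.?\d*|\.\d+)/;
    my $move  = qr/(?<![\w.-])($num)\s+($num)\s+Td\b/;    # tx ty Td
    my $show  = qr/\(((?:\\.|[^\\()])*)\)\s*Tj\b/;         # (string) Tj
    my $begin = qr/\bBT\b/;                                # start of a text object

    my ($x, $y) = (0, 0);
    my @found;
    while ($content =~ /$move|$show|$begin/g) {
        if (defined $1) {
            $x += $1;
            $y += $2;
        }
        elsif (defined $3) {
            push @found, { x => $x, y => $y, text => $3 };
        }
        else {
            ($x, $y) = (0, 0);    # BT starts again from the origin
        }
    }
    return @found;
}

# Invented sample; with CAM::PDF the string would come from
# $doc->getPageContent($pagenum).
my $content = 'BT /F1 10 Tf 72 700 Td (Date) Tj 100 0 Td (Value1) Tj '
            . '-100 -14 Td (03/31/3016) Tj 100 0 Td (24,17523) Tj ET';

printf "%-12s x=%s y=%s\n", $_->{text}, $_->{x}, $_->{y}
    for text_positions($content);
```

Strings that share a y value can then be treated as one row and ordered by x, as with any other coordinate-based approach.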
      Td means table data, like in HTML, and "Tj" is the cell's text???

      The PDF you are parsing seems to have preserved semantic information; I suppose this approach depends on the way it was generated.

      I doubt this is generally true. (?)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        Yes, that's right.

        It always depends on the way the PDF was generated. (Some PDF tools even position every single character.)

        Maybe the getPageContentTree method helps to build a more general solution.

        The example is based on the solution I've seen.
