How to Extract PDF tables using Perl

perlPsycho has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to Extract PDF tables using Perl by LanX (Saint) on May 11, 2016 at 10:57 UTC
The best advice I can give you is to use `pdftohtml -xml` and to parse the coordinates given in the XML output. See also "Parsing PDFs by text position?". The hard work, the heuristic to identify rows and columns, is yours. It can't be done by us because we don't know the exact requirements, and a Perl module can't be more intelligent than you are. ;-) Good luck! Cheers Rolf (addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
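A minimal sketch of the coordinate approach, under stated assumptions: the sample below imitates the `<text top=… left=…>` elements that `pdftohtml -xml` emits, with invented values; `group_rows` and its `$tolerance` argument are hypothetical names; and a regex stands in for a real XML parser, which would be the safer choice on actual output (nested `<b>` tags, entities).

```perl
use strict;
use warnings;

# Sample shaped like `pdftohtml -xml` output. All values are invented.
my $xml = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Date</text>
<text top="100" left="200" width="50" height="12" font="0">Value1</text>
<text top="101" left="350" width="50" height="12" font="0">Value2</text>
<text top="120" left="350" width="20" height="12" font="0">7</text>
<text top="120" left="50" width="70" height="12" font="0">2016-05-11</text>
<text top="121" left="200" width="20" height="12" font="0">42</text>
</page>
XML

# Collect every text fragment together with its coordinates.
my @cells;
while ( $xml =~ m{<text\s+top="(\d+)"\s+left="(\d+)"[^>]*>(.*?)</text>}g ) {
    push @cells, { top => $1, left => $2, text => $3 };
}

# Hypothetical heuristic: a fragment joins the current row when its
# top is within $tolerance pixels of that row's first fragment.
sub group_rows {
    my ( $tolerance, @frags ) = @_;
    my @rows;
    for my $f ( sort { $a->{top} <=> $b->{top} } @frags ) {
        if ( @rows && $f->{top} - $rows[-1][0]{top} <= $tolerance ) {
            push @{ $rows[-1] }, $f;
        }
        else {
            push @rows, [$f];
        }
    }
    # Within each row, order the fragments left to right.
    return map {
        [ map { $_->{text} } sort { $a->{left} <=> $b->{left} } @$_ ]
    } @rows;
}

my @table = group_rows( 3, @cells );
print join( "\t", @$_ ), "\n" for @table;
```

This only recovers rows. Deciding which column an isolated value belongs to (for example when a cell is empty) needs a second pass that compares `left` values against the header positions, and that part depends on the layout of the PDF at hand.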
Re^2: How to Extract PDF tables using Perl by ateague (Monk) on May 11, 2016 at 14:08 UTC
See my post here for an example that uses the pdftohtml.exe program LanX is referring to. One caveat though: as LanX mentioned in his link, pdftohtml may, under certain circumstances, not break a tabular line up into its individual columns. Unfortunately this sort of thing is really dependent on the internal structure, version, content, and layout of the PDF. The perils of using a display format as data...
Re^3: How to Extract PDF tables using Perl by LanX (Saint) on May 11, 2016 at 15:41 UTC
Another point is that border lines are not represented in the pdftohtml output; you have to go by text position only. Cheers Rolf
Re: How to Extract PDF tables using Perl by morgon (Priest) on May 11, 2016 at 09:46 UTC
In the general case (maybe your case is simpler) extracting tables from PDF is a non-trivial task. I am not aware of any Perl module that can do it. The only software that ever worked for me was http://tabula.technology/, which is a Java program but uses pretty good heuristics to identify and extract tables in PDFs.
Re^2: How to Extract PDF tables using Perl by perlPsycho (Initiate) on May 11, 2016 at 09:54 UTC
Thank you for your precious time, morgon. I will look into it. But my people say it's possible, and they have done it.
Re^3: How to Extract PDF tables using Perl by hippo (Bishop) on May 11, 2016 at 10:20 UTC
"But my People say its possible. And they have done it." Ask them how they did it and then do it that way. Problem solved.
Re^4: How to Extract PDF tables using Perl by MidLifeXis (Monsignor) on May 11, 2016 at 11:07 UTC
Re^5: How to Extract PDF tables using Perl by LanX (Saint) on May 11, 2016 at 11:33 UTC
Re^3: How to Extract PDF tables using Perl by perlPsycho (Initiate) on May 11, 2016 at 10:12 UTC
Does anyone know whether there is a Perl module that can be used for extracting a table from a PDF, and how to do it?
Re: How to Extract PDF tables using Perl by LanX (Saint) on May 11, 2016 at 08:29 UTC
Wow 3016 .... have to adjust my clock again. The quick and dirty way to do this is to `split /\s+/, $text`, then loop over the resulting array, like `while ( my $col = shift @array) { $date{$col} = [ shift @array, shift @array]; }` Untested! Cheers Rolf
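A runnable sketch of that quick and dirty idea, assuming the extracted text is a flat, whitespace-separated run of exactly three fields per row (a date followed by two values). The sample `$text` is invented, and the hash is named `%data` here for readability.

```perl
use strict;
use warnings;

# Hypothetical flattened table text: date, value1, value2, repeated.
my $text = "2016-05-11 10 0 2016-05-12 30 40";

my @array = split /\s+/, $text;

my %data;
# Loop on the array itself rather than on the shifted value, so that
# a key of "0" or an empty string does not end the loop early.
while (@array) {
    my $date = shift @array;
    $data{$date} = [ shift @array, shift @array ];
}

for my $date ( sort keys %data ) {
    print "$date: @{ $data{$date} }\n";
}
```

The sketch breaks as soon as a cell is empty or contains whitespace, because the fixed stride of three then drifts out of step with the real rows.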
Re^2: How to Extract PDF tables using Perl by perlPsycho (Initiate) on May 11, 2016 at 09:39 UTC
:D :D yeah 3016. Thanks for the reply. The problem here is that the table is dynamic: there may be 3 labels, like Date, Value1 and Value2, or there may be 30. Some of them might be undefined. Are there any modules that might help me parse a PDF table? So far, CAM::PDF and PDF::API2 do not seem to have a feature for reading a table inside a PDF, only for creating a new one. Main problem: the values get mixed and printed on a single line. Some of these values might not be defined (just empty), and the labels keep changing, so they are not static at all. Any advice or ideas on modules, or on how to do it, please?
Re^3: How to Extract PDF tables using Perl by Anonymous Monk on May 25, 2016 at 05:54 UTC
I use Perl, but when trying to do something similar I found that python3 + pdfquery seemed to work more easily and did the column parsing: http://www.markhneedham.com/blog/2015/01/22/pythonpdfquery-scraping-the-fifa-world-player-of-the-year-votes-pdf-into-shape/ I guess the nutshell is: loop over each page in the PDF, search for a matching string and, if found, get its x,y coordinates; then use that result with in_bbox(x,y,x2,y2) to scrape whatever other text might be inside this bounding box. Because I wanted a "row", my bbox was x,y,x+500,y+10 (grid origin at bottom left?). I don't know how it really works, but I was able to copy/paste enough bits to get what I needed. Maybe PDF::API2 or something can have a similar in_bbox feature? Is it maybe like collision-detection logic where, given a bounding box, you find all the text objects that collide with it and return an array of those? I'm only guessing; sorry if this doesn't help.
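The bounding-box idea can be sketched in plain Perl once text fragments with coordinates are available from some extraction step. Everything here is an assumption for illustration: the `@fragments` data is invented, the origin is taken to be bottom left, and `in_bbox` is a hypothetical helper written for this sketch, not a function of any Perl PDF module.

```perl
use strict;
use warnings;

# Hypothetical text fragments with page coordinates (origin bottom left).
my @fragments = (
    { x => 50,  y => 700, text => 'Name'  },
    { x => 200, y => 700, text => 'Votes' },
    { x => 300, y => 700, text => 'Rank'  },
    { x => 300, y => 682, text => '7'     },
    { x => 50,  y => 680, text => 'Alice' },
    { x => 200, y => 681, text => '12'    },
    { x => 50,  y => 660, text => 'Bob'   },
    { x => 200, y => 660, text => '9'     },
);

# Return the fragments whose anchor point lies inside the box.
sub in_bbox {
    my ( $x1, $y1, $x2, $y2, @frags ) = @_;
    return grep {
             $_->{x} >= $x1 && $_->{x} <= $x2
          && $_->{y} >= $y1 && $_->{y} <= $y2
    } @frags;
}

# Find the label, then collect everything in a thin box to its right.
my ($label) = grep { $_->{text} eq 'Alice' } @fragments;
my @hits = in_bbox(
    $label->{x},       $label->{y},
    $label->{x} + 500, $label->{y} + 10,
    @fragments,
);

# Order the hits left to right to get the row.
my @row = map { $_->{text} } sort { $a->{x} <=> $b->{x} } @hits;
print "@row\n";
```

The box height (10 units here) plays the same role as a row tolerance: too small and slightly offset cells are missed, too large and the neighbouring row leaks in.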
Re^4: How to Extract PDF tables using Perl by Anonymous Monk on May 25, 2016 at 06:04 UTC
Re^4: How to Extract PDF tables using Perl by perlPsycho (Initiate) on May 27, 2016 at 06:53 UTC
Re: How to Extract PDF tables using Perl by ablanke (Monsignor) on May 27, 2016 at 12:58 UTC
Hi, the solution I've seen is to use `$doc->getPageContent($pagenum);` instead of `$doc->getPageText($pagenum);`. But even if the solution sounds simple, there is work for you to do: you will have to parse the return value of getPageContent. Here is a possible example of page content: `9.9213 0 Td Content Tj`. The two numbers before the Td tell you the position of the content. UPDATE: This gives you a hashref of your page: `$doc->getPageContentTree($pagenum)`
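A hedged sketch of what parsing such a content stream might look like. In the PDF content-stream syntax, `Td` is the operator that moves the text position by an offset relative to the start of the current line, and `Tj` shows a string, which normally appears in parentheses. The sample stream below is invented, and the regex handles only `BT`, `Td` and simple `(…) Tj` strings; real streams also use `Tm`, `TJ`, `T*` and font encodings that this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical content stream, shaped like what getPageContent
# might return for a very simple PDF.
my $content = <<'PDF';
BT
/F1 10 Tf
50 700 Td
(Date) Tj
150 0 Td
(Value1) Tj
-150 -20 Td
(2016-05-11) Tj
150 0 Td
(42) Tj
ET
PDF

my ( $x, $y ) = ( 0, 0 );
my @fragments;
my $num = qr/-?\d*\.?\d+/;

while (
    $content =~ m{
          \b(BT)\b                           # start of a text object
        | ($num) \s+ ($num) \s+ Td\b         # relative move: tx ty Td
        | \( ( (?: \\. | [^\\()] )* ) \) \s* Tj\b   # (string) Tj
    }gx
  )
{
    if ( defined $1 ) {
        ( $x, $y ) = ( 0, 0 );    # BT resets the text position
    }
    elsif ( defined $2 ) {
        $x += $2;                 # Td offsets accumulate
        $y += $3;
    }
    else {
        push @fragments, { x => $x, y => $y, text => $4 };
    }
}

printf "%-12s x=%s y=%s\n", @{$_}{qw(text x y)} for @fragments;
```

Once every string has an absolute position, rows and columns still have to be inferred from the coordinates, so this approach ends up needing the same heuristics as the `pdftohtml -xml` route.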
Re^2: How to Extract PDF tables using Perl by LanX (Saint) on May 27, 2016 at 13:22 UTC
Td means table data like in HTML and "Tj" is the cell's text??? The PDF you are parsing seems to have preserved semantic information; I suppose this approach depends on the way it was generated. I doubt this is generally true. (?) Cheers Rolf
Re^3: How to Extract PDF tables using Perl by ablanke (Monsignor) on May 27, 2016 at 13:38 UTC
Yes, that's right. It always depends on the way the PDF was generated (some PDF tools even position every single character). Maybe the `getPageContentTree` method helps to build a more general solution. The example is based on the solution I've seen.
Re^4: How to Extract PDF tables using Perl by LanX (Saint) on May 27, 2016 at 15:51 UTC