PerlMonks
I am not aware of a CPAN module that offers something like an extract_table(page => 42, row => 1, column => 3) method. Creating one wouldn't be easy: PDF operators are more like plotter commands drawing on a sheet of paper, so there is no markup like HTML's <TABLE> that defines the table as an embedded object.

Are your PDF files generated automatically, that is to say, in a repeatable fashion? I once managed to extract table-based information from a series of automatically generated PDF files by converting them to PostScript with pdftops (not pdf2ps) and applying some heuristics. Quite a game of chance... but maybe it works for you too?

Same approach, different tool: CAM::PDF comes with a tool, rewritepdf.pl, which can decompress the internal object streams (-d switch). Analysing the decompressed PDF file might give some hints. A typical table ENTRY might be embedded like this:

    40 0 Td      <-- x, y offset (Td: move text position)
    (ENTRY)Tj    <-- ENTRY (Tj: show text)

The Wikipedia entry for PDF links to "Portable Document Format: An Introduction for Programmers", which gives a lightweight introduction and a table of common PDF operators.

Update: argl, it's rewritepdf.pl

In reply to Re: Extracting information from a PDF file
by Perlbotics
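
To make the Td/Tj heuristic from the post a bit more concrete, here is a minimal Perl sketch of how one might pull positioned strings out of an already decompressed content stream and group them into rows. It is an illustration under assumptions, not what the original poster actually ran: extract_entries and group_rows are hypothetical helper names (they are not part of CAM::PDF), the sample stream is made up, and only BT, Td and Tj are understood. Tm, TJ, T*, hex strings, nested parentheses and octal escapes are ignored, so real files will likely need more work.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): scan a decompressed content
# stream for "x y Td" moves and "(text) Tj" show-text operators.
sub extract_entries {
    my ($stream) = @_;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;
    my ($x, $y) = (0, 0);
    my @entries;
    while ($stream =~ /\b(BT)\b|($num)\s+($num)\s+Td\b|\(((?:[^()\\]|\\.)*)\)\s*Tj\b/g) {
        if (defined $1) {        # BT begins a text object: position starts at 0,0
            ($x, $y) = (0, 0);
        }
        elsif (defined $2) {     # Td offsets are relative to the current line start
            $x += $2;
            $y += $3;
        }
        else {                   # Tj: remember the string at the current position
            my $text = $4;
            $text =~ s/\\(.)/$1/gs;   # simplified: only handles \( \) and \\
            push @entries, { x => $x, y => $y, text => $text };
        }
    }
    return @entries;
}

# Group entries that share a y coordinate into rows (top of page first),
# with the cells of each row ordered left to right.
sub group_rows {
    my @entries = @_;
    my %by_y;
    push @{ $by_y{ $_->{y} } }, $_ for @entries;
    my @rows;
    for my $y (sort { $b <=> $a } keys %by_y) {
        my @cells = sort { $a->{x} <=> $b->{x} } @{ $by_y{$y} };
        push @rows, [ map { $_->{text} } @cells ];
    }
    return @rows;
}

# Made-up sample stream, for illustration only.
my $stream = <<'END';
BT
/F1 10 Tf
40 700 Td
(Name) Tj
100 0 Td
(Qty) Tj
-100 -14 Td
(Apple) Tj
100 0 Td
(3) Tj
ET
END

print join(' | ', @$_), "\n" for group_rows(extract_entries($stream));
```

Grouping on exact y values only works when the generator emits identical coordinates for every cell in a row; with real-world files a small tolerance (rounding y before grouping) is probably needed, which is part of the "game of chance" mentioned above.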