PerlMonks  

Parsing PDFs by text position?

by LanX (Canon)
on Mar 26, 2010 at 16:33 UTC (#831190=perlquestion)
LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to parse PDFs of account balances.

At the moment I'm piping them through pdftotext -layout to get a text representation that preserves the positions, since the fields are in different columns.

Unfortunately this is turning out hairier than I thought, and now I'm wondering whether I'm reinventing a CPAN wheel I can't find ...

So are there modules to parse PDFs (or texts) by clipping positions?

And for texts, is there anything that reverses the effect of format?

Cheers Rolf
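
For the single-line case, unpack with an 'A' template does roughly the reverse of a format picture line: it cuts a line at fixed column widths and drops the trailing padding. The sketch below assumes made-up field names and column widths (they are not measured from a real statement) and an ASCII-only sample line.

```perl
use strict;
use warnings;

# Hypothetical layout: names and widths are assumptions for illustration.
my @fields   = qw(booked valued code text amount);
my $template = 'A7 A7 A5 A20 A*';

sub parse_line {
    my ($line) = @_;
    my %rec;
    # 'A' strips trailing blanks from each field.
    @rec{@fields} = unpack $template, $line;
    # Right-aligned fields still carry leading blanks; remove them.
    s/^\s+// for values %rec;
    return \%rec;
}

# A made-up line in fixed columns matching the template above.
my $line = sprintf '%-7s%-7s%-5s%-20s%10s',
    '28.12.', '28.12.', '0036', 'Kartenverfuegung', '39,75 -';

my $rec = parse_line($line);
print join('|', map { $rec->{$_} } @fields), "\n";
```

This only works when every line uses the same columns; lines that shift by a character or two need a more tolerant approach.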

UPDATE:

Actually I have two problems:

a) to get the precise word positions,

since pdftohtml -xml doesn't split the text at every run of whitespace:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12.    0036 Kartenverfüg                                                  39,75 -</text>

b) defining two-dimensional scan templates (reversing format)
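
For problem a), one rough way to get per-word positions out of such a `<text>` element is to spread its width evenly over its characters. This is only a sketch under two assumptions: the glyphs are evenly spaced (which a proportional font would break), and the element is simple enough to pick apart with a regex (a real XML parser would be more robust). The run of blanks in the sample is approximated as 50 characters, and `word_positions` is a hypothetical helper name.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# The <text> element from the post; \x{fc} is the u-umlaut.
# The long run of blanks is approximated as 50 characters.
my $xml = qq{<text top="239" left="33" width="491" height="7" font="2">}
        . qq{28.12. 28.12.    0036 Kartenverf\x{fc}g}
        . (' ' x 50) . qq{39,75 -</text>};

sub word_positions {
    my ($element) = @_;
    my ($attrs, $text) = $element =~ m{<text\s+([^>]*)>(.*?)</text>}s
        or return;
    my %a   = $attrs =~ /(\w+)="([^"]*)"/g;
    my $len = length($text) or return;

    # Assumption: every character takes the same horizontal space.
    my $px_per_char = $a{width} / $len;

    my @words;
    while ($text =~ /(\S+)/g) {
        my $start = $-[1];    # character offset of this word
        push @words, {
            word => $1,
            top  => $a{top},
            left => sprintf('%.0f', $a{left} + $start * $px_per_char),
        };
    }
    return @words;
}

my @words = word_positions($xml);
printf "%-14s top=%s left=%s\n", @$_{qw(word top left)} for @words;
```

The estimated `left` values are only as good as the even-spacing assumption, so they are better suited to assigning words to columns than to exact measurements.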

I already got pretty far, but I was wondering if there is a recommended way to do it...
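
A two-dimensional scan template, as in b), could be as simple as a table mapping each field to a row, a column and a width inside a block of lines. The sketch below assumes that shape; the field names, offsets and sample block are all invented for illustration, and `scan_block` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical template: field name => [row, column, width], all 0-based.
my %template = (
    booked  => [0,  0,  6],
    valued  => [0,  7,  6],
    code    => [0, 14,  4],
    text    => [0, 19, 20],
    amount  => [0, 40, 10],
    details => [1, 19, 30],
);

sub scan_block {
    my ($tmpl, @lines) = @_;
    my %rec;
    for my $name (keys %$tmpl) {
        my ($row, $col, $width) = @{ $tmpl->{$name} };
        my $line = defined $lines[$row] ? $lines[$row] : '';
        # Pad short lines so substr never reads past the end.
        my $missing = $col + $width - length($line);
        $line .= ' ' x $missing if $missing > 0;
        (my $value = substr($line, $col, $width)) =~ s/^\s+|\s+$//g;
        $rec{$name} = $value;
    }
    return \%rec;
}

# A made-up two-line record laid out to match the template.
my @block = (
    sprintf('%-6s %-6s %-4s %-20s %10s',
        '28.12.', '28.12.', '0036', 'Kartenverfuegung', '39,75 -'),
    (' ' x 19) . 'made-up detail line',
);

my $rec = scan_block(\%template, @block);
print "$_: $rec->{$_}\n" for sort keys %$rec;
```

Keeping the template as data means a second statement layout only needs another table, not another parser.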

Other threads about PDF parsing are:

* How to parse PDF

* PDF Parsing

* Re: parse content of PDF file

BTW: It's not an OCR issue; I can get all the characters ...

Re: Parsing PDFs by text position?
by djp (Hermit) on Mar 28, 2010 at 11:02 UTC
    > I'm trying to parse PDFs of account balances.

    Where did this crazy requirement come from?

      What does the PDF file look like when it is converted to text? If the fields are separated by tabs or conspicuous runs of spaces, you can write it out as an xls sheet with Spreadsheet::Wright and then handle it easily.
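
The splitting step suggested above can be sketched with core Perl alone. The sample line and the `split_fields` helper are made up for illustration, and the rule "a tab or at least two blanks separates fields" is an assumption that will not hold for every layout. Handing the resulting rows to a spreadsheet writer is left out here.

```perl
use strict;
use warnings;

# Split one line of extracted text into fields at tabs or at runs of
# two or more whitespace characters; single blanks inside a field stay.
sub split_fields {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return split /\s*\t\s*|\s{2,}/, $line;
}

# A made-up line: note the single blank between code and description.
my @row = split_fields('28.12.  28.12.  0036 Kartenverfuegung      39,75 -');
print join('|', @row), "\n";
```

Fields separated by only a single blank stay glued together, as the code and description do in this sample, which is where fixed column positions become necessary.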
