http://www.perlmonks.org?node_id=831190

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to parse PDFs of account balances.

At the moment I'm piping them through pdftotext -layout to get a text representation that preserves the positions, since the fields are in different columns.

Unfortunately this is getting hairier than I expected, and now I'm wondering whether I'm reinventing a wheel that already exists on CPAN but that I can't find.

So are there modules to parse PDFs (or texts) by clipping positions?

And for plain text, is there anything that reverses the effect of format?
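For fixed-width text, one core-Perl way to undo format-style output is unpack with an "A" template. The following is only a minimal sketch: the column widths, the helper name parse_fixed and the sample line (with the umlaut transliterated to ASCII) are assumptions for illustration, not taken from a real statement.

```perl
use strict;
use warnings;

# Split one fixed-width line into fields using an unpack template.
# "A<n>" takes n characters and strips trailing spaces; leading spaces
# (from right-aligned fields) are trimmed afterwards.
sub parse_fixed {
    my ($template, $line) = @_;
    my @fields = unpack $template, $line;
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

# Hypothetical layout: booking date, value date, code, description, amount.
my $template = 'A7 A10 A5 A20 A10';

# Build a sample line the way a fixed-width report might have produced it.
my $line = sprintf '%-7s%-10s%-5s%-20s%10s',
    '28.12.', '28.12.', '0036', 'Kartenverfueg', '39,75 -';

print join(' | ', parse_fixed($template, $line)), "\n";
```

This only works when every line shares the same column boundaries; a line shorter than the template simply yields empty trailing fields.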

Cheers Rolf

UPDATE:

Actually I have two problems:

a) getting the precise word positions,

since pdftohtml -xml doesn't split the text at every whitespace gap:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12.    0036 Kartenverfüg                                                  39,75 -</text>

b) defining two-dimensional scan templates (reversing format)

I already got pretty far, but I was wondering if there is a recommended way to do it...
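For problem a), one rough way to recover word positions from a single <text> element is to spread its width evenly over its characters. This is only a sketch under a loud assumption: the font is monospaced, so each character is width/length wide. The helper name word_positions and the sample values are made up for illustration; the regex-based attribute parsing is a shortcut that ignores entities and nested tags, where a real XML parser would be more robust.

```perl
use strict;
use warnings;

# Estimate the left coordinate of each word inside one <text> element.
# ASSUMPTION: monospaced font, so every character has the same width.
# The input should already be decoded to characters, otherwise length()
# counts bytes and non-ASCII text skews the estimate.
sub word_positions {
    my ($xml_line) = @_;
    my ($attrs, $text) = $xml_line =~ m{<text\b([^>]*)>(.*?)</text>}s
        or return;
    my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
    my $len  = length $text or return;
    my $char_width = $attr{width} / $len;

    my @words;
    while ($text =~ /(\S+)/g) {
        # $-[1] is the character offset where the captured word starts.
        push @words, { word => $1, left => $attr{left} + $-[1] * $char_width };
    }
    return @words;
}

# Made-up element: 14 characters across a width of 140 units.
my $sample = '<text top="1" left="30" width="140" height="7" font="2">'
           . 'ab  cd   39,75</text>';
printf "%-6s left=%g\n", $_->{word}, $_->{left} for word_positions($sample);
```

With a proportional font the estimates drift, so they are better suited to assigning words to column ranges than to exact clipping.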

other threads about pdf parsing are:

* How to parse PDF

* PDF Parsing

* Re: parse content of PDF file

BTW: It's not an OCR issue; I can get all the characters.

Re: Parsing PDFs by text position?
by djp (Hermit) on Mar 28, 2010 at 11:02 UTC
    > I'm trying to parse PDFs of account balances.

    Where did this crazy requirement come from?

      What does the PDF file look like when it is converted to text? If the fields are separated by tabs or conspicuous runs of spaces, you can write them out as an XLS sheet with Spreadsheet::Wright and then handle them easily.
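      A minimal sketch of the "conspicuous spaces" idea, assuming cells are separated by tabs or by two or more spaces while words inside a cell are separated by exactly one space. The helper name split_columns and the sample line are hypothetical; the sample mimics the pdftotext -layout line quoted above, with the umlaut transliterated to ASCII.

```perl
use strict;
use warnings;

# Split a layout-preserving text line into cells and record where each
# cell starts. A cell is a run of words joined by single spaces; a tab or
# two or more spaces ends the cell.
sub split_columns {
    my ($line) = @_;
    my @cells;
    while ($line =~ /(\S+(?: \S+)*)/g) {
        push @cells, { col => $-[1], text => $1 };
    }
    return @cells;
}

my $line = '28.12. 28.12.' . (' ' x 4) . '0036 Kartenverfueg'
         . (' ' x 6) . '39,75 -';
printf "col %2d: %s\n", $_->{col}, $_->{text} for split_columns($line);
```

      On this sample the two dates end up in one cell because only a single space separates them, which is exactly the case where a fixed column template is needed instead. The resulting rows could then be handed to a spreadsheet writer.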