
Parsing PDFs by text position?

by LanX (Bishop)
on Mar 26, 2010 at 16:33 UTC (#831190=perlquestion)
LanX has asked for the wisdom of the Perl Monks concerning the following question:


I'm trying to parse PDFs of account balances.

At the moment I'm piping them through pdftotext -layout to get a text representation that preserves the positions, since the fields are in different columns.

Unfortunately this is getting hairier than I expected, and now I'm wondering whether I'm reinventing a CPAN wheel I can't find.

So are there modules to parse PDFs (or text) by clipping positions?

And for text, is there anything that reverses the effect of format?
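For fixed-width text, the built-in unpack with "A" fields comes fairly close to reversing format. The following is only a minimal sketch: the column widths, the field order and the sample line are assumptions invented for illustration, not taken from a real statement.

```perl
use strict;
use warnings;

# Assumed column layout (widths invented for illustration):
# booking date, value date, code, description, amount, sign.
my $template = 'A7 A7 A5 A20 A10 x A1';

# Build a sample line the way a fixed-width report would print it.
# 'Kartenverfueg' is an ASCII stand-in so byte and character widths agree.
my $line = sprintf '%-7s%-7s%-5s%-20s%10s %1s',
    '28.12.', '28.12.', '0036', 'Kartenverfueg', '39,75', '-';

# "A" fields drop trailing spaces; right-aligned fields still carry
# their leading padding, so strip that separately.
my @fields = unpack $template, $line;
s/^\s+// for @fields;

print join('|', @fields), "\n";
```

With real data the input would have to be decoded to characters first, otherwise multi-byte characters such as umlauts shift every column after them.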

Cheers Rolf


Actually I have two problems:

a) getting the precise word positions, since pdftohtml -xml doesn't split the text at every run of whitespace:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12.    0036 Kartenverfüg                                                  39,75 -</text>

b) defining two-dimensional scan templates (reversing format)

I've already got pretty far, but I was wondering whether there is a recommended way to do it.
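For problem a), one possible workaround is to estimate each word's left edge from the attributes of the enclosing element. This is a rough sketch, and it rests on a loud assumption: the font is monospaced, so every character takes width / length pixels. The regex parsing is only meant for this single quoted line; whole files would be better served by a real XML parser.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# The <text> element from the pdftohtml -xml output quoted above.
my $xml = '<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12.    0036 Kartenverfüg                                                  39,75 -</text>';

# Pull out the attribute string and the text content.
my ($attrs, $content) = $xml =~ m{<text\s+([^>]*)>(.*?)</text>}s
    or die "no <text> element found\n";
my %attr = $attrs =~ /(\w+)="([^"]*)"/g;

# Assumption: monospaced font, so every character has the same width.
my $char_width = $attr{width} / length $content;

my @words;
while ($content =~ /(\S+)/g) {
    my $offset = $-[1];    # character offset of the word within the line
    push @words, {
        text => $1,
        left => $attr{left} + $offset * $char_width,
        top  => $attr{top},
    };
}

printf "%-14s left=%6.1f top=%d\n", @{$_}{qw(text left top)} for @words;
```

With a proportional font the estimates drift, so they are only good enough for assigning words to coarse columns.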

Other threads about PDF parsing:

* How to parse PDF

* PDF Parsing

* Re: parse content of PDF file

BTW: it's not an OCR issue; I can extract all the characters.

Replies are listed 'Best First'.
Re: Parsing PDFs by text position?
by djp (Hermit) on Mar 28, 2010 at 11:02 UTC
    > I'm trying to parse PDFs of account balances.

    Where did this crazy requirement come from?

      What does the PDF look like once it is converted to text? If the fields are separated by tabs or conspicuous runs of spaces, you can write them out as an XLS sheet with Spreadsheet::Wright and handle them easily from there.
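      The splitting step suggested here might look like the sketch below. It uses core Perl only and prints a tab-separated row instead of calling a spreadsheet module; the sample line is an illustrative stand-in for pdftotext -layout output, not real data.

```perl
use strict;
use warnings;

# Sample line shaped like pdftotext -layout output
# (ASCII stand-in for the umlaut; spacing is illustrative).
my $line = '28.12. 28.12.    0036 Kartenverfueg        39,75 -';

# Treat a tab or a run of two or more spaces as a column separator.
my @cells = split /\t|[ ]{2,}/, $line;

# Emit one tab-separated row; a spreadsheet writer could take @cells instead.
print join("\t", @cells), "\n";
```

      Note that fields separated by a single space (the two dates, or the amount and its sign) stay glued together, which is where splitting by position rather than by separator becomes necessary.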

Node Type: perlquestion [id://831190]
Approved by marto
Front-paged by Old_Gray_Bear