LanX has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to parse PDFs of account balances.
At the moment I'm piping them through pdftotext -layout to get a text representation that preserves the positions, since the fields are in different columns.
Unfortunately this is getting hairier than I thought, and now I'm wondering if I'm reinventing a CPAN wheel I can't find ...
So are there modules to parse PDFs (or text) by clipping positions?
And for text, is there anything to reverse the effect of format?
Cheers Rolf
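To make "reversing format" concrete, here is a minimal sketch of cutting a layout-preserved text line into fields by character position. It is in Python purely for illustration; the field names and offsets in TEMPLATE are hypothetical and would have to be measured from the real pdftotext -layout output.

```python
# Hypothetical column template: field name -> (start, end) character offsets.
# These offsets are invented for illustration, not taken from a real statement.
TEMPLATE = {
    "booking_date": (0, 7),
    "value_date": (7, 14),
    "code": (14, 20),
    "description": (20, 45),
    "amount": (45, 60),
}


def parse_line(line, template=TEMPLATE):
    """Cut one fixed-width text line into named fields by character position."""
    return {
        name: line[start:end].strip()
        for name, (start, end) in template.items()
    }


# Build a sample line that matches the hypothetical template widths
# (7, 7, 6, 25 and 15 characters), left-aligned like a simple format picture.
sample = "{:<7}{:<7}{:<6}{:<25}{:<15}".format(
    "28.12.", "28.12.", "0036", "Kartenverfüg", "39,75 -"
)

print(parse_line(sample))
```

This only handles one line at a time; a two-dimensional template would additionally need to decide which lines belong to which record.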
Actually I have two problems:
a) getting the precise word positions,
since pdftohtml -xml doesn't split the text at every whitespace:
<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>
b) defining two-dimensional scan templates (reversing format)
I already got pretty far, but I was wondering if there is a recommended way to do it...
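For problem a), one rough workaround is to estimate each word's left edge from the attributes of the enclosing element. The sketch below (Python, for illustration) assumes every character in the run has the same width, which is only an approximation for proportional fonts; estimate_word_positions is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def estimate_word_positions(text_xml):
    """Return (word, estimated_left) pairs for one <text> element.

    ASSUMPTION: all characters are equally wide (width / number of chars).
    This is a crude estimate and will drift with proportional fonts.
    """
    elem = ET.fromstring(text_xml)
    # itertext() also collects text nested in child tags such as <b> or <i>.
    content = "".join(elem.itertext())
    if not content:
        return []
    left = int(elem.get("left"))
    width = int(elem.get("width"))
    char_width = width / len(content)

    words = []
    pos = 0  # character offset of the current word within the run
    for word in content.split(" "):
        if word:
            words.append((word, round(left + pos * char_width)))
        pos += len(word) + 1  # +1 for the space that was split on
    return words


sample = (
    '<text top="239" left="33" width="491" height="7" font="2">'
    "28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>"
)

for word, x in estimate_word_positions(sample):
    print(word, x)
```

The estimated positions could then be matched against column boundaries, with some tolerance to absorb the error of the equal-width assumption.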
Other threads about PDF parsing are:
* Re: parse content of PDF file
BTW: It's not an OCR issue, I can get all characters ...
Replies are listed 'Best First'.
Re: Parsing PDFs by text position?
by djp (Hermit) on Mar 28, 2010 at 11:02 UTC
by deep3101 (Acolyte) on Jun 01, 2011 at 02:05 UTC