Parsing PDFs by text position?
by LanX (Chancellor) on Mar 26, 2010 at 16:33 UTC
LanX has asked for the
wisdom of the Perl Monks concerning the following question:
I'm trying to parse PDFs of account balances.
At the moment I'm piping them through pdftotext -layout to get a text representation that preserves the positions, since the fields are in different columns.
Unfortunately this is getting hairier than I thought, and now I'm wondering whether I'm reinventing a CPAN wheel I can't find ...
So are there modules to parse PDFs (or texts) by clipping positions?
And for texts is there anything to reverse the effect of format?
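To make "reversing format" concrete, here is a minimal sketch using the core unpack function on one line of pdftotext -layout output. The column names, offsets and widths in @columns are hypothetical and would have to be measured on the real statements; the sample line is built with sprintf only so that it matches that made-up layout.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical column layout: [ name, start offset, width ], in characters.
my @columns = (
    [ booked => 0,  6 ],
    [ valued => 7,  6 ],
    [ code   => 14, 4 ],
    [ text   => 19, 20 ],
    [ amount => 39, 12 ],
);

# Build an unpack template such as '@0 A6 @7 A6 ...':
# '@N' jumps to offset N, 'A<w>' reads w characters and strips trailing spaces.
my $template   = join ' ', map { "\@$_->[1] A$_->[2]" } @columns;
my $line_width = max map { $_->[1] + $_->[2] } @columns;

sub parse_line {
    my ($line) = @_;

    # Pad short lines so that '@N' never points past the end of the string.
    my $padded = sprintf '%-*s', $line_width, $line;

    my @values = unpack $template, $padded;
    s/^\s+// for @values;    # 'A' only strips trailing whitespace

    my %row;
    @row{ map { $_->[0] } @columns } = @values;
    return \%row;
}

# Sample line laid out to match the hypothetical columns above.
my $line = sprintf '%-6s %-6s %-4s %-20s%12s',
    '28.12.', '28.12.', '0036', 'Kartenverfueg', '39,75 -';

my $row = parse_line($line);
print "$_: '$row->{$_}'\n" for map { $_->[0] } @columns;
```

This only covers one line at a time; a two-dimensional template would additionally need a rule for which lines belong to which record.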
Actually I have two problems:
a) getting the precise word positions,
since pdftohtml -xml doesn't split the text at every whitespace:
<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>
b) defining two-dimensional scan templates (reversing format)
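For problem a), one rough workaround is to spread the width attribute of a text element evenly over its characters and derive a left offset for each word. The sketch below does that with plain regexes; words_with_positions is a hypothetical helper, and the equal-character-width assumption is only approximately true for proportional fonts, so the results are estimates rather than the real glyph positions. It also assumes the element contains plain text without nested tags or entities.

```perl
use strict;
use warnings;

# Split one <text> element from `pdftohtml -xml` into words with estimated
# positions. ASSUMPTION: every character (including spaces) is equally wide.
sub words_with_positions {
    my ($element) = @_;

    my ($attrs, $content) = $element =~ m{<text\b([^>]*)>(.*?)</text>}s
        or return;
    return unless length $content;

    # Collect attributes such as top, left, width into a hash.
    my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
    my $char_width = $attr{width} / length $content;

    my @words;
    while ($content =~ /(\S+)/g) {
        my $start = $-[1];    # character offset where this word begins
        push @words, {
            text  => $1,
            top   => $attr{top},
            left  => $attr{left} + $start * $char_width,
            width => length($1) * $char_width,
        };
    }
    return @words;
}

# A shortened, ASCII-only variant of the element shown above.
my @words = words_with_positions(
    '<text top="239" left="33" width="491" height="7" font="2">'
  . '28.12. 28.12. 0036 39,75 -</text>'
);
printf "%-8s left=%.1f width=%.1f\n", @{$_}{qw(text left width)} for @words;
```

The estimated left values could then be matched against column ranges from a scan template, with some tolerance to absorb the error of the equal-width assumption.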
I already got pretty far, but I was wondering if there is a recommended way to do it...
other threads about pdf parsing are:
BTW: It's not an OCR issue, I can get all characters ...