PerlMonks
Parsing PDFs by text position?
by LanX (Saint) on Mar 26, 2010 at 16:33 UTC ( [id://831190]=perlquestion )
LanX has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I'm trying to parse PDFs of account balances. At the moment I'm piping them through pdftotext -layout to get a text representation that respects the positions, since the fields are in different columns. Unfortunately this is getting hairier than I thought, and now I'm wondering if I'm reinventing a CPAN wheel I can't find ... So are there modules to parse PDFs (or texts) by clipping positions? And for texts, is there anything to reverse the effect of format?
Cheers
Rolf

Actually I have two problems:

a) getting the precise word positions, since pdftohtml -xml doesn't split the text at every whitespace:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>

b) defining two-dimensional scan templates (reversing format).

I already got pretty far, but I was wondering if there is a recommended way to do it...

Other threads about PDF parsing are:
* Re: parse content of PDF file

BTW: It's not an OCR issue, I can get all the characters ...
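For the plain-text route, one way to reverse a fixed-width layout in Perl is unpack with an "A" template. This is only a minimal sketch, assuming the columns in the pdftotext -layout output sit at fixed character offsets. The field names, the widths, the parse_line helper and the sample line are all made up for illustration; real widths would have to be measured from the actual output.

```perl
use strict;
use warnings;

# Hypothetical column layout: [ field name, width in characters ].
# These widths are invented; measure the real ones from the text output.
my @columns = (
    [ booked => 7  ],
    [ valued => 7  ],
    [ code   => 5  ],
    [ text   => 20 ],
    [ amount => 10 ],
    [ sign   => 1  ],
);

# Build an unpack template such as "A7 A7 A5 A20 A10 A1".
# "A" extracts a fixed-width field and strips trailing whitespace.
my $template = join ' ', map { "A$_->[1]" } @columns;

# Hypothetical helper: slice one layout line into a hash of named fields.
sub parse_line {
    my ($line) = @_;
    my %record;
    @record{ map { $_->[0] } @columns } = unpack $template, $line;
    for (values %record) {
        $_ = '' unless defined;   # short lines may leave fields unset
        s/^\s+//;                 # right-aligned fields keep leading blanks
    }
    return \%record;
}

# Made-up sample line, laid out with the same widths as @columns.
my $line = sprintf '%-7s%-7s%-5s%-20s%10s%1s',
    '28.12.', '28.12.', '0036', 'Kartenverfueg', '39,75', '-';

my $rec = parse_line($line);
print join('|', map { $rec->{ $_->[0] } } @columns), "\n";
```

The sprintf call plays the role of format here, and unpack undoes it with the same widths. If the columns shift between pages, the widths would probably have to be derived per page rather than hard-coded.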