PDF decoding in Perl

by Arik123 (Sexton)
on Mar 06, 2017 at 07:17 UTC (#1183735)

Arik123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I have a PDF file which contains a filled-in form. Unfortunately the information (text only) isn't plain ASCII. I need a Perl script to extract the information and process it, but I can't get anything except gibberish. I figured it was compressed somehow, so I used QPDF to make the file more human-readable.

Now there are multiple objects whose content is something like


which seem to be the content of the fields, in some encoding. There are also some objects that look like:

/BaseFont /RCZMJK+TimesNewRoman /DescendantFonts 13 0 R /Encoding /Identity-H /Subtype /Type0 /ToUnicode 93 0 R /Type /Font

while the /ToUnicode entry refers to objects that look like:

93 0 obj
<< /Length 94 0 R >>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<02A8> <05D8>
<02A9> <05D9>
<02B4> <05E4>
<02B8> <05E8>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

I need a Perl script (or a module) that can make sense of all that (to me it looks like Turkish. Hint: I don't speak Turkish) and convert it to UTF-8 or some other encoding that makes sense.

Any help would be appreciated.

Replies are listed 'Best First'.
Re: PDF decoding in Perl
by beech (Parson) on Mar 06, 2017 at 07:25 UTC
Re: PDF decoding in Perl
by vr (Curate) on Mar 06, 2017 at 11:24 UTC

    You won't solve it without consulting the PDF Reference and writing some rather low-level and verbose code. If you are lucky, it is indeed a PDF form, i.e. not text in the page content, and not an XFA form.
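    To give a flavor of the low-level route: if the text did live in the page content, the /ToUnicode CMap quoted in the question would be what translates the font's 2-byte codes into Unicode. The following is only a rough sketch using the bfchar entries from the question. The helper names parse_bfchar and decode_cids are made up for illustration, bfrange sections and multi-character targets are ignored, and extracting and decompressing the stream from the PDF is assumed to have happened already.

```perl
use strict;
use warnings;

# bfchar section copied from object 93 in the question; a real script
# would first have to pull this stream out of the PDF.
my $cmap = <<'END';
4 beginbfchar
<02A8> <05D8>
<02A9> <05D9>
<02B4> <05E4>
<02B8> <05E8>
endbfchar
END

# Hypothetical helper: build a { code => Unicode code point } table from
# every beginbfchar ... endbfchar section. Simplification: only 4-digit
# source and target values are handled, and bfrange is not parsed.
sub parse_bfchar {
    my ($text) = @_;
    my %map;
    while ($text =~ /beginbfchar(.*?)endbfchar/sg) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>/g) {
            $map{ hex $1 } = hex $2;
        }
    }
    return \%map;
}

# Hypothetical helper: translate a hex string of 2-byte codes into a
# Perl character string; unmapped codes become U+FFFD.
sub decode_cids {
    my ($hex, $map) = @_;
    return join '',
        map { chr( $map->{ hex $_ } // 0xFFFD ) }
        $hex =~ /([0-9A-Fa-f]{4})/g;
}

my $map = parse_bfchar($cmap);
binmode STDOUT, ':encoding(UTF-8)';
# Made-up input: codes 02B4 and 02A8, as they might appear in a hex string
print decode_cids('02B402A8', $map), "\n";
```

    With the table from the question, the codes 02B4 and 02A8 map to U+05E4 and U+05D8, both in the Hebrew block.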

    Try CAM::PDF's getFormFieldList method, then getFormField to check every field or to access a field with a known name. The V entry in the field dictionary is its text "value", either in PDFDocEncoding (plain ASCII, for most practical purposes) or in UTF-16BE with a prepended BOM, as in your example (which is Hebrew).
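    The two encodings can be told apart by the leading bytes FE FF. A minimal sketch of that check, using only the core Encode module: the helper name pdf_text_to_string and the sample bytes are made up for illustration, and treating a BOM-less value as Latin-1 is an assumption that only approximates PDFDocEncoding (the two agree on plain ASCII but not on every byte).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes of a PDF text string (such as
# a field's V entry) into a Perl character string.
sub pdf_text_to_string {
    my ($bytes) = @_;
    return '' unless defined $bytes;
    if ($bytes =~ s/\A\xFE\xFF//) {
        # A byte-order mark was found and removed: the rest is UTF-16BE.
        return decode('UTF-16BE', $bytes);
    }
    # Assumption: no BOM means PDFDocEncoding, approximated here with
    # Latin-1, which is only safe for the ASCII range.
    return decode('ISO-8859-1', $bytes);
}

# Made-up sample value: BOM followed by U+05D8 and U+05D9
my $raw  = "\xFE\xFF\x05\xD8\x05\xD9";
my $text = pdf_text_to_string($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

    Decoding to a character string and letting the output layer do the UTF-8 encoding also drops the BOM, which would otherwise be carried along into the output.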

Re: PDF decoding in Perl
by huck (Parson) on Mar 06, 2017 at 07:35 UTC
Re: PDF decoding in Perl
by karlgoethebier (Abbot) on Mar 06, 2017 at 10:36 UTC

    Quick shot: Image::ExifTool?

    «The Crux of the Biscuit is the Apostrophe»

    «Furthermore I consider that Donald Trump must be impeached as soon as possible»

      OMG, no matter how many years pass, Perl never ceases to amaze... incredible module... I had no clue, thanks for pointing it out.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: PDF decoding in Perl
by Arik123 (Sexton) on Mar 08, 2017 at 09:42 UTC

    Thanks, monk! You've been tremendously helpful! I now do something like:

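    use CAM::PDF;
    use Encode "from_to";

    my $pdf = CAM::PDF->new('myfile.pdf');
    for ($pdf->getFormFieldList) {
        # Dig the raw /V (value) entry out of the field dictionary
        my $val = $pdf->getFormField($_)->{value}{value}{V}{value};
        # A leading FE FF byte-order mark means the value is UTF-16BE;
        # from_to converts it in place (the BOM is converted along with it)
        if ($val =~ /^\x{fe}\x{ff}/) { from_to($val, "UTF-16BE", "utf8") }
        print "$_ => $val\n";
    }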

    and it works perfectly!

Re: PDF decoding in Perl
by Arik123 (Sexton) on Mar 06, 2017 at 07:28 UTC

    I tried CAM::PDF. It doesn't do any decompression or decoding.

      I tried CAM::PDF. It doesn't do any decompression or decoding.


      What does it do?
