
CAM::PDF didn't extract all of the PDF's content

by Gangabass (Vicar)
on Jun 28, 2007 at 03:26 UTC (#623794, perlquestion)
Gangabass has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a number of PDF files that I need to search, so I need to get at the text content of each file. I tried many modules from CPAN, and CAM::PDF looks good to me. But I have a small problem with it: for my PDFs (in Ukrainian), CAM::PDF doesn't return all of the content. The script is very simple (here link to 2.pdf; ATTENTION! Cyrillic charset!):
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new("2.pdf");
my $pageone_tree = $pdf->getPageContentTree(1);

open TEST, ">", "test.txt" or die $!;
print TEST CAM::PDF::PageText->render($pageone_tree);
close TEST;
The output looks like this (the part of the document that wasn't converted to normal text):


I think it's not a CAM::PDF problem (because pdftotext returns the same result), but maybe I'm wrong. Could you suggest a way to fix this?
P.S. I created a simple HTML page where I show in pictures how the data is lost.

Replies are listed 'Best First'.
Re: CAM::PDF didn't extract all of the PDF's content
by almut (Canon) on Jun 28, 2007 at 17:15 UTC

    I'm afraid I can't really help you much. However, having messed with stuff like this myself quite a lot, I thought I'd post some background info on what I think the problem is — FWIW. Maybe someone else can recommend a tool that does handle this properly (without a lot of extra manual work).

    Many modern PDF tools (such as Acrobat Distiller, which created the PDF in question) use so-called font subsetting techniques when embedding non-standard fonts into the PDF file. That is, in an attempt to keep the file size small (and presumably also to make it harder to extract/steal non-free fonts, etc.), such tools embed only exactly the glyphs [1] required for a certain body of text.

    For example, to render the word "Perl" (for a headline in some special font, or some such), you'd just need the four glyphs 'P', 'e', 'r' and 'l', so why embed the whole font? Instead, a new derived mini font is embedded containing nothing but those four glyphs. Also, and this is the problem, a special custom encoding vector is created, which maps individual character codes to the appropriate glyph numbers within the embedded subset. In other words, the original encoding (be it ASCII, UTF-8 or whatever) might be recoded internally as follows:

    P --> 1
    e --> 2
    r --> 3
    l --> 4

    So, instead of the ASCII sequence 80,101,114,108, the word 'Perl' is now internally encoded as 1,2,3,4. This means that whenever the integer 1 is encountered within the string of text to draw, the procedure for rendering the glyph 'P' is called. The particular mapping is essentially arbitrary (it typically depends on which letters are encountered first when processing the text), though this is fine as long as the glyph subset and the encoding are kept consistent.
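    As a rough illustration of the recoding just described, here is a minimal sketch. It assumes the simplest possible scheme (codes handed out in order of first appearance, starting at 1); real tools differ in the details, and the helper names build_subset_encoding and encode_with_subset are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: give each distinct character a small integer code,
# in order of first appearance (an assumed, simplified subsetting scheme).
sub build_subset_encoding {
    my ($text) = @_;
    my %code_for;
    my $next_code = 1;
    for my $char (split //, $text) {
        $code_for{$char} = $next_code++ unless exists $code_for{$char};
    }
    return \%code_for;
}

# Hypothetical helper: recode a string with the table built above.
sub encode_with_subset {
    my ($text, $code_for) = @_;
    return map { $code_for->{$_} } split //, $text;
}

my $code_for = build_subset_encoding('Perl');
my @ascii    = map { ord($_) } split //, 'Perl';
my @internal = encode_with_subset('Perl', $code_for);

print "ASCII:    ", join(',', @ascii),    "\n";   # 80,101,114,108
print "Internal: ", join(',', @internal), "\n";   # 1,2,3,4
```

    A repeated letter reuses its code, so under the same assumptions 'hello' would come out as 1,2,3,3,4.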

    (Actually, this is simplified slightly, and the individual techniques vary somewhat, but this description should suffice to explain the problem.)

    The issue is that, in order to get back at the textual content, you need additional info, i.e. the reverse mapping from the internally used encoding to the characters being represented.
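    To make the point concrete, the sketch below takes the internal codes 1,2,3,4 from the "Perl" example and extracts them twice: once naively (treating the codes as ordinary character codes, which yields unprintable control characters) and once through the reverse mapping. The forward table and the helper name decode_codes are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed forward mapping from the "Perl" example (character => internal code).
my %code_for = (P => 1, e => 2, r => 3, l => 4);

# The reverse mapping (internal code => character) is what text extraction needs.
my %char_for = reverse %code_for;

# Hypothetical helper: map codes back to characters; unknown codes become '?'.
sub decode_codes {
    my ($char_for, @codes) = @_;
    return join '', map { exists $char_for->{$_} ? $char_for->{$_} : '?' } @codes;
}

my @internal = (1, 2, 3, 4);

# Naive extraction: the codes are taken at face value, giving control characters.
my $naive = join '', map { chr($_) } @internal;
print "Naive (hex): ", join(' ', map { sprintf '%02x', ord($_) } split //, $naive), "\n";

# Extraction with the reverse mapping recovers the original text.
print "Decoded:     ", decode_codes(\%char_for, @internal), "\n";
```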

    For this, a lookup table for each font is (optionally) embedded within the PDF, which maps the internal encoding to some known/standard encoding (typically Unicode). For example, in your PDF you'd find tables such as the following (after having uncompressed it with pdftk [2]):

    /CMapName /F3+0 def
    /CMapType 2 def
    1 begincodespacerange
    <01> <37>
    endcodespacerange
    14 beginbfchar
    <01> <0425>
    <02> <0440>
    <03> <0435>
    <04> <0449>
    <05> <0430>
    <06> <0442>
    <07> <0438>
    <08> <043A>
    <09> <0432>
    <0a> <043D>
    <0b> <044F>
    <20> <0020>
    <30> <0030>
    <37> <0037>
    endbfchar
    ...

    (the left column is the internal encoding, the right one the Unicode code points)
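    For what it's worth, here is a minimal sketch of how such a table could be used once you have the uncompressed CMap text in hand. It is not something CAM::PDF does for you: parse_bfchar and decode_string are hypothetical helpers, the sketch assumes one-byte codes and single 16-bit destination values (as in the excerpt above), it ignores beginbfrange sections entirely, and the input byte string is made up rather than taken from 2.pdf.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the <code> <unicode> pairs found between
# beginbfchar and endbfchar into a hash (internal code => code point).
# Assumes each destination is a single 16-bit value; bfrange is not handled.
sub parse_bfchar {
    my ($cmap_text) = @_;
    my %unicode_for;
    while ($cmap_text =~ /beginbfchar(.*?)endbfchar/gs) {
        my $section = $1;
        while ($section =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            $unicode_for{ hex $1 } = hex $2;
        }
    }
    return \%unicode_for;
}

# Hypothetical helper: decode a string of one-byte internal codes.
# Codes missing from the table become U+FFFD (replacement character).
sub decode_string {
    my ($bytes, $unicode_for) = @_;
    return join '', map {
        my $code = ord($_);
        exists $unicode_for->{$code} ? chr($unicode_for->{$code}) : "\x{FFFD}";
    } split //, $bytes;
}

# The bfchar section quoted above, copied verbatim.
my $cmap = <<'END_CMAP';
14 beginbfchar
<01> <0425> <02> <0440> <03> <0435> <04> <0449>
<05> <0430> <06> <0442> <07> <0438> <08> <043A>
<09> <0432> <0a> <043D> <0b> <044F> <20> <0020>
<30> <0030> <37> <0037>
endbfchar
END_CMAP

my $table = parse_bfchar($cmap);

# A made-up string of internal codes, just to exercise the table.
binmode STDOUT, ':encoding(UTF-8)';
print decode_string("\x01\x02\x03\x04\x05\x06\x07\x08", $table), "\n";
```

    In a real document each font has its own table, so the decoding has to switch tables whenever the content stream selects a different font.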

    In case you're interested in the details, you might want to read chapter 5 of the PDF Reference, in particular sections 5.9 "Extraction of Text Content" and 5.6 "Composite Fonts".

    Actually, in practice this stuff can get pretty complex (for example with TrueType fonts), which is why most free tools just don't care to implement it properly (or at all). In particular, as far as I can tell from looking at the source, CAM::PDF is not making any attempt to handle this kind of thing...

    (P.S. I might play with this some more, if I should find the time — for example, I haven't yet investigated what PDF::API2 might have to offer in this respect...  In case I should find some solution, I'll post an update.)


    [1] A glyph is the visual representation / rendered shape of a certain character, e.g. the glyph for the character 'i' is typically some vertical bar with a dot slightly above it.

    [2] Using the command: pdftk 2.pdf output 2.u.pdf uncompress

      You really helped me. Now I know what the problem is and can try to solve it.
      And of course I would appreciate it very much if you find a solution and post it here.
        The solution is to buy commercial software ...
Re: CAM::PDF did't extract all pdf's content
by talexb (Canon) on Jun 28, 2007 at 15:18 UTC

    The short answer is that the characters in this document are what we call 'obfuscated', and you probably won't be able to use any off-the-shelf CPAN module to get the Cyrillic content out of PDFs like this.

    However, my employer is in the business of extracting XML from PDFs, and we have the expertise to handle exactly this type of conversion. Please feel free to /msg me for more information.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
