Re: CAM::PDF did't extract all pdf's content
by almut (Canon)
on Jun 28, 2007 at 17:15 UTC
I'm afraid I can't really help you much. However, having messed with stuff like this myself quite a lot, I thought I'd post some background info on what I think the problem is — FWIW. Maybe someone else can recommend a tool that does handle this properly (without a lot of extra manual work).
Many modern PDF tools (such as Acrobat Distiller, which created the PDF in question) use so-called font subsetting techniques when embedding non-standard fonts into the PDF file. I.e., in an attempt to keep the file size small (and presumably also to make it harder to extract/steal non-free fonts, etc.), such tools embed only the glyphs[1] required for a certain body of text.
For example, to render the word "Perl" (for a headline in some special font, or some such), you'd just need the four glyphs 'P', 'e', 'r' and 'l', so why embed the whole font? Instead, a new derived mini font is embedded containing nothing but those four glyphs. Also (and this is the problem) a special custom encoding vector is created, which maps individual character codes to the appropriate glyph numbers within the embedded subset. In other words, the original encoding (be it ASCII, UTF-8 or whatever) might be recoded internally as follows: 'P' (80) becomes 1, 'e' (101) becomes 2, 'r' (114) becomes 3, and 'l' (108) becomes 4.
So, instead of the ASCII sequence 80,101,114,108, the word 'Perl' is now internally encoded as 1,2,3,4. This means that whenever the integer 1 is encountered within the string of text to draw, the procedure for rendering the glyph 'P' is called. The particular mapping is essentially arbitrary (it typically depends on which letters are encountered first when processing the text), though this is fine as long as glyph subset and encoding are kept consistent.
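To make the recoding concrete, here is a minimal Perl sketch of the idea. It assumes the simplest possible scheme (codes handed out in order of first appearance, starting at 1); real PDF producers may number things differently, and the function names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: assign subset codes in order of first appearance,
# starting at 1. This is an assumed scheme, not what any tool must do.
sub build_subset_encoding {
    my ($text) = @_;
    my (%code_for, @glyphs);
    for my $char (split //, $text) {
        next if exists $code_for{$char};    # glyph already in the subset
        push @glyphs, $char;
        $code_for{$char} = scalar @glyphs;  # 1, 2, 3, ...
    }
    return (\%code_for, \@glyphs);
}

# Hypothetical helper: recode a string using the subset encoding.
sub encode_with_subset {
    my ($text, $code_for) = @_;
    return map { $code_for->{$_} } split //, $text;
}

my ($code_for, $glyphs) = build_subset_encoding('Perl');
my @codes = encode_with_subset('Perl', $code_for);

print join(',', map { ord($_) } split //, 'Perl'), "\n";  # 80,101,114,108
print join(',', @codes), "\n";                            # 1,2,3,4
```

Note that the numbers 1,2,3,4 on their own say nothing about which characters they stand for; that information lives only in the %code_for table, which is exactly what a text extractor is missing.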
(Actually, this is simplified slightly, and the individual techniques vary somewhat, but this description should suffice to explain the problem.)
The issue is that, in order to get back at the textual content, you need additional info, i.e. the reverse mapping from the internally used encoding to the characters being represented.
For this, a lookup table for each font is (optionally) embedded within the PDF, which maps the internal encoding to some known/standard encoding (typically Unicode). For example, in your PDF you'd find tables such as (after having uncompressed it with pdftk[2])
(the left column is the internal encoding, the right one the Unicode code points)
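As a rough illustration of how such a table could be used, here is a Perl sketch that parses entries in the beginbfchar/endbfchar form used by ToUnicode CMaps and decodes a sequence of internal codes. The table below is hypothetical (it is made up to match the 'Perl' example, not taken from the PDF in question), the function names are my own, and the parser is deliberately simplified: it handles only single-code-point bfchar entries and ignores bfrange sections.

```perl
use strict;
use warnings;

# Hypothetical ToUnicode-style table for the 'Perl' subset font.
# Illustrative data only, not copied from the PDF in question.
my $cmap_text = <<'END';
4 beginbfchar
<01> <0050>
<02> <0065>
<03> <0072>
<04> <006C>
endbfchar
END

# Build a hash: internal code => Unicode character.
# Simplification: each right-hand value is treated as one code point.
sub parse_bfchar {
    my ($cmap) = @_;
    my %unicode_for;
    while ($cmap =~ /^<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>\s*$/mg) {
        $unicode_for{ hex($1) } = chr(hex($2));
    }
    return \%unicode_for;
}

# Map internal codes back to text; unknown codes become U+FFFD.
sub decode_codes {
    my ($unicode_for, @codes) = @_;
    return join '', map { $unicode_for->{$_} // "\x{FFFD}" } @codes;
}

my $table = parse_bfchar($cmap_text);
print decode_codes($table, 1, 2, 3, 4), "\n";   # Perl
```

If a font has no such table at all, nothing like decode_codes() is possible, and an extractor can at best guess from glyph names or shapes.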
In case you're interested in the details, you might want to read chapter 5 of the PDF Reference, in particular sections 5.9 "Extraction of Text Content" and 5.6 "Composite Fonts".
Actually, in practice this stuff can get pretty complex (for example with TrueType fonts), which is why most free tools just don't care to implement it properly (or at all). In particular, as far as I can tell from looking at the source, CAM::PDF is not making any attempt to handle this kind of thing...
(P.S. I might play with this some more, if I should find the time — for example, I haven't yet investigated what PDF::API2 might have to offer in this respect... In case I should find some solution, I'll post an update.)
[1] glyph is the term for the visual representation / rendered shape of a certain character, e.g. the glyph for the character 'i' is typically some vertical bar with a dot slightly above it.
[2] using the command: pdftk 2.pdf output 2.u.pdf uncompress