CAM::PDF didn't extract all pdf's content

Gangabass has asked for the wisdom of the Perl Monks concerning the following question:

Re: CAM::PDF did't extract all pdf's content by almut (Canon) on Jun 28, 2007 at 17:15 UTC
I'm afraid I can't really help you much. However, having messed with stuff like this myself quite a lot, I thought I'd post some background info on what I think the problem is, FWIW. Maybe someone else can recommend a tool that does handle this properly (without a lot of extra manual work).

Many modern PDF tools (such as Acrobat Distiller, which created the PDF in question) use so-called font subsetting techniques when embedding non-standard fonts into the PDF file. I.e., in an attempt to keep the file size small (and presumably also to make it harder to extract/steal non-free fonts, etc.), such tools embed only the glyphs¹ required for a certain body of text. For example, to render the word "Perl" (for a headline in some special font, or some such), you'd just need the four glyphs 'P', 'e', 'r' and 'l', so why embed the whole font? Instead, a new derived mini font is embedded containing nothing but those four glyphs.

Also (and this is the problem) a special custom encoding vector is created, which maps individual character codes to the appropriate glyph numbers within the embedded subset. In other words, the original encoding (be it ASCII, UTF-8 or whatever) might be recoded internally as follows

    P --> 1
    e --> 2
    r --> 3
    l --> 4

So, instead of the ASCII sequence 80,101,114,108, the word 'Perl' is now internally encoded as 1,2,3,4. This means that whenever the integer 1 is encountered within the string of text to draw, the procedure for rendering the glyph 'P' is called. The particular mapping is essentially arbitrary (it typically depends on which letters are encountered first when processing the text), though this is fine as long as glyph subset and encoding are kept consistent. (Actually, this is simplified slightly, and the individual techniques vary somewhat, but this description should suffice to explain the problem.)
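As a rough illustration of the recoding described above, here is a small Perl sketch. It only mimics the "first come, first served" numbering from the 'Perl' example. The function names `build_subset_encoding` and `encode_with_subset` are made up for this post, and real subsetting tools (Distiller included) are certainly more involved than this.

```perl
use strict;
use warnings;

# Assumption: codes are handed out in order of first appearance,
# starting at 1, as in the simplified 'Perl' example above.
sub build_subset_encoding {
    my ($text) = @_;
    my %code_for;
    my $next = 1;
    for my $char (split //, $text) {
        # Only characters not seen before get a new code.
        $code_for{$char} //= $next++;
    }
    return \%code_for;
}

# Recode a string with a previously built table; returns a list of codes.
sub encode_with_subset {
    my ($text, $code_for) = @_;
    return map { $code_for->{$_} } split //, $text;
}

my $code_for = build_subset_encoding('Perl');
print join(',', encode_with_subset('Perl', $code_for)), "\n";    # 1,2,3,4
```

Note that the resulting codes say nothing about which characters they stand for; without the table, 1,2,3,4 is just a sequence of glyph selectors.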
The issue is that, in order to get back at the textual content, you need additional info, i.e. the reverse mapping from the internally used encoding to the characters being represented. For this, a lookup table for each font is (optionally) embedded within the PDF, which maps the internal encoding to some known/standard encoding (typically Unicode). For example, in your PDF you'd find tables such as (after having uncompressed it with pdftk²)

    /CMapName /F3+0 def
    /CMapType 2 def
    1 begincodespacerange
    <01> <37>
    endcodespacerange
    14 beginbfchar
    <01> <0425>
    <02> <0440>
    <03> <0435>
    <04> <0449>
    <05> <0430>
    <06> <0442>
    <07> <0438>
    <08> <043A>
    <09> <0432>
    <0a> <043D>
    <0b> <044F>
    <20> <0020>
    <30> <0030>
    <37> <0037>
    endbfchar
    ...

(the left column is the internal encoding, the right one the Unicode codepoints)

In case you're interested in the details, you might want to read chapter 5 of the PDF Reference, in particular sections 5.9 "Extraction of Text Content" and 5.6 "Composite Fonts".

Actually, in practice this stuff can get pretty complex (for example with TrueType fonts), which is why most free tools just don't care to implement it properly (or at all). In particular, as far as I can tell from looking at the source, `CAM::PDF` is not making any attempt to handle this kind of thing...

(P.S. I might play with this some more, if I should find the time. For example, I haven't yet investigated what `PDF::API2` might have to offer in this respect... In case I should find some solution, I'll post an update.)

___

¹ glyph is the term for the visual representation / rendered shape of a certain character, e.g. the glyph for the character 'i' is typically some vertical bar with a dot slightly above it.

² using the command `pdftk 2.pdf output 2.u.pdf uncompress`
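To make the reverse mapping a bit more concrete, here is a minimal Perl sketch that reads the `beginbfchar` ... `endbfchar` entries from the table quoted above and turns internal codes back into Unicode characters. This is only an illustration under some loud assumptions: the function names `parse_bfchar` and `decode_codes` are made up, only one-byte source codes mapping to a single four-digit codepoint are handled, and `bfrange` sections, codespace ranges and multi-byte codes are ignored. It is not something `CAM::PDF` provides.

```perl
use strict;
use warnings;

# Collect <src> <dst> pairs from every beginbfchar ... endbfchar section.
# Assumption: one-byte source codes, one 4-hex-digit Unicode value each.
sub parse_bfchar {
    my ($cmap) = @_;
    my %unicode_for;
    while ($cmap =~ /beginbfchar(.*?)endbfchar/gs) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]{2})>\s*<([0-9A-Fa-f]{4})>/g) {
            $unicode_for{ hex $1 } = hex $2;
        }
    }
    return \%unicode_for;
}

# Map a list of internal codes to a character string; codes missing
# from the table become U+FFFD (the Unicode replacement character).
sub decode_codes {
    my ($unicode_for, @codes) = @_;
    return join '', map {
        exists $unicode_for->{$_} ? chr($unicode_for->{$_}) : "\x{FFFD}"
    } @codes;
}

# The bfchar section from the table quoted in this post.
my $cmap = <<'END_CMAP';
14 beginbfchar
<01> <0425> <02> <0440> <03> <0435> <04> <0449> <05> <0430>
<06> <0442> <07> <0438> <08> <043A> <09> <0432> <0a> <043D>
<0b> <044F> <20> <0020> <30> <0030> <37> <0037>
endbfchar
END_CMAP

my $table = parse_bfchar($cmap);
binmode STDOUT, ':encoding(UTF-8)';
print decode_codes($table, 0x02, 0x03, 0x08, 0x05), "\n";
```

With the table above, the codes 02, 03, 08, 05 come out as U+0440, U+0435, U+043A, U+0430, i.e. the Cyrillic word "река". The hard part in a real extractor is everything this sketch skips: finding the right table for each font, and tracking which font is active for each string in the content stream.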
Re^2: CAM::PDF did't extract all pdf's content by Gangabass (Vicar) on Jun 28, 2007 at 23:39 UTC
Thanks, you really helped me. Now I know what the problem is and can try to solve it. And of course I'd appreciate it very much if you find a solution and post it here.
Re^3: CAM::PDF did't extract all pdf's content by Anonymous Monk on Jun 29, 2007 at 07:19 UTC
The solution is to buy commercial software ...
Re^4: CAM::PDF did't extract all pdf's content by Anonymous Monk on Jan 29, 2008 at 16:28 UTC
Re^4: CAM::PDF did't extract all pdf's content by Anonymous Monk on Nov 27, 2008 at 08:09 UTC
Re^5: CAM::PDF did't extract all pdf's content by Anonymous Monk on May 29, 2009 at 07:13 UTC
Re: CAM::PDF did't extract all pdf's content by talexb (Chancellor) on Jun 28, 2007 at 15:18 UTC
The short answer is that the characters in this document are what we call 'obfuscated', and you probably won't be able to use any off-the-shelf CPAN module to get the Cyrillic content out of PDFs like this.

However, my employer is in the business of extracting XML from PDFs, and we have the expertise to handle exactly this type of conversion. Please feel free to `/msg` me for more information.

Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

