I'm afraid I can't really help you much. However, having messed with stuff like this myself quite a lot, I thought I'd post some background info
on what I think the problem is — FWIW. Maybe someone else can recommend a tool
that does handle this properly (without a lot of extra manual work).
Many modern PDF tools (such as Acrobat Distiller, which created
the PDF in question) use so-called font subsetting
when embedding non-standard fonts into the PDF file. That is, in an
attempt to keep the file size small (and presumably also to make it
harder to extract/steal non-free fonts, etc.), such tools embed only
exactly the glyphs¹ required for a given body of text.
For example, to render the word "Perl" (for a headline in some
special font, say), you only need the four glyphs 'P', 'e',
'r' and 'l', so why embed the whole font? Instead, a new derived
mini-font is embedded containing nothing but those four glyphs. Also
(and this is the problem) a custom encoding vector is
created, which maps the individual character codes to the appropriate
glyph numbers within the embedded subset. In other words, the original
encoding (be it ASCII, UTF-8 or whatever) might be recoded internally as
follows:
P --> 1
e --> 2
r --> 3
l --> 4
So, instead of the ASCII sequence 80,101,114,108, the word 'Perl'
is now internally encoded as 1,2,3,4. This means that whenever the
integer 1 is encountered within the string of text to draw, the procedure
for rendering the glyph 'P' is called. The particular mapping is
essentially arbitrary (it typically depends on which glyphs are
encountered first when processing the text), but that's fine as long
as the glyph subset and the encoding are kept consistent.
(Actually, this is slightly simplified, and the individual techniques
vary somewhat, but this description should suffice to explain the
problem.)
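To make this a bit more concrete, here's a toy Perl sketch (the mapping is made up to mirror the 'Perl' example above, not read from any actual PDF) showing how such a re-encoded string decodes back to text once you have the reverse mapping:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical reverse mapping for the subset font from the
# 'Perl' example above: internal glyph code -> original character.
my %to_char = (1 => 'P', 2 => 'e', 3 => 'r', 4 => 'l');

# The word as it appears inside the content stream: bytes 1,2,3,4.
my $encoded = pack 'C*', 1, 2, 3, 4;

# Decode byte by byte via the reverse map.
print join('', map { $to_char{$_} // '?' } unpack 'C*', $encoded), "\n";   # prints "Perl"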
The issue is that, in order to recover the textual content,
you need additional information, namely the reverse mapping from the
internally used encoding back to the characters being represented.
For this, a lookup table for each font can (optionally) be embedded within
the PDF, mapping the internal encoding to some known/standard encoding
(typically Unicode). For example, in your PDF you'd find tables such as
the following (after having uncompressed it with pdftk²):
/CMapName /F3+0 def
/CMapType 2 def
1 begincodespacerange <01> <37> endcodespacerange
14 beginbfchar
<01> <0425>
<02> <0440>
<03> <0435>
<04> <0449>
<05> <0430>
<06> <0442>
<07> <0438>
<08> <043A>
<09> <0432>
<0a> <043D>
<0b> <044F>
<20> <0020>
<30> <0030>
<37> <0037>
endbfchar
...
(the left column is the internal encoding, the right one the
corresponding Unicode code points)
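Just to sketch how such a table could be put to use, here's a simplified Perl snippet that extracts the bfchar mappings from an uncompressed PDF and decodes a sample code sequence. Caveats: it assumes single-byte codes and 4-digit hex values as in the excerpt above, ignores beginbfrange sections entirely, and naively merges the tables of all fonts into one hash (which would break as soon as two fonts reuse the same internal codes):

#!/usr/bin/perl
use strict;
use warnings;

# Slurp the uncompressed PDF (see footnote 2).
my $file = shift(@ARGV) // '2.u.pdf';
open my $fh, '<', $file or die "can't open $file: $!";
my $pdf = do { local $/; <$fh> };
close $fh;

# Collect <internal code> <unicode> pairs from all beginbfchar sections.
# A real implementation would keep a separate table per font.
my %to_uni;
while ($pdf =~ /beginbfchar(.*?)endbfchar/sg) {
    my $entries = $1;
    while ($entries =~ /<([0-9A-Fa-f]{2})>\s*<([0-9A-Fa-f]{4})>/g) {
        $to_uni{hex $1} = chr hex $2;
    }
}

# Decode a sample sequence of internal codes, e.g. <01><02><03>.
# With the table above, this prints the first three Cyrillic characters.
binmode STDOUT, ':utf8';
print join('', map { $to_uni{$_} // '?' } 0x01, 0x02, 0x03), "\n";

Of course, that's still a long way from proper text extraction: you'd also have to parse the content streams to get at the actual strings, and keep track of which font each string is drawn with. But it should illustrate the principle.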
In case you're interested in the details, you might want to read
chapter 5 of the PDF Reference,
in particular sections 5.9 "Extraction of Text Content" and 5.6 "Composite Fonts".
In practice, this stuff can get pretty complex (for
example with TrueType fonts), which is why most free tools just don't
bother to implement it properly (or at all). In particular, as far as I
can tell from looking at the source, CAM::PDF makes no attempt to
handle this kind of thing...
(P.S. I might play with this some more if I find the time;
for example, I haven't yet investigated what PDF::API2
might have to offer in this respect... If I find a
solution, I'll post an update.)
___
¹ A glyph is the visual representation
/ rendered shape of a certain character; e.g., the glyph for the character
'i' is typically a vertical bar with a dot slightly above it.
² Using the command pdftk 2.pdf output 2.u.pdf uncompress