http://www.perlmonks.org?node_id=1040266


in reply to exporting PDF::API2

K0V6wNoH.pdf: 407386 bytes, can-print yes, can-modif yes, can-copy yes, can-add yes

(You can find the script I used on my blog.)

So the problem here is not that your PDF can't be copied (or opened for reading). Losing the ability to select and copy text is a common problem when a PDF has been repaired or fixed with certain programs.
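For reference, the four flags in the output above correspond to bits in the PDF permissions integer (the /P entry of the encryption dictionary). The following is a minimal sketch of that decoding, not the script from my blog; the function names are made up for illustration, and it assumes you have already extracted the /P value (or None when the file has no encryption dictionary, in which case nothing is restricted).

```python
# Sketch: decode a PDF permissions integer (/P) into the four flags
# shown in the output above. Bit numbers follow the PDF specification,
# where bit 1 is the lowest-order bit.
PERMISSION_BITS = {
    "can-print": 3,   # print the document
    "can-modif": 4,   # modify the contents
    "can-copy": 5,    # copy or extract text and graphics
    "can-add": 6,     # add or modify annotations, fill in form fields
}

def decode_permissions(p):
    """Return {flag: bool} for a /P value; None means no encryption dictionary."""
    if p is None:
        return {name: True for name in PERMISSION_BITS}
    return {name: bool(p & (1 << (bit - 1))) for name, bit in PERMISSION_BITS.items()}

def format_permissions(flags):
    """Render the flags in the same style as the output line above."""
    return ", ".join(f"{name} {'yes' if ok else 'no'}" for name, ok in flags.items())

print(format_permissions(decode_permissions(None)))
```

A file that reports "yes" for all four flags, like yours, is not restricted by its permission settings, so the cause has to be elsewhere.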

In short: a PDF can be optimized by converting it back to PostScript and then using Ghostscript (or ps2pdf) to recreate it. You get a smaller file, but you can lose more modern features and things like text selection (typically in the first pass). The reason you can't select text may lie, for example, in how the fonts used are encoded.
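The round trip described above can be sketched as two commands, using the pdf2ps and ps2pdf wrapper scripts that ship with Ghostscript. This is only an illustration: the helper names and file names are hypothetical, and it assumes both tools are on your PATH. Keep the original file, since this is exactly the kind of pass that can cost you text selection.

```python
import shutil
import subprocess

def roundtrip_commands(src_pdf, tmp_ps, dst_pdf):
    """Build the two command lines: PDF -> PostScript -> PDF."""
    return [
        ["pdf2ps", src_pdf, tmp_ps],   # convert the PDF back to PostScript
        ["ps2pdf", tmp_ps, dst_pdf],   # recreate a PDF from the PostScript
    ]

def run_roundtrip(src_pdf, tmp_ps, dst_pdf):
    """Run both steps; raises if a tool is missing or exits non-zero."""
    for cmd in roundtrip_commands(src_pdf, tmp_ps, dst_pdf):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling run_roundtrip("in.pdf", "tmp.ps", "out.pdf") would write the recreated file next to the original, so you can compare sizes and check whether text is still selectable.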

You can fix this with Adobe software (or probably by messing with the internal structure). Take a look at the documentation of the CAM-PDF distribution by Chris Dolan, which comes with a lot of useful (and better) Perl scripts to analyze your PDF.

Re^2: exporting PDF::API2
by bulk88 (Priest) on Jun 23, 2013 at 03:25 UTC
    In some PDFs, especially ones created by going from, say, Illustrator to Acrobat through Adobe Distiller, each letter of text gets flattened into a color-filled polygon made of Bézier curves. There is no text in the COS tree of the PDF, just PostScript-style polygon drawing operators. I think OCR is the only way to get back the computer-readable meaning of the text. A WAG: since it all came from one font in a vector graphics program, you could try to programmatically checksum each polygon against a known checksum of the polygon for each letter, which a human has identified beforehand. I would look for a library that does this already; implementing it on your own is insane.
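
To make the checksum idea concrete, here is a rough sketch under heavy assumptions: the outlines have already been extracted from the page as lists of (x, y) points, every glyph is drawn at the same scale with the same point order, and a human has labelled one sample of each letter. The function names and the toy coordinates are invented for illustration; real glyph outlines with curves and rounding noise would need something far more tolerant, which is why an existing library or OCR is the saner route.

```python
import hashlib

def fingerprint(points, precision=2):
    """Hash an outline after moving its bounding-box corner to the origin."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    # Translate and round so the same shape at another position hashes the same.
    normalized = [
        (round(float(x) - min_x, precision), round(float(y) - min_y, precision))
        for x, y in points
    ]
    return hashlib.sha1(repr(normalized).encode("ascii")).hexdigest()

def identify(outlines, known):
    """Map each outline to a letter, or '?' when the fingerprint is unknown."""
    return "".join(known.get(fingerprint(o), "?") for o in outlines)

# Toy samples that a human has identified once (made-up coordinates).
known = {
    fingerprint([(0, 0), (1, 0), (1, 3), (0, 3)]): "I",
    fingerprint([(0, 0), (2, 0), (2, 1), (1, 1), (1, 3), (0, 3)]): "L",
}

# The same shapes found elsewhere on the page, plus one unknown triangle.
page = [
    [(10, 5), (11, 5), (11, 8), (10, 8)],
    [(12.5, 5), (14.5, 5), (14.5, 6), (13.5, 6), (13.5, 8), (12.5, 8)],
    [(20, 5), (22, 5), (21, 7)],
]
print(identify(page, known))
```

Only translation is normalized here; a different scale, rotation, or starting point on the outline produces a different checksum, so this only works when all glyphs really come from one font at one size.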