exporting PDF::API2

tqisjim has asked for the wisdom of the Perl Monks concerning the following question:

I have had terrific luck with PDF::API2. I rely on it heavily for lots of day-to-day documentation.

I just discovered a problem: unlike other PDF documents, my output cannot be converted to another format, and text cannot be cut or copied from it in a PDF viewer. To demonstrate, please feel free to download my resume: http://www.tqis.com/drive/tqis/0B4ZeuWCdyYETNzU3U0lFV3d2TGc/. Then try to open it in something like LibreOffice Writer.

Any ideas? Thanks!

Jim

Re: exporting PDF::API2 by CountZero (Bishop) on Jun 20, 2013 at 18:07 UTC
You forgot to mention Perl in your list of technical skills! CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
Re^2: exporting PDF::API2 by tqisjim (Beadle) on Jun 20, 2013 at 18:12 UTC
That's really funny. Here in unemployed Michigan, there's a state agency that performs resume consulting. The resume is about a year old, and I only remembered a couple of days ago that the consultant thought Perl was a liability on my resume. One of the keynotes at YAPC-NA 2013 was a talk entitled "Perl - The Detroit of Scripting Languages". Lots of fun for analogies: in this case, the State of Michigan hates both. ;)
Re^3: exporting PDF::API2 by CountZero (Bishop) on Jun 20, 2013 at 18:32 UTC
"Here in unemployed Michigan, there's a state agency that performs resume consulting. The resume is about a year old, and I only remembered a couple days ago that the consultant thought perl was a liability on my resume." With state agencies giving such nonsensical advice, no wonder there is much unemployment in their state. CountZero
Re^3: exporting PDF::API2 by karlgoethebier (Abbot) on Jun 20, 2013 at 19:54 UTC
"Lots of fun for analogies..." Motorcity is burning Motorcity MC5 It's Friday, you ain't got no job, and you ain't got shit to do My best regards, Karl «The Crux of the Biscuit is the Apostrophe»
Re: exporting PDF::API2 by tqisjim (Beadle) on Jun 20, 2013 at 17:38 UTC
FWIW, I tried using CAM::PDF: `my $pdf = CAM::PDF->new('Jim Schueler.pdf'); print $pdf->getPageText(1);` The results are about the same. -Jim
Re^2: exporting PDF::API2 by tqisjim (Beadle) on Jun 20, 2013 at 18:08 UTC
Also got the same results using PDF::API3::Compat::API2.
Re^2: exporting PDF::API2 by jakeease (Friar) on Jun 22, 2013 at 07:37 UTC
I notice that CAM::PDF has a `canCopy()` method that sets permission to copy. Have you tried it?
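A minimal sketch of the checks discussed in this subthread, for anyone who wants to reproduce them. It assumes CAM::PDF is installed; the default file name is only a placeholder. As far as I can tell from the CAM::PDF documentation, the `can*()` methods report the permission flags rather than set them, so treat this as a diagnostic only.

```perl
use strict;
use warnings;
use CAM::PDF;

# Placeholder file name; pass the PDF you want to inspect as the first argument.
my $file = @ARGV ? $ARGV[0] : 'Jim Schueler.pdf';

my $pdf = CAM::PDF->new($file)
    or die "Cannot open $file: $CAM::PDF::errstr\n";

# Report the permission flags stored in the document.
for my $check (qw(canPrint canModify canCopy canAdd)) {
    printf "%-10s %s\n", $check, ($pdf->$check() ? 'yes' : 'no');
}

# Try to extract text from every page. Empty output here, despite
# can-copy being 'yes', suggests the problem lies in how the text
# is encoded rather than in the permissions.
for my $n (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($n);
    print "--- page $n ---\n";
    print defined $text && length $text ? $text : '(no text extracted)', "\n";
}
```

If the permissions all come back 'yes' but the extracted text is empty or garbled, the font mapping is the more likely suspect.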
Re: exporting PDF::API2 by pvaldes (Chaplain) on Jun 22, 2013 at 10:28 UTC
`K0V6wNoH.pdf: 407386 bytes, can-print yes, can-modif yes, can-copy yes, can-add yes` (You can find the script I used on my blog.) So the problem here is not that your PDF can't be copied (or opened for reading). Losing the ability to select and copy text is a common problem when a PDF is repaired or fixed with some programs. In short: a PDF can be optimized by converting it back to PostScript and then recreating it with Ghostscript (or ps2pdf). You get a smaller file, but you can lose more modern features and things like text selection (typically in the first pass). What prevents you from selecting text may be, e.g., the font encoding used. You can fix this with Adobe software (or probably by messing with the internal structure). Take a look at the documentation of the CAM-PDF module by Chris Dolan, which comes with a lot of useful (and better) Perl scripts for analyzing your PDF.
Re^2: exporting PDF::API2 by bulk88 (Priest) on Jun 23, 2013 at 03:25 UTC
In some PDFs, especially ones created from, let's say, Illustrator to Acrobat through Adobe Distiller, each letter of text gets flattened into color-filled polygons/Bézier curves. There is no text in the COS tree of the PDF, just PostScript polygon draw operators. I think OCR is the only way to get back the computer-readable meaning of the text. A WAG: since it all came from one font in a vector graphics program, you could try to programmatically checksum each polygon against a known checksum of the polygon of each letter that a human has identified. I would look for a library that does this already; implementing it on your own is insane.
Re: exporting PDF::API2 by tqisjim (Beadle) on Jun 21, 2013 at 20:58 UTC
I'm not sure any of the PDF::API modules have an active maintainer. And I'm not looking for any new projects myself. But if anyone else wants to pursue this, this link on StackOverflow seems like a good starting point: http://stackoverflow.com/questions/12596020/enabling-select-and-copy-of-text-content-in-pdf
Re: exporting PDF::API2 by Beechbone (Friar) on Jun 25, 2013 at 10:23 UTC
Try: `$font = $pdf->ttfont($fontfile, -encode => 'udf8', -isocmap => 1, -unicodemap => 1);` That's what I use to get copy&paste-able PDFs. _Search, Ask, ^Know
Re^2: exporting PDF::API2 by tqisjim (Beadle) on Jul 31, 2013 at 17:50 UTC
Cutting and pasting now works in my PDF file. But C&P did not work with your proposed solution as written: I had to s/udf/utf/. I did a little additional testing: of the flags in your response, only -unicodemap seems to be required. Although the utf8 encoding flag's functionality is clear, the other two flags do not affect embedded text. While digging in the source code, I discovered that PDF::API2 sets -unicodemap by default, although PDF::API3::Compat::API2 does not. Sure enough, when I went back and tested with PDF::API2, the text was embedded. Maybe this is a bug in PDF::API3::Compat::API2. Even so, I only encountered it trying to fix a bug I didn't have in the first place. :(
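For reference, a minimal sketch of the fix described above, using PDF::API2 with the corrected `utf8` encoding and `-unicodemap` set explicitly. The font path and output file name are assumptions for illustration; substitute any TrueType font available on your system.

```perl
use strict;
use warnings;
use PDF::API2;

# Assumed paths, not taken from the thread: adjust both to your system.
my $fontfile = '/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf';
my $outfile  = 'copyable.pdf';

my $pdf  = PDF::API2->new();
my $page = $pdf->page();
$page->mediabox('Letter');

# -unicodemap asks for a ToUnicode map to be written, which is what lets
# viewers map glyphs back to characters for copy and paste. PDF::API2 is
# reported above to set it by default; it is spelled out here so the same
# call also behaves under PDF::API3::Compat::API2.
my $font = $pdf->ttfont($fontfile, -encode => 'utf8', -unicodemap => 1);

my $text = $page->text();
$text->font($font, 12);
$text->translate(72, 720);    # one inch from the left, near the top
$text->text('This line should be selectable and copyable.');

$pdf->saveas($outfile);
```

Opening the result in a viewer and copying the line, or running it through `getPageText` from CAM::PDF, is a quick way to confirm the text is recoverable.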
