http://www.perlmonks.org?node_id=1040005

tqisjim has asked for the wisdom of the Perl Monks concerning the following question:

I have had terrific luck with PDF::API2. I rely on it heavily for lots of day-to-day documentation.

I just discovered a problem: unlike other PDF documents, my output cannot be converted to another format, and its text cannot be cut or copied from a PDF viewer. To demonstrate, please feel free to download my resume: http://www.tqis.com/drive/tqis/0B4ZeuWCdyYETNzU3U0lFV3d2TGc/. Then try to open it in something like LibreOffice Writer.

Any ideas? Thanks!

Jim

Replies are listed 'Best First'.
Re: exporting PDF::API2
by CountZero (Bishop) on Jun 20, 2013 at 18:07 UTC
    You forgot to mention Perl in your list of technical skills!

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      That's really funny.

      Here in unemployed Michigan, there's a state agency that performs resume consulting. The resume is about a year old, and I only remembered a couple days ago that the consultant thought perl was a liability on my resume.

      One of the keynotes at YAPC-NA 2013 was a talk entitled "Perl - The Detroit of Scripting Languages". Lots of fun for analogies: in this case, the State of Michigan hates both. ;)

        Here in unemployed Michigan, there's a state agency that performs resume consulting. The resume is about a year old, and I only remembered a couple days ago that the consultant thought perl was a liability on my resume.
        With state agencies giving such nonsensical advice, no wonder there is much unemployment in their state.

        CountZero

Re: exporting PDF::API2
by tqisjim (Beadle) on Jun 20, 2013 at 17:38 UTC
    FWIW, I tried using CAM::PDF:
    use CAM::PDF;
    my $pdf = CAM::PDF->new('Jim Schueler.pdf'); print $pdf->getPageText(1);

    The results are about the same.

    -Jim

      Also got the same results using PDF::API3::Compat::API2.

      I notice that CAM::PDF has a canCopy() method that reports whether the document's permissions allow copying. Have you tried it?

Re: exporting PDF::API2
by pvaldes (Chaplain) on Jun 22, 2013 at 10:28 UTC
    K0V6wNoH.pdf: 407386 bytes, can-print yes, can-modif yes, can-copy yes, can-add yes

    (You can find the script I used on my blog.)

    So the problem here is not that your PDF can't be copied (or opened to read). Losing the ability to select and copy text is a common problem when a PDF is repaired or fixed with some programs.

    In short: a PDF can be optimized by converting it back to PostScript and then using Ghostscript (or ps2pdf) to recreate it. You get a smaller file, but you can lose more modern features and things like text selection (typically in the first pass). What prevents you from selecting text may be, for example, the way the embedded fonts are encoded.

    You can fix this with Adobe software (or probably by messing with the inner structure). Take a look at the documentation section of the CAM-PDF module from Chris Dolan, which has a lot of useful (and better) Perl scripts to analyze your PDF.

      In some PDFs, especially ones created by going from, say, Illustrator to Acrobat through Adobe Distiller, each letter of text gets flattened into a color-filled polygon made of Bézier curves. There is no text in the COS tree of the PDF, just PostScript-style polygon drawing operators. I think OCR is the only way to get the text back in machine-readable form. A wild guess: since it all came from one font in a vector graphics program, you could try to programmatically checksum each polygon and compare it against known checksums of each letter's polygon, identified beforehand by a human. I would look for a library that does this already; implementing it on your own is insane.
Re: exporting PDF::API2
by tqisjim (Beadle) on Jun 21, 2013 at 20:58 UTC
Re: exporting PDF::API2
by Beechbone (Friar) on Jun 25, 2013 at 10:23 UTC

    try:

    $font = $pdf->ttfont($fontfile, -encode => 'udf8', -isocmap => 1, -unicodemap => 1);

    That's what I use to get copy&paste-able PDFs.


    Search, Ask, Know

      Cutting and pasting now works in my PDF file. But C&P did not work in your proposed solution: I had to s/udf/utf/.

      I did a little additional testing: of the flags in your response, only -unicodemap seems to be required. Although the purpose of the utf8 encoding flag is clear, the other two flags do not affect the embedded text.
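
For anyone landing here later, a minimal sketch of the whole round trip with the corrected flag spelling might look like the following. It assumes PDF::API2 is installed and that you supply your own TrueType font; the sub names, the sample string, and the command-line paths are made up for illustration, and -encode is kept only because it appeared in the suggestion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Options handed to ttfont(). Per the testing described above, -unicodemap
# is the flag that seemed to matter; -encode is carried over from the reply.
sub font_options {
    return ( '-encode' => 'utf8', '-unicodemap' => 1 );
}

# Write a one-page PDF whose text is meant to be selectable and copyable.
# Both arguments are caller-supplied paths (hypothetical in this sketch).
sub write_sample_pdf {
    my ( $font_file, $out_file ) = @_;

    require PDF::API2;    # loaded at call time, so the module is only needed here
    my $pdf  = PDF::API2->new();
    my $page = $pdf->page();
    $page->mediabox( 0, 0, 612, 792 );    # US Letter in points

    my $font = $pdf->ttfont( $font_file, font_options() );

    my $text = $page->text();
    $text->font( $font, 12 );
    $text->translate( 72, 720 );          # one inch in from the top left
    $text->text('Copy and paste me');

    $pdf->saveas($out_file);
    return $out_file;
}

# Usage: perl sample.pl /path/to/font.ttf out.pdf
write_sample_pdf(@ARGV) if @ARGV == 2;
```

Open the result in a viewer and try selecting the line of text; if that fails, flipping -unicodemap is the first thing to compare.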

      While digging in the source code, I discovered that PDF::API2 sets -unicodemap by default, although PDF::API3::Compat::API2 does not. Sure enough, when I went back and tested with PDF::API2, the text was embedded. Maybe this *is* a bug in PDF::API3::Compat::API2. Even so, I only encountered it trying to fix a bug I didn't have in the first place. :(
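
A quick way to compare the output of the two modules without opening a viewer is to look for /ToUnicode entries in the file, since, as far as I understand it, that is the map a viewer uses to turn glyphs back into characters. The sketch below is a crude heuristic using only core Perl: it scans the raw bytes, so it will miss anything stored inside compressed object streams, and the sub name is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count font dictionaries and /ToUnicode references in a PDF's raw bytes.
# This is a rough check, not a parser: compressed objects are not seen.
sub scan_pdf_for_unicode_maps {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    # "/Type /Font" but not "/Type /FontDescriptor"
    my $fonts = () = $bytes =~ m{/Type\s*/Font\b}g;
    my $maps  = () = $bytes =~ m{/ToUnicode\b}g;

    return ( $fonts, $maps );
}

# Usage: perl scan.pl some.pdf
if (@ARGV) {
    my ( $fonts, $maps ) = scan_pdf_for_unicode_maps( $ARGV[0] );
    print "font dictionaries: $fonts, /ToUnicode entries: $maps\n";
}
```

A file with fonts but no /ToUnicode entries at all would be consistent with the copy and paste failure described in this thread, though it does not prove it.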