I've been contemplating the state of the art of manipulating PDFs in Perl. The field is littered with the corpses of CPAN modules that try to make it easy to work with PDFs, but I settled on two as being the most useful: PDF::API2 and CAM::PDF. I welcome anyone's comments pointing out things I've missed or other useful tidbits.

My original motivation was a project in which I needed to input an existing PDF (generated by some unknown method) and prepend a coversheet containing a barcode derived from some metadata (passed in as separate arguments; not from the file itself). The barcodes are so people can fax them back to me and I can route the documents, but that's a different story.

If you like counting pixels and keeping track of text's baseline and things like that, you'll love PDF::API2. It's meant to be a low-level tool, and if you want very fine-grained control of your layouts, it's the tool for you. The best examples I found are

(that's an amazingly short list for such a complicated package, but "lack of examples" seems to be a common complaint). The other tool I've used for building PDFs is wkhtmltopdf, but it's not Perl. If you're not above system calls, though, it's not bad.

As a low-level tool for creating PDFs, PDF::API2 is everything I want. For reading PDFs, my experience is a bit more mixed. There is a known issue with some features of PDF 1.5 and up. That is a problem for my project, because I consume PDFs people make and "please go back and save this as version 1.4" isn't an option.

To manipulate existing PDFs, CAM::PDF works fine. As of version 1.58, it doesn't claim broad support for PDF versions beyond 1.5, but my experience is that it can read any PDF I've thrown at it. It bills itself as a PDF manipulation library, and it can do all the helpful things like rearrange pages, import pages from another document, and even clever tricks like swapping out one image for another. So if you have a document and want to tweak it or learn about it, CAM::PDF is a good choice.

In our particular case, we combined the two. We use PDF::API2 to create a one-page coversheet document, then use CAM::PDF to prepend it to the original. It's early days, and nobody is trying to mess me up with complicated PDFs yet, but so far it seems to be working out nicely.
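
A rough sketch of that two-step flow, for illustration only and not our production code. The sub names, file names, and page layout are made up, and the barcode is replaced by a plain text placeholder. It assumes PDF::API2 and CAM::PDF are both installed.

```perl
use strict;
use warnings;

# Build a one-page coversheet with PDF::API2.
# $label stands in for the barcode; a plain text line is drawn instead.
sub make_coversheet {
    my ($outfile, $label) = @_;
    require PDF::API2;

    my $pdf  = PDF::API2->new();
    my $page = $pdf->page();
    $page->mediabox('Letter');             # 612 x 792 points

    my $font = $pdf->corefont('Helvetica');
    my $text = $page->text();
    $text->font($font, 18);
    $text->translate(72, 700);             # origin is the lower-left corner
    $text->text("Coversheet: $label");

    $pdf->saveas($outfile);
    return $outfile;
}

# Put the coversheet in front of an existing document with CAM::PDF.
sub prepend_coversheet {
    my ($cover_file, $original_file, $outfile) = @_;
    require CAM::PDF;

    my $original = CAM::PDF->new($original_file)
        or die "Cannot read $original_file: $CAM::PDF::errstr\n";
    my $cover = CAM::PDF->new($cover_file)
        or die "Cannot read $cover_file: $CAM::PDF::errstr\n";

    $original->prependPDF($cover);         # cover pages now come first
    $original->cleanoutput($outfile);
    return $outfile;
}
```

Something like make_coversheet('cover.pdf', 'DOC-0001') followed by prepend_coversheet('cover.pdf', 'incoming.pdf', 'routed.pdf') would then produce the combined file. Reading the original with CAM::PDF rather than PDF::API2 is the point of the split, since that is the side that has to cope with PDFs of unknown origin.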

Re: State of the art of PDFs in Perl
by flexvault (Parson) on Oct 27, 2012 at 16:42 UTC


    I use PDF::API2 and agree with your comments. I will try CAM::PDF in the future to see how it can expand my tools.

    I'm guessing you're on Windows, since most versions of *nix have 'pdftohtml', but a tool that I use and install on Windows may help you. It's called 'PDFCreator' and I've used it for years. Available at: http://sourceforge.net/projects/pdfcreator/

    Quote from the site: "PDFCreator easily creates PDFs from any Windows program. Use it like a printer in Word, Excel or any other Windows application. A PDF takes less storage space, and is easier to send with email. Make PDF creator part of your the software suite you have installed on your computer for easy PDF creation."

    I have to agree with the quote. It makes it easy to print a pdf version of your document, and then send it as an attachment to email.

    "Well done is better than well said." - Benjamin Franklin

Re: State of the art of PDFs in Perl
by kcott (Abbot) on Oct 27, 2012 at 17:15 UTC

    While this may be of no use, your statement, "The field is littered with the corpses of CPAN modules ...", struck a chord. A couple of years ago I was tasked with creating a pure Perl solution for creating PDF documents from scratch. After hacking my way through a graveyard of undead (much as you appear to have done) I found PDF::API2::Simple. I managed to achieve my goal with this module.

    The documentation for this module starts off with a claim by the author that he has telepathic powers: "Take note that PDF coordinates are not quite what you're used to.". He then goes on to explain the Cartesian Co-ordinate system with jaw-dropping incredulity.

    -- Ken

      He then goes on to explain the Cartesian Co-ordinate system with jaw-dropping incredulity.

      With all due respect, Cartesian is kinda rare in computers; 95% of GUI toolkits deal with screen coordinates, where 0,0 is the top left corner.

        My post was intended to be somewhat light-hearted with references to corpses, graveyards and undead. My main bone of contention (in the second paragraph) was with the author stating what the reader was used to: the author is in no position to make such a claim and, frankly, anyone who's ever seen a graph of sales/growth/whatever over time will be familiar with the general principle of Cartesian coordinates (even if they don't know that's what they're called). I have no disagreement with your comment about GUI toolkits.

        -- Ken

      Does this really read as "jaw-dropping incredulity" to you:

      Take note that PDF coordinates are not quite what you're used to. The coordinate, (0, 0) for instance is at the lower-left hand corner. Thus, x still grows to the right, but y grows towards the top.

      To me, it reads like a short, simple heads-up that just might save a few people from tearing their hair out.
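
      For anyone used to top-left screen coordinates, the conversion is a one-liner. This is only a sketch: the sub name is made up, and the 792-point page height assumes US Letter.

```perl
use strict;
use warnings;

# Convert a y value measured down from the top edge (screen style)
# into a PDF y value measured up from the bottom edge.
# $page_height is in points; x needs no conversion.
sub top_down_to_pdf_y {
    my ($y_from_top, $page_height) = @_;
    return $page_height - $y_from_top;
}

# One inch (72 points) below the top of a Letter page (792 points high):
printf "%d\n", top_down_to_pdf_y(72, 792);   # prints 720
```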

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong

        Gosh! My comments have elicited far more response than I anticipated. :-)

        I did get the sense that the message being conveyed was: Hey, this is really weird but true - believe it or not, it does this ...

        Being told by the author what I was used to certainly rubbed me up the wrong way; so, I've considered whether that coloured how I read the remainder of the line. Having revisited the text, I can't say that my opinion has been swayed. We'll just have to agree to disagree on that one.

        In closing, I might just point out that I was not denigrating the module as a whole: I did say that I used it to successfully complete the task. My comments (in the second paragraph) only refer to one line in the documentation.

        -- Ken

Re: State of the art of PDFs in Perl
by moritz (Cardinal) on Oct 27, 2012 at 18:49 UTC

    For creating PDF files from scratch, I usually go the route of generating LaTeX files first (Template::Plugin::Latex can be helpful for that), and then run them through pdflatex.

    This is mostly because I don't want to spend too much time on layout questions, and most (all?) of the PDF manipulation CPAN modules force me to do that.

    When the files are dominated by graphics (and not text), sometimes it's easier to generate SVG first. Since SVG is an XML format, tooling support is quite good. Then inkscape (good quality, but slow to start up) or svg2pdf (using cairo as a backend; much faster, usually good enough) can be used to turn the SVG file into PDF.

Re: State of the art of PDFs in Perl
by TGI (Vicar) on Oct 27, 2012 at 20:18 UTC
    If you don't mind shelling out, the pdftk tool kit is pretty darn nifty. It lets you merge, split, rotate and so forth PDFs from the command line.
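
    A minimal sketch of shelling out to pdftk from Perl to concatenate files. The sub names and file names are made up, and it assumes a pdftk binary is on the PATH.

```perl
use strict;
use warnings;

# Build the argument list for "pdftk A.pdf B.pdf cat output OUT.pdf".
sub pdftk_cat_command {
    my ($outfile, @inputs) = @_;
    die "need at least one input file\n" unless @inputs;
    return ('pdftk', @inputs, 'cat', 'output', $outfile);
}

# Run it with the list form of system(), which bypasses the shell,
# so file names need no quoting.
sub pdftk_cat {
    my ($outfile, @inputs) = @_;
    my @cmd = pdftk_cat_command($outfile, @inputs);
    system(@cmd) == 0 or die "pdftk failed (exit status $?)\n";
    return $outfile;
}
```

    Calling pdftk_cat('combined.pdf', 'cover.pdf', 'original.pdf') would write the two inputs, in that order, to combined.pdf.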

    TGI says moo

      Funnily enough, about a year ago I replaced some code (not in perl) that shelled out to pdftk, with code that shelled out to a pair of perl scripts (one PDF::API2 and one Text::PDF::File). Partly because the perl solution was faster. And partly to cut down on the amount of Java crap I had to deal with on the server :)

Re: State of the art of PDFs in Perl
by tyldis (Initiate) on Oct 29, 2012 at 18:24 UTC
    I decided to go with XSL-FO, having Apache FOP do the generation. Perl creates the necessary XML, and SVG fits perfectly inside it without modification. Generation is fast, and we also use the generated XML for other things, which gives us a single uniform data source.

Re: State of the art of PDFs in Perl
by Anonymous Monk on Nov 29, 2012 at 18:13 UTC

    I tried PDF::API2, CAM::PDF, and PDF::Reuse to implement an option for my existing PDF writer software to solve the task of embedding other PDFs into the file that is produced.

    PDF::API2 and PDF::Reuse have problems with newer PDF versions, as stated above in this thread.
    My results with CAM::PDF are good; now I am looking for a modification of this module to embed pages from other documents into pages of the generated document instead of replacing them.

    Besides, PDF::Reuse should be modified to be object oriented, or at least to not export all its functions by default. It relies on internal global variables, which should be taken into account when processing several files. It's well written, though.

    I am also considering Ghostscript; it may help with downgrading newer PDFs, just as it helps with breaking document protection - another problem you might encounter when trying to append external PDFs to files your scripts create.
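
    An untested sketch of what the Ghostscript downgrade could look like from Perl. The sub name is made up, and 'gs' is the usual executable name on *nix (it differs on Windows). Note that Ghostscript regenerates the file rather than editing it in place, so the output should be checked.

```perl
use strict;
use warnings;

# Build a Ghostscript command that rewrites a PDF through the pdfwrite
# device at a given compatibility level (default 1.4).
sub gs_rewrite_command {
    my ($infile, $outfile, $level) = @_;
    $level = '1.4' unless defined $level;
    return (
        'gs', '-q', '-dNOPAUSE', '-dBATCH',
        '-sDEVICE=pdfwrite',
        "-dCompatibilityLevel=$level",
        "-sOutputFile=$outfile",
        $infile,
    );
}
```

    The returned list can be passed straight to system(), for example system(gs_rewrite_command('new.pdf', 'old-style.pdf')) == 0 or die.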

    The PDF writer software I use is authored by myself and pretty lengthy as I basically implemented a browser in Perl, reading HTML templates, creating SQL queries from these, downloading data and images from the database (using HTTP/mod_plsql) and doing all the box layout calculation. PDF generation is finally done with PDFlib.
    (Nowadays, I might favor using XSL-FO instead of doing all the calculation myself - too long a way to go.)

    So post-processing of PDF files may be challenging, or may require commercial software, if you can't accept PDF version limitations.