Extracting tables from PDF

on Jul 12, 2007

I need to get several tables out of a large PDF document and do some processing on them. The processing is easy enough, but I'm thoroughly stumped on how to access the tables in the first place. Modules and command-line tools for creating PDFs abound. Getting the data back out, not so much.

I've checked CPAN and found the PDF and Text::PDF modules, but both are pretty sparse on documentation. I think one or both may be able to do it, but if the docs are unclear about whether they can do it, then they're even less helpful in figuring out how to do it.

Any suggestions on how I might be able to accomplish this?

Re: Extracting tables from PDF
by aquarium (Curate) on Jul 13, 2007
    The pdf2html SourceForge project may be of use here, as it handles tables properly.
    the hardest line to type correctly is: stty erase ^H
      Thanks for the pointer. The project name is actually pdftohtml, rather than the digit 2, but close enough to find easily.

      Unfortunately, this is still pretty ugly... Tables do end up displaying properly in "complex document" mode, but that's just because it puts every element in a <div> and positions it with style=position:absolute. Whether it's in normal mode or complex document mode, there's nary a <table> tag in sight.

      I also found a message in one of the project forums where the author tells someone else,

      There is no concept of tables in PDF. When you see a table in a PDF file, it's just a bunch of text positioned in particular places and a bunch of lines. There is no simple way to translate tables from PDF to HTML or anything else.
      Granted, the post was from mid-2004, but, unless that's changed, this looks very not-promising.
        That's exactly right; PDF describes how to position elements on the page, but it doesn't have any built-in concept of a 'table'. As a result, any PDF writer is free to do anything from creating its own table command to individually positioning each character in the table in any order, with arbitrary commands in between each table entry.

        The formatting commands above actually mimic the most common PDF code fairly well.
        That wouldn't have changed. PDF is not HTML.
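
Since a table in a PDF is only text placed at particular coordinates, one common workaround is to rebuild the rows from those positions. The following is a minimal Python sketch of that idea, not anything taken from pdftohtml or the CPAN modules mentioned above. It assumes you have already extracted text fragments as (x, y, text) tuples by some other means, that y grows down the page (as with CSS "top"), and that fragments on the same table row have nearly equal y. The function name rows_from_fragments and the y_tolerance parameter are made up for illustration.

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, each ordered left to right.

    Hypothetical helper: fragments whose y differs from the first
    fragment of the current row by at most y_tolerance share that row.
    """
    rows = []  # each entry: (y of first fragment in row, [(x, text), ...])
    # Visit fragments top to bottom, then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells by x and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Fragments deliberately out of order, as a PDF writer might emit them.
fragments = [
    (200, 100, "Qty"),
    (50, 100, "Item"),
    (50, 120.5, "Widget"),
    (200, 120, "3"),
]
table = rows_from_fragments(fragments)
print(table)
```

This only recovers rows. It does not align columns, so an empty cell will shift everything after it to the left, and text that wraps inside a cell will show up as an extra row. A real document would likely need column boundaries inferred from the x values or from the ruling lines.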

