bmac has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and have been assigned the task of creating a search engine. I have it working with text files; however, the real files are PDFs. How do I read the contents of a PDF file into $string? Thanks

Replies are listed 'Best First'.
Re: PDF Text
by leocharre (Priest) on Jun 12, 2008 at 18:38 UTC

    Check out my PDF::OCR. If you send me any feedback or requests, I tend to revise quickly.

    I also have FileArchiveIndexer, which does exactly what you mention: it indexes the content of PDF files via OCR or embedded text. I would really love to work with someone else to get it to production level.

    It works well and lets you sync several machines on one network to share the indexing work, which you will need if you're doing OCR on 60k docs.

Re: PDF Text
by MidLifeXis (Monsignor) on Jun 12, 2008 at 18:04 UTC

    Do a search on CPAN to see if you find anything useful there. CAM::PDF seems to have a couple of functions that might work.

    Extracting the text from a PDF file into a text file might still be problematic. It will certainly be problematic if the page does not contain text at all, but contains a graphic image of a page instead. You would need to use some sort of OCR solution then.
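
    A minimal sketch of the CPAN route, assuming CAM::PDF is installed and that its numPages() and getPageText() methods are what the "couple of functions" refers to. The helper name pdf_to_string is made up for illustration, and pages that are only scanned images will come back with no text.

```perl
use strict;
use warnings;

# Collect the text of every page of a PDF into one string.
# Assumes CAM::PDF is installed from CPAN; pdf_to_string is a
# hypothetical helper name, not part of the module.
sub pdf_to_string {
    my ($file) = @_;
    require CAM::PDF;    # loaded lazily so the script compiles without it
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";

    my $string = '';
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        $string .= "$text\n" if defined $text;   # image-only pages yield nothing
    }
    return $string;
}

# Usage: perl pdf2string.pl some-file.pdf
if (@ARGV) {
    my $string = pdf_to_string($ARGV[0]);
    print length($string), " characters extracted\n";
}
```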


Re: PDF Text
by marto (Cardinal) on Jun 12, 2008 at 18:10 UTC
Re: PDF Text
by TGI (Parson) on Jun 12, 2008 at 18:45 UTC

    Why write your own when you could use something like SWISH-E or ht://dig?

    TGI says moo

Re: PDF Text
by radiantmatrix (Parson) on Jun 12, 2008 at 20:36 UTC

    Why would you write this at all? There are a number of pre-existing solutions to searching for information inside PDFs; Google's Search Appliances, for example. Most of these solutions allow you to search quickly inside many types of document. It's got to be cheaper to buy an appliance than to spend your time building a search engine... especially since you're new to Perl.

    Searching is harder than it looks: let someone with way more resources than you solve the problem, and just use their solution!

    Ramblings and references
    “A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort.” Herm Albright
    I haven't found a problem yet that can't be solved by a well-placed trebuchet

      Indexing and searching should be attacked as *very* separate problems. For example, in my situation, there's not much out there that will turn a few gigs of raw paper-document scans into a searchable database.
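
      To illustrate the split, here is a toy sketch (not how FileArchiveIndexer does it): one sub builds an inverted index from text that has already been extracted, and a separate sub answers queries against that index. The document names and contents are invented for the example.

```perl
use strict;
use warnings;

# Indexing step: map each lowercased word to the set of documents containing it.
sub build_index {
    my (%docs) = @_;              # document name => extracted text
    my %index;
    while (my ($name, $text) = each %docs) {
        $index{lc $_}{$name} = 1 for $text =~ /(\w+)/g;
    }
    return \%index;
}

# Searching step: names of the documents that contain every query word.
sub search {
    my ($index, @words) = @_;
    my %hits;
    for my $word (map { lc } @words) {
        $hits{$_}++ for keys %{ $index->{$word} || {} };
    }
    my @found = sort grep { $hits{$_} == @words } keys %hits;
    return @found;
}

# Hypothetical documents, standing in for text pulled out of PDFs.
my $index = build_index(
    'a.pdf' => 'Invoice for scanned paper documents',
    'b.pdf' => 'Scanned meeting notes',
);
print join(', ', search($index, 'scanned', 'documents')), "\n";   # prints "a.pdf"
```

      The index can be built once (say, overnight) and stored, while the search side only ever reads it.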

      So my focus is on hacking together the indexing (hence FileArchiveIndexer). The search is iffy, but it's wide open for someone to reach in and work with it.

      I agree completely: searching is hard as all heck, and there are a lot of ways to do it.

      You can't approach a project like this thinking it's just 'indexing and searching PDF files'. It sounds simple, but you'll go ape with the details.. oh boy oh boy :-)

      I wouldn't discourage writing things like these from scratch. I would advise against it if possible, but.. shucks, maybe this hacker will come up with something interesting. Or at least be humbled out of the roll-your-own idea next time!

Re: PDF Text
by hesco (Deacon) on Jun 13, 2008 at 02:24 UTC
    I've not used it, but will underscore the recommendation for swish-e, based on what I've heard about it.

    But to answer your specific question, I use pdftotext to extract the plain text from a compliant PDF file. It's a command-line tool which is distributed with the xpdf reader application in many Linux distributions. It won't work on scanned images (for which that PDF::OCR sounds particularly interesting; I'll have to check that out, ++ and thanks!). But for folks who export editable documents to PDF, it works like a charm (though it is challenged a bit by multi-column content).
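
    A sketch of getting that output into $string from Perl, assuming pdftotext is on the PATH. Passing "-" as the output file makes pdftotext write to standard output; the sub names here are made up for the example.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. The trailing "-" asks it to
# write the extracted text to standard output instead of a file.
sub pdftotext_command {
    my ($file) = @_;
    return ('pdftotext', $file, '-');
}

# Run pdftotext and return everything it prints as one string.
sub pdf_to_string {
    my ($file) = @_;
    my @cmd = pdftotext_command($file);
    # List-form pipe open: the file name never passes through a shell.
    open(my $fh, '-|', @cmd) or die "Cannot run pdftotext: $!\n";
    local $/;                     # slurp mode: read the whole output at once
    my $string = <$fh>;
    close($fh) or die "pdftotext failed on $file\n";
    return $string;
}

# Usage: perl pdf2string.pl some-file.pdf
if (@ARGV) {
    my $string = pdf_to_string($ARGV[0]);
    print length($string), " characters extracted\n";
}
```

    For a scanned, image-only PDF this will return little or nothing, which is a cheap way to detect files that need the OCR route instead.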

    -- Hugh

    if( $lal && $lol ) { $life++; }

      Something really interesting that happened at my office..

      We scan in a lot of documents. Now, the machines *are* able to embed the OCR text in the PDF documents they create. This makes indexing the documents relatively easy.

      BUT guess what! They don't want to use the scanner's OCR tech, because they say it slows down scanning! And, well, for five pages, who cares? But for 200-page documents???

      They have a point.

      So I have my thing run at night.. collect info etc.
      That's why I needed muscle.