Why would you write this at all? There are a number of pre-existing solutions to searching for information inside PDFs; Google's Search Appliances, for example. Most of these solutions allow you to search quickly inside many types of document. It's got to be cheaper to buy an appliance than to spend your time building a search engine... especially since you're new to Perl.

Searching is harder than it looks: let someone with way more resources than you solve the problem, and just use their solution!

by leocharre (Priest) on Jun 12, 2008 at 21:09 UTC

    Indexing and searching should be attacked as *very* separate problems. For example, in my situation there's not much out there to turn a few gigs of raw paper document scans into a searchable database.

    So my focus is on hacking together the indexing (hence FileArchiveIndexer). The search is iffy, but it's wide open for someone to reach in and work with it.

    I agree completely: searching is hard as all heck, and there are a lot of ways to do it.

    You can't do a project like this thinking 'indexing and searching PDF files'; you'll go ape with the details. It sounds simple... but oh boy, oh boy :-)

    I wouldn't discourage writing things like these from scratch. I would advise against it if possible, but shucks, maybe this hacker will come up with something interesting. Or at least be humbled out of the roll-your-own idea next time!