http://www.perlmonks.org?node_id=803213


in reply to REGEX or Not to REGEX for many items

One way to do this is with KinoSearch (use the dev branch; it's nicer, and I've seen no show-stopping problems or bugs so far). You pay a set-up cost up front (indexing 100,000 big docs might take anywhere from a minute to half an hour depending on hardware, content, and preprocessing needs), but once the index is built, adding documents to it is quite cheap and look-ups are blazing fast.

You can construct phrase (in quotes) or term queries much like Google or other search engines (KS is based on Lucene).

It's not entirely easy to grok at first and the docs are incomplete, but the author is a monk (creamygoodness) who is very smart, friendly, and helpful on the KS list, which also has a decent archive to look through.
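
To make the "index once, query many times" idea concrete, here is a toy sketch in core Perl. It is not KinoSearch's API or implementation; the sub names (`add_doc`, `term_query`, `phrase_query`), the log file names, and the sample lines are all made up for illustration. It only shows why look-ups are cheap once the index exists, and how a list of search strings (quoted ones treated as phrases) can be run against it.

```perl
use strict;
use warnings;

# Toy positional inverted index: term => { doc_id => [positions] }.
# Illustration only; this is NOT how KinoSearch is called or built.
my %index;

# Lowercase and split on non-word characters (a stand-in for a real analyzer).
sub tokenize {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

# The set-up cost: record every term position once per document.
sub add_doc {
    my ($doc_id, $text) = @_;
    my $pos = 0;
    push @{ $index{$_}{$doc_id} }, $pos++ for tokenize($text);
    return;
}

# Positions of a term in one document (empty list if absent).
sub positions {
    my ($term, $doc_id) = @_;
    my $postings = $index{$term} or return;
    return @{ $postings->{$doc_id} || [] };
}

# Term query: ids of documents containing the word.
sub term_query {
    my ($term) = @_;
    my @ids = sort keys %{ $index{ lc $term } || {} };
    return @ids;
}

# Phrase query: ids of documents where the words appear consecutively.
sub phrase_query {
    my ($phrase) = @_;
    my ($first, @rest) = tokenize($phrase);
    return unless defined $first;
    my @hits;
    for my $doc_id (term_query($first)) {
        my $found = 0;
        START: for my $start (positions($first, $doc_id)) {
            for my $i (0 .. $#rest) {
                # The next word must sit exactly $i + 1 slots after $start.
                next START
                    unless grep { $_ == $start + $i + 1 }
                           positions($rest[$i], $doc_id);
            }
            $found = 1;
            last START;
        }
        push @hits, $doc_id if $found;
    }
    return @hits;
}

# Hypothetical log lines, one "document" per log file.
add_doc('app.log',  'Connection refused by remote host');
add_doc('auth.log', 'Failed password for root from remote host');
add_doc('cron.log', 'Job finished; host refused nothing');

# Run a whole list of search strings; quoted strings are phrases.
for my $q ('refused', '"remote host"', '"host refused"', 'timeout') {
    my @docs = $q =~ /^"(.*)"$/ ? phrase_query($1) : term_query($q);
    printf "%-16s => %s\n", $q, @docs ? "@docs" : '(no match)';
}
```

Each look-up is a hash access plus a little position checking, so looping over many search strings stays cheap no matter how long it took to build the index. A real engine adds stemming, scoring, and on-disk storage on top of the same basic structure.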

Re^2: REGEX or Not to REGEX for many items
by r1n0 (Beadle) on Oct 26, 2009 at 12:23 UTC
    sflitman/Your Mother,

    Thank you very much for your responses. I have used KinoSearch in the past to build an index and a query engine against that index for another project. I like the idea, but this will still require running a looped lookup routine, correct? Maybe the method I want to use doesn't exist within Perl, and I need to use a DB with triggers or something. I will give the KinoSearch idea a try. I am using this to go through log files, which is cool. The end goal was to run the entire "search string list" against each log file as it is pulled into the system, but given the info you supplied, I will just wait until all logs are brought in and go against them all at once. This will change my thinking but should work fine.

    I have never used KinoSearch to index Word files. I like that idea, too. Is there a site that explains how to index all kinds of files with KinoSearch? I guess lots of tools are required, depending on the various file types that need to be converted to text. Actually, I am now wondering if there is a Perl module that would help with converting all kinds of file types to text for KinoSearch ingestion: something that could take PDF, Word, PowerPoint, Excel, OCR'd graphics, etc., and turn them into text for KinoSearch indexing. Now that would be really cool. Does anyone know of such a module/tool/project?

    Thanks again for the info.