http://www.perlmonks.org?node_id=803201

r1n0 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,
I am posting this question in hopes of gaining some knowledge from someone who may have experimented with a solution for this at some point in the past. I also hope the responses will help fellow monks when/if they seek this knowledge. I have over 100,000 files to look at, and each is a text file (.txt). What I want to do is reduce a lot of individual scripts, each of which runs its own REGEX against each file for a given phrase, into a single script that does the same thing quickly and efficiently, with something better than a loop through all the items or a huge REGEX with all the items (phrases) crammed one after the other. Ideally, I would like to use a hash of all the phrases being searched. I would like to know if there is a way to use a hash inside of a REGEX and make sure all the punctuation is matched literally (sorta like doing a quotemeta prior to running the regex). I am open to any method anyone has tested and knows will work. Over time, my list of files is going to grow, and so is the list of phrases I will be searching for. I am guessing the phrases will increase into the thousands over time.

A sample phrase could be something like:
--->The cow jumped over the moon
-or-
--->110 Main Street, Huntington, AL, 55555

I sincerely appreciate any recommendations.

Replies are listed 'Best First'.
Re: REGEX or Not to REGEX for many items
by Your Mother (Archbishop) on Oct 26, 2009 at 04:33 UTC

    One way to do this would be to use KinoSearch (use the dev branch; it's nicer, and I've seen no show-stopping problems or bugs so far). You'll pay a set-up cost, but once it's set up (100,000 big docs might take anywhere from a minute to half an hour to index depending on hardware, content, and preprocessing needs), adding documents to the index is quite cheap and look-ups are blazing fast.

    You can construct phrase (in quotes) or term queries much like Google or other search engines (KS is based on Lucene).

    It's not entirely easy to grok at first and the docs are incomplete but the author is a monk (creamygoodness) and very smart, friendly, and helpful on the KS list which has a decent archive to look through too.

      sflitman/Your Mother,

      Thank you very much for your responses. I have used KinoSearch in the past for creating an index and a query engine against that index for another project. I like the idea, but this will still require running a looped lookup routine, correct? Maybe the method I want to use doesn't exist within Perl, and I need to use a DB with triggers or something. I will give the KinoSearch idea a try. I am using this to go through log files, which is cool. The end goal was to run the entire "search string list" against each log file as it is pulled into the system, but given the info you supplied, I will just wait until all logs are brought in and go against them all at once. This will change my thinking but should work fine.

      I have never used KinoSearch to index Word files. I like that idea, too. Is there a site that exists that might tell one how to index all kinds of files with KinoSearch? I guess lots of tools are required based on the various filetypes that need to be converted to text. Actually, I am wondering, now, if there is a perl module that would help with converting all kinds of file types to text for KinoSearch ingestion. Something that could be used to convert PDF, Word, PowerPoint, Excel, OCR Graphics, etc, and turn them into text for KinoSearch indexing. Now that would be really cool. Anyone have any knowledge of such a module/tool/project?

      Thanks again for the info.
Re: REGEX or Not to REGEX for many items
by sflitman (Hermit) on Oct 26, 2009 at 04:29 UTC
    This is a perfect project for KinoSearch; it is very fast and flexible. I have indexed nearly 100,000 Word documents with it, after filtering them using Antiword.

    HTH,
    SSF

Re: REGEX or Not to REGEX for many items
by GrandFather (Saint) on Oct 26, 2009 at 04:16 UTC

    I can't help thinking you should ask Google how to perform this sort of task. Rapid lookup of mega-large databases is their bread and butter.

    Speaking of databases, that's probably a part of the solution. Distil the data into a suitably indexed database, then use that to perform the searches. Maybe if you tell us something of the why we can give somewhat more focused advice. If you've actually tried to solve the problem yourself you might like to tell us what you've tried and where it came unstuck.


    True laziness is hard work
      GrandFather,
      Thank you very much for your response. About asking Google... I have thought of that, but like you said, that is one of their bread-'n-butter items, and I don't think they are just going to hand it over. :-) I promise to keep the monks up on whatever the solution turns out to be. I have lots of log files from various websites/ftp servers, and I am seeking a method to identify certain requests people have made. Some of the information is being looked up against header request data and other info is being looked up against content requested. This can apply to so many other things, too. I have lots of ideas for use, once I come up with something that works well. Shoving stuff into a DB and using triggers has been a thought. Thanks again.
Re: REGEX or Not to REGEX for many items
by Jenda (Abbot) on Oct 26, 2009 at 15:21 UTC

    If you do not want to go with a full-blown search engine, you might want to have a look at Regex::PreSuf. It will let you combine your list of phrases into one fairly efficient regexp. Not sure how the module, and then Perl, will handle thousands of phrases, though.
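
    The idea of folding a phrase list into one regexp can also be sketched with core Perl only. This is not Regex::PreSuf's API, just a minimal illustration: the %phrases table, its labels, and the find_phrases helper are made up for the example. Each phrase goes through quotemeta so punctuation matches literally, and the captured text is a hash key, so it can be looked up directly.

```perl
use strict;
use warnings;

# Hypothetical phrase table: phrase => label (stands in for the real list).
my %phrases = (
    'The cow jumped over the moon'           => 'nursery rhyme',
    '110 Main Street, Huntington, AL, 55555' => 'street address',
);

# Escape every phrase with quotemeta so punctuation is matched literally,
# and put longer phrases first so a short phrase cannot shadow a longer one.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %phrases;
my $any_phrase = qr/($alternation)/;

# Return a hash ref of phrase => number of occurrences in $text.
sub find_phrases {
    my ($text) = @_;
    my %count;
    while ($text =~ /$any_phrase/g) {
        $count{$1}++;    # $1 is the literal phrase, i.e. a key of %phrases
    }
    return \%count;
}

my $hits = find_phrases('Ship to 110 Main Street, Huntington, AL, 55555 today.');
print "$_ ($phrases{$_}): $hits->{$_}\n" for sort keys %$hits;
```

    On Perl 5.10 and later the regex engine turns an alternation of literal strings into a trie, so this should scale better than the pattern's length suggests, but it is worth benchmarking against your real phrase list.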

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: REGEX or Not to REGEX for many items
by dmlond (Acolyte) on Oct 26, 2009 at 14:20 UTC

    I can't comment on KinoSearch, as I have never used it myself. It sounds like it is an ideal tool for what you are wanting to do. That being said, given that you are starting from the position of having a bunch of Perl scripts, each with its own REGEX defined in it, it would probably be an improvement for you in the short term (at least from a manageability standpoint) to simply coalesce these scripts into a single script that defines the REGEXes as an array of qr/$term/ entries, and loop over this array for each file.

    @matches = (qr/term1/, qr/term2/, qr/term3/,...);
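
    A minimal sketch of that loop, assuming each file is small enough to slurp into memory. The three patterns are placeholders and patterns_in_file is a name invented for the example.

```perl
use strict;
use warnings;

# Placeholder patterns, standing in for the regexes from the separate scripts.
my @matches = (qr/term1/, qr/term2/, qr/term3/);

# Return the patterns that match anywhere in the given file.
sub patterns_in_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> } // '';    # slurp the whole file
    close $fh;
    return grep { $text =~ $_ } @matches;
}

# Usage: perl scan.pl file1.txt file2.txt ...
for my $file (@ARGV) {
    my @found = patterns_in_file($file);
    print "$file: @found\n" if @found;
}
```

    Each file is read once and every pattern is tried against it, so the cost grows with files times patterns. That is fine for a handful of terms but is the part that hurts once the list reaches thousands.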

    You might also find Ack useful, as you might be able to program your list of terms against an instance of an Ack object.

Re: REGEX or Not to REGEX for many items
by gregor42 (Parson) on Oct 26, 2009 at 18:48 UTC

    If you are of a mind to roll your own - you might have a look at N-gram theory.

    IIRC, this was applied by the FAST search engine as the basis for their search algorithms.
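
    As a rough illustration of the n-gram idea (a toy sketch, not a description of how FAST or any other engine works): index each document's character trigrams once, then use them as a cheap filter before the exact check. The helper names ngrams and may_contain are invented for the example.

```perl
use strict;
use warnings;

# Return a hash ref whose keys are the overlapping character n-grams of $text.
sub ngrams {
    my ($text, $n) = @_;
    $n //= 3;
    my %seen;
    $seen{ substr($text, $_, $n) } = 1 for 0 .. length($text) - $n;
    return \%seen;
}

# A document can contain a phrase only if it contains every n-gram of the
# phrase, so a missing n-gram rules the document out without a full scan.
sub may_contain {
    my ($doc_grams, $phrase) = @_;
    for my $gram (keys %{ ngrams($phrase) }) {
        return 0 unless $doc_grams->{$gram};
    }
    return 1;    # only a candidate: still needs an exact check
}

my $doc   = 'the cow jumped over the moon';
my $grams = ngrams($doc);    # built once per document
for my $phrase ('cow jumped', 'cat jumped') {
    next unless may_contain($grams, $phrase);
    print "confirmed: $phrase\n" if index($doc, $phrase) >= 0;
}
```

    The filter can give false positives (all trigrams present but not adjacent), never false negatives, which is why the index() confirmation stays in.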



    Wait! This isn't a Parachute, this is a Backpack!
Re: REGEX or Not to REGEX for many items
by Argel (Prior) on Oct 26, 2009 at 20:21 UTC
    It sounds like performance is a big concern. If you expect to be processing tons of log files in the future then you may want to avoid reinventing the wheel and instead go with something like Splunk. It uses the Perl Compatible Regular Expressions (PCRE) library, so creating your search queries shouldn't be too hard (and you can even do some quick testing in Perl 5.10). It also has tools to help build queries, and you can create views, etc.

    Elda Taluta; Sarks Sark; Ark Arks