Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re^2: Matching against list of patterns

by Random_Walk (Parson)
on Sep 16, 2004 at 14:26 UTC ( #391464=note: print w/ replies, xml ) Need Help??


in reply to Re: Matching against list of patterns
in thread Matching against list of patterns

Tachyon,
I have a similar problem, over six hundred complex regexen to match against a busy logfile and related messages to issue depending on which regexps were matched. much the same problem as the OP

If I understand the regexp engine caching the compiled version of a regex if it is not going to change then I think this should be a reasonably efficient approach. Am I on the right tracks ? and is the /o unrequired as I have already interpolated the variable when the regex is first called ?

#!/usr/local/bin/perl -w use strict; my ($i, $compile_me, @names)=(1, "{my \@matches;", "no match"); while (<DATA>) { next if /^\s*$/; last if /END CONFIG/; chomp; my ($name, $reg)=split; push @names, $name; $compile_me.="push \@matches, $i if /$reg/o;"; $i++; } $compile_me.="\@matches}"; while (<DATA>) { chomp; print "\nmatches found for $_\n"; my @matches=eval $compile_me; foreach (@matches) { print $names[$_] , "\n" } } __DATA__ Fred_and_Friends fr.d Paul_and_co paul some_numbers \d{2} freud_likes_fred fr END CONFIG freud fred NaNa pauline 12312sdfsdf 2
Update with speed test

I have now run a comparative test over 300^H^H^H, sorry 416 lines of log, with my 672 pattern matches. First using the eval of a string containing all the regexen and returning match index numbers as above. Second is my old naive code holding an array with the regexen and doing a foreach through it against each line. I did not use the /o for the reasons given above it works fine without it

>time ./Monitor.fast real 0m1.49s user 0m0.68s sys 0m0.58s >time ./Monitor real 0m19.47s user 0m14.69s sys 0m0.50s >

I think the numbers speak for themselves

Cheers,
R.


Comment on Re^2: Matching against list of patterns
Select or Download Code
Re^3: Matching against list of patterns
by Eyck (Priest) on Sep 17, 2004 at 07:09 UTC

    Let me try decompile/understand what's going on here,

    One obvious improvement is using /o, right?

    The other is unwinding foreach loop into linear list of matches, what is the gain in that? We avoid walking the array, but I never thought that operation is that costly?

    Is this what's going on or am I missing something?

      When you have code like this

      # we read a line from somewhere to $line # and @regex is our array of patterns foreach (@regex) { print "Match!\n" if $line=/$_/; }
      Each time the regex is run it has a different pattern so has to compile the pattern again. I belive compiling the pattern can be quite costly for complex regexen. However what I have done is first to expand all the patterns into a code block
      # { # my @matches; # push @matches 1 if /first_regex/; # push @matches 2 is /second_regex/; # . # . # push @matches n if /nth_regex/; # @matches; # }
      This code block is stored in a scalar and eval'd. It runs against the line stored in $_ and returns a list of indices telling us which regex was matched. Now Perl can see that it has a series of invariant regexen and after the first eval caches the compiled regexen and uses this version on subsequent itterations. the problem is still n*m omplexity but we have stopped perl doing rather a large amount of work in each itteration. Certainly for my problem here comparing over 400 lines of log to over 600 reasonably complex regexen the speed up was >10 times.

      Cheers,
      R.

        Hmm, well, I still don't understand how unwinding array of regexpes can give 10fold improvement.

        I re-run your code against a bit different body of data (20 regexp, 5k lines), and results are a bit different:

        Array of regexpes unwinded, with /o:
        28.97s user 0.09s system 79% cpu 36.674 total
        Array of regexpes unwinded, without /o:
        29.95s user 0.04s system 95% cpu 31.481 total
        
        foreach loop, without /o:
        2.61s user 0.00s system 100% cpu 2.595 total
        foreach loop, with /o:
        0.33s user 0.01s system 17% cpu 1.957 total
        

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://391464]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (6)
As of 2015-07-03 20:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (56 votes), past polls