http://www.perlmonks.org?node_id=391464


in reply to Re: Matching against list of patterns
in thread Matching against list of patterns

Tachyon,
I have a similar problem: over six hundred complex regexen to match against a busy logfile, with related messages to issue depending on which regexps were matched. Much the same problem as the OP.

If I understand correctly that the regexp engine caches the compiled version of a regex when it is not going to change, then I think this should be a reasonably efficient approach. Am I on the right track? And is the /o unnecessary, since I have already interpolated the variable by the time the regex is first called?

#!/usr/local/bin/perl -w
use strict;

my ($i, $compile_me, @names) = (1, "{my \@matches;", "no match");

# read the config: a name and a regex per line, up to END CONFIG,
# building one big block of match statements as a string
while (<DATA>) {
    next if /^\s*$/;
    last if /END CONFIG/;
    chomp;
    my ($name, $reg) = split;
    push @names, $name;
    $compile_me .= "push \@matches, $i if /$reg/o;";
    $i++;
}
$compile_me .= "\@matches}";

# run the whole block against each remaining line and report the names
# of the patterns that matched
while (<DATA>) {
    chomp;
    print "\nmatches found for $_\n";
    my @matches = eval $compile_me;
    foreach (@matches) { print $names[$_], "\n" }
}

__DATA__
Fred_and_Friends fr.d
Paul_and_co paul
some_numbers \d{2}
freud_likes_fred fr
END CONFIG
freud
fred
NaNa
pauline
12312sdfsdf
2
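
For comparison, here is a minimal sketch of a qr//-based alternative rather than the string-eval above (the name/pattern pairs are made up for illustration): compiling each pattern into a qr// object up front means Perl never recompiles it per line, which also sidesteps the /o question.

#!/usr/bin/perl
use strict;
use warnings;

# hypothetical name => pattern pairs, standing in for the config section above
my @config = (
    [ Fred_and_Friends => 'fr.d'  ],
    [ Paul_and_co      => 'paul'  ],
    [ some_numbers     => '\d{2}' ],
);

# compile each pattern exactly once; the qr// objects are reused as-is,
# so there is no per-line recompilation and no need for /o
my @checks = map { [ $_->[0], qr/$_->[1]/ ] } @config;

while (my $line = <STDIN>) {
    chomp $line;
    my @hits = grep { $line =~ $_->[1] } @checks;
    print "$line matches: ", join(', ', map { $_->[0] } @hits), "\n" if @hits;
}

Whether this beats the eval'd block for six hundred patterns is something only a measurement would tell; the point is just that qr// makes the caching explicit.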
Update with speed test

I have now run a comparative test over 300^H^H^H, sorry, 416 lines of log, with my 672 pattern matches. The first run uses the eval of a string containing all the regexen and returning match index numbers, as above. The second is my old naive code, holding the regexen in an array and doing a foreach over it against each line. I did not use the /o, for the reasons given above; it works fine without it.

>time ./Monitor.fast
real    0m1.49s
user    0m0.68s
sys     0m0.58s
>time ./Monitor
real    0m19.47s
user    0m14.69s
sys     0m0.50s
>

I think the numbers speak for themselves.

Cheers,
R.

Re^3: Matching against list of patterns
by Eyck (Priest) on Sep 17, 2004 at 07:09 UTC

    Let me try to decompile/understand what's going on here.

    One obvious improvement is using /o, right?

    The other is unwinding the foreach loop into a linear list of matches. What is the gain in that? We avoid walking the array, but I never thought that operation was that costly.

    Is this what's going on or am I missing something?

      When you have code like this

      # we read a line from somewhere into $line
      # and @regex is our array of patterns
      foreach (@regex) {
          print "Match!\n" if $line =~ /$_/;
      }
      Each time the regex is run it has a different pattern, so the pattern has to be compiled again. I believe compiling the pattern can be quite costly for complex regexen. However, what I have done is first expand all the patterns into a code block:
      # {
      #     my @matches;
      #     push @matches, 1 if /first_regex/;
      #     push @matches, 2 if /second_regex/;
      #     .
      #     .
      #     push @matches, n if /nth_regex/;
      #     @matches;
      # }
      This code block is stored in a scalar and eval'd. It runs against the line stored in $_ and returns a list of indices telling us which regexen matched. Now Perl can see that it has a series of invariant regexen, so after the first eval it caches the compiled regexen and reuses them on subsequent iterations. The problem is still n*m complexity, but we have stopped Perl doing rather a large amount of work in each iteration. Certainly for my problem here, comparing over 400 lines of log against over 600 reasonably complex regexen, the speed-up was more than tenfold.
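
      As a rough, self-contained illustration of how the two approaches could be compared (this is not the actual Monitor test; the patterns, the log line, and the counts below are made up), something along these lines with the core Benchmark module would do:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # made-up patterns and log line, purely for illustration
      my @regex = map { "foo${_}\\d{2,4}(bar|baz)" } 1 .. 50;
      my $line  = 'noise foo71234baz quux';

      # build the unrolled block once, as described above
      my $i = 0;
      my $compile_me = "{my \@matches;";
      $compile_me .= "push \@matches, " . $i++ . " if /$_/;" for @regex;
      $compile_me .= "\@matches}";

      cmpthese(-2, {
          foreach_loop => sub {
              my @m;
              for my $j (0 .. $#regex) {
                  push @m, $j if $line =~ /$regex[$j]/;
              }
          },
          unrolled_eval => sub {
              my @m;
              for ($line) {               # the eval'd block matches against $_
                  @m = eval $compile_me;
              }
          },
      });

      cmpthese with a negative count runs each sub for at least that many CPU seconds and prints a rate comparison, so the result is easy to read off directly.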

      Cheers,
      R.

        Hmm, well, I still don't understand how unwinding an array of regexes can give a tenfold improvement.

        I re-ran your code against a somewhat different body of data (20 regexes, 5k lines), and the results are rather different:

        Array of regexes unwound, with /o:
        28.97s user 0.09s system 79% cpu 36.674 total
        Array of regexes unwound, without /o:
        29.95s user 0.04s system 95% cpu 31.481 total
        
        foreach loop, without /o:
        2.61s user 0.00s system 100% cpu 2.595 total
        foreach loop, with /o:
        0.33s user 0.01s system 17% cpu 1.957 total
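
        A general Perl caveat worth remembering when /o looks like a big win inside a loop (this is not a claim about the timings above): /o tells Perl the interpolated pattern will never change, so a loop that feeds a different pattern through the same match each iteration can silently keep using the first compiled pattern. A minimal illustration:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @regex = ('foo', 'bar');
        my $line  = 'bar';

        for my $pat (@regex) {
            # with /o the match op compiles its pattern only once,
            # so 'bar' may never actually be tried here
            print "with /o:    matched '$pat'\n" if $line =~ /$pat/o;
            # without /o the pattern is recompiled when $pat changes
            print "without /o: matched '$pat'\n" if $line =~ /$pat/;
        }

        If the output shows only the "without /o" line, the /o version was matching the wrong pattern, which is cheap but not useful.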