http://www.perlmonks.org?node_id=391421


in reply to Matching against list of patterns

The typical method is to dynamically build an alternation RE and let the RE engine do all the optimisation and hard work in lovingly hand optimised C. We use qr// so the RE is compiled once and avoid vast numbers of hash lookups, functionally useless variable assignments*, and RE compiles that you are doing above. As a general rule loops are great territory for optimisation due to the repeated execution of the loop code.

my @terms = qw( foo bar baz fool foolish ); my $re = join '|', map{quotemeta} sort {length($b)<=>length($a)}@terms +; $re = qr/($re)/; while(<DATA>){ my @matches = $_ =~ m/$re/g; next unless @matches; print "Got @matches\n" } __DATA__ Only a foobar foolish fool skips the sort by length Without this foo you look foolish and may match short elements!

Hash lookup tables may still have a place to convert a matched value into something else. Here is a trivial example. Beware /i as you need to lc($1) and use all lowercase keys in %terms or you won't get a match/lookup.

my %terms = ( foo => 'FOO', bar => 'BAR' ); my $re = join '|', map{quotemeta} sort {length($b)<=>length($a)}keys % +terms; $re = qr/($re)/i; while(<DATA>){ s/$re/$terms{lc($1)}/g; print; } __DATA__ foo is a word Bar ba blacksheep

cheers

tachyon

Replies are listed 'Best First'.
Re^2: Matching against list of patterns
by Random_Walk (Prior) on Sep 16, 2004 at 14:26 UTC

    Tachyon,
    I have a similar problem, over six hundred complex regexen to match against a busy logfile and related messages to issue depending on which regexps were matched. much the same problem as the OP

    If I understand the regexp engine caching the compiled version of a regex if it is not going to change then I think this should be a reasonably efficient approach. Am I on the right tracks ? and is the /o unrequired as I have already interpolated the variable when the regex is first called ?

    #!/usr/local/bin/perl -w use strict; my ($i, $compile_me, @names)=(1, "{my \@matches;", "no match"); while (<DATA>) { next if /^\s*$/; last if /END CONFIG/; chomp; my ($name, $reg)=split; push @names, $name; $compile_me.="push \@matches, $i if /$reg/o;"; $i++; } $compile_me.="\@matches}"; while (<DATA>) { chomp; print "\nmatches found for $_\n"; my @matches=eval $compile_me; foreach (@matches) { print $names[$_] , "\n" } } __DATA__ Fred_and_Friends fr.d Paul_and_co paul some_numbers \d{2} freud_likes_fred fr END CONFIG freud fred NaNa pauline 12312sdfsdf 2
    Update with speed test

    I have now run a comparative test over 300^H^H^H, sorry 416 lines of log, with my 672 pattern matches. First using the eval of a string containing all the regexen and returning match index numbers as above. Second is my old naive code holding an array with the regexen and doing a foreach through it against each line. I did not use the /o for the reasons given above it works fine without it

    >time ./Monitor.fast real 0m1.49s user 0m0.68s sys 0m0.58s >time ./Monitor real 0m19.47s user 0m14.69s sys 0m0.50s >

    I think the numbers speak for themselves

    Cheers,
    R.

      Let me try decompile/understand what's going on here,

      One obvious improvement is using /o, right?

      The other is unwinding foreach loop into linear list of matches, what is the gain in that? We avoid walking the array, but I never thought that operation is that costly?

      Is this what's going on or am I missing something?

        When you have code like this

        # we read a line from somewhere to $line # and @regex is our array of patterns foreach (@regex) { print "Match!\n" if $line=/$_/; }
        Each time the regex is run it has a different pattern so has to compile the pattern again. I belive compiling the pattern can be quite costly for complex regexen. However what I have done is first to expand all the patterns into a code block
        # { # my @matches; # push @matches 1 if /first_regex/; # push @matches 2 is /second_regex/; # . # . # push @matches n if /nth_regex/; # @matches; # }
        This code block is stored in a scalar and eval'd. It runs against the line stored in $_ and returns a list of indices telling us which regex was matched. Now Perl can see that it has a series of invariant regexen and after the first eval caches the compiled regexen and uses this version on subsequent itterations. the problem is still n*m omplexity but we have stopped perl doing rather a large amount of work in each itteration. Certainly for my problem here comparing over 400 lines of log to over 600 reasonably complex regexen the speed up was >10 times.

        Cheers,
        R.

Re^2: Matching against list of patterns
by Eyck (Priest) on Sep 16, 2004 at 13:42 UTC

    This is very nice, but this works well with terms, not that well with regular expressions (ie, not 'foo' but something like 'f[o|O]{1,3}' instead).

    Also, this solves little because this only tells me that my line matches one of my patterns.. that is great, but the problem here is finding WHICH re it matches.

    I look at your answer, and I think I'm not clever enough to extend this kind of idea to those requirements.

    And another thing, how flexible and scalable are perl's res, what happens when there's 1000s of ~40 chars long REs ganged together, will regexp engine/optimizer sort through that?

      Remove the map quotemeta and you can use REs to your hears delight. The only issue is then extra | in those REs. My example captures the physical match. It is a simple task to map that back to the pattern (or patterns) that match that if required. The benchmark at Re^2: Matching against list of patterns shows a full order of magnitude speed improvement so the optimiser does OK.

      cheers

      tachyon

        Note that that benchmark is against the code that still does the whole 'foreach' thing, only in different way.

        I'm sorry, but I don't understand how can I accomplish this simple task of mapping 'physical match back to the pattern or patterns that match'.

        This is the task that we're trying to accomplish at first place, so how is it 'simple' if after a bunch of lines of ingenious code we reduce the problem to itself?

        Either I'm missing something, or I described the problem in confusing way?