Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Re^3: Matching against list of patterns

by Eyck (Priest)
on Sep 17, 2004 at 07:09 UTC ( #391695=note: print w/ replies, xml ) Need Help??

in reply to Re^2: Matching against list of patterns
in thread Matching against list of patterns

Let me try decompile/understand what's going on here,

One obvious improvement is using /o, right?

The other is unwinding foreach loop into linear list of matches, what is the gain in that? We avoid walking the array, but I never thought that operation is that costly?

Is this what's going on or am I missing something?

Comment on Re^3: Matching against list of patterns
Download Code
Re^4: Matching against list of patterns
by Random_Walk (Parson) on Sep 17, 2004 at 08:40 UTC

    When you have code like this

    # we read a line from somewhere to $line # and @regex is our array of patterns foreach (@regex) { print "Match!\n" if $line=/$_/; }
    Each time the regex is run it has a different pattern so has to compile the pattern again. I belive compiling the pattern can be quite costly for complex regexen. However what I have done is first to expand all the patterns into a code block
    # { # my @matches; # push @matches 1 if /first_regex/; # push @matches 2 is /second_regex/; # . # . # push @matches n if /nth_regex/; # @matches; # }
    This code block is stored in a scalar and eval'd. It runs against the line stored in $_ and returns a list of indices telling us which regex was matched. Now Perl can see that it has a series of invariant regexen and after the first eval caches the compiled regexen and uses this version on subsequent itterations. the problem is still n*m omplexity but we have stopped perl doing rather a large amount of work in each itteration. Certainly for my problem here comparing over 400 lines of log to over 600 reasonably complex regexen the speed up was >10 times.


      Hmm, well, I still don't understand how unwinding array of regexpes can give 10fold improvement.

      I re-run your code against a bit different body of data (20 regexp, 5k lines), and results are a bit different:

      Array of regexpes unwinded, with /o:
      28.97s user 0.09s system 79% cpu 36.674 total
      Array of regexpes unwinded, without /o:
      29.95s user 0.04s system 95% cpu 31.481 total
      foreach loop, without /o:
      2.61s user 0.00s system 100% cpu 2.595 total
      foreach loop, with /o:
      0.33s user 0.01s system 17% cpu 1.957 total

        I must say I have been puzzled too..... I have now run a test with an adaption of the short code I posted and got these results (I'll stick my test code at the bottom, its rough as a badgers arse)

        root@tivpre-master:/home/robinp # ./regexps with eval "optimisation" timethis 5000: 56 wallclock secs (52.42 usr + 0.00 sys = 52.42 CPU) naive timethis 5000: 16 wallclock secs (15.63 usr + 0.00 sys = 15.63 CPU)
        however the other results I posted were for real life code where I do get a magnitude better performance using the eval method. Unfortunately I have written the code for a client and can post neither it or the regex I am using (perhaps I can sanitise a couple of example regex). I suspect that the more complex regex I use and the large number make the saving in re-compilation overcome the cost of running eval. For simpler cases and less regex doing an eval probably costs more than the recompiling overhead.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://391695]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (17)
As of 2014-07-30 19:01 GMT
Find Nodes?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:

    Results (239 votes), past polls