Re: Profiling regular expressions

by BrowserUk (Pope)
on Jan 31, 2003 at 01:46 UTC (#231487)

in reply to Profiling regular expressions

This seems like an ideal application for Filter::Simple. Here's what I came up with.

The filter package

    package My::Filter;

    use Filter::Simple;
    use Benchmark::Timer;

    our $t = Benchmark::Timer->new();

    FILTER_ONLY
        regex => sub {
            $_ =  "(?{{ \$My::Filter::t->start('$_') }})"
                . "(?:$_)"
                . "(?{{ \$My::Filter::t->stop('$_') }})";
        },
    ;

    sub report {
        return $t->report()
    }

    1;

This requires that you add one line at the top of your program, and one at the end to generate the report. It also requires the installation of Filter::Simple, which has a dependency on Text::Balanced, but both are pure-Perl modules and install easily. It also uses Benchmark::Timer, which as far as I know is a separate CPAN module rather than part of the standard distribution, so it may need installing as well.

The test program

    #! perl -slw
    use strict;
    use My::Filter;    #<< Include the module

    my $stuff = 'abcdefghijklmnopqrstuvwxyz';

    for (1..1000) {
        if ( $stuff =~ m[pqr] ) {
            $stuff =~ s/(\G(?:.{3})+?)(?<=...)(.)(.)/$1$3$2/g;
        }
        $_ = $stuff;
        my $OK = 1 if m[pqr];
    }

    print $stuff;
    print '=' x 20, 'Timing of regexs in ', $0, '=' x 20;
    print My::Filter::report();    # << Generate report

Some sample output

    C:\test>test
    abcdefghijklmnopqrstuvwxyz
    ====================Timing of regexs in C:\test\ +=====
    2000 trials of pqr (290.000ms total), 145us/trial
    5000 trials of (\G(?:.{3})+?)(?<=...)(.)(.) (820.000ms total), 164us/trial

Note: This is tested as far as you see it above. I intend to do much more testing and maybe package it up for CPAN if it proves usable and useful, but that may take some time.

A couple of caveats.

Filter::Simple seems to have trouble with s/// if you use two sets of identical, balanced delimiters, e.g. s[....][...]. If you're using this style you may have to change your source slightly. Other limitations like this are bound to exist.

The filter actually embeds the timer code at the front and back of the regex itself. Even though the code is embedded using zero-width code assertions, it's quite possible--even likely--that their presence may change the meaning of some regexes. I haven't encountered one yet, but it could happen. If the output of your code changes, wrap the suspect line between a no My::Filter; line and a use My::Filter; line. I haven't had occasion to test this work-around yet.

It's worth pointing out that the code profiled is the regex itself. Not the statement it is a part of, nor even the whole s///. Only the left-hand side of these statements will be profiled. Hopefully, this is the most useful information anyway. It does seem to count and profile each iteration of those regexes using the /g modifier successfully.
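To make the two points above concrete, here is a minimal hand-written sketch of roughly what the filter turns a regex into. It is not the filter itself: it uses only core Perl, and timer_start, timer_stop and the three hashes are hypothetical stand-ins for the Benchmark::Timer calls, so the numbers will not match a real report.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Package variables, so the subs called from the code blocks can reach them.
our ( %started, %total, %count );

# Hypothetical stand-ins for Benchmark::Timer's start() and stop().
sub timer_start { $started{ $_[0] } = Time::HiRes::time() }

sub timer_stop {
    my ($label) = @_;
    $total{$label} += Time::HiRes::time() - $started{$label};
    $count{$label}++;
}

my $stuff = 'abcdefghijklmnopqrstuvwxyz';

for ( 1 .. 1000 ) {
    # Hand-written approximation of what m[pqr] looks like after filtering:
    # only the pattern is wrapped, not the statement around it.
    my $found = $stuff =~ m[(?{ timer_start('pqr') })(?:pqr)(?{ timer_stop('pqr') })];
}

# With /g the stop block runs once per successful match, so one statement
# can account for several trials.
my @hits = 'aXbXc' =~ m[(?{ timer_start('X') })(?:X)(?{ timer_stop('X') })]g;

printf "%d trials of %s (%.3fms total)\n", $count{$_}, $_, 1000 * $total{$_}
    for sort keys %count;
```

Because the stop block only runs when the engine gets past the wrapped pattern, failed attempts are not counted as trials in this sketch.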

Relating the regex back to the source code is currently a manual effort. Unfortunately, when the code is evaluated, the __LINE__ macro is not set :(. I haven't thought of a work-around for this yet. This has the unfortunate side-effect that lexically identical regexes on different lines of the source get counted and timed as the same thing. This is usually fairly easy to work around: wrapping one of the m[pqr]'s in non-capturing brackets, for instance m[(?:pqr)], will in most if not all cases make no difference to the function of the regex, but will allow them to be distinguished in the timings.
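A small sketch of that work-around, assuming nothing beyond core Perl: both spellings match the same span of the test string, but their source text differs, so a report keyed on the regex text would list them separately. The %span hash is only there for illustration.

```perl
use strict;
use warnings;

my $stuff = 'abcdefghijklmnopqrstuvwxyz';

# Record the start and end offsets of each match, keyed by the regex text
# that would be used as the timer label.
my %span;
$span{'pqr'}     = join( '-', $-[0], $+[0] ) if $stuff =~ m[pqr];
$span{'(?:pqr)'} = join( '-', $-[0], $+[0] ) if $stuff =~ m[(?:pqr)];

print "$_ matched at offsets $span{$_}\n" for sort keys %span;
```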

I haven't tested this with qr[...] style regexes yet.

If anyone has any suggestions for determining the line numbers at which the regexes appear, I'd be pleased to hear of them. Or anything else for that matter.

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Replies are listed 'Best First'.
Re: Re: Profiling regular expressions
by Mur (Pilgrim) on Feb 04, 2003 at 20:40 UTC
    Well, I finally got around to trying this. Unfortunately, it coughs and dies on things like this:
    $tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/$1<\/para><para>$2/gs;
    And of course I've no idea why ... Apparently the resulting source is syntactically incorrect.
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
    ...Nexcerpt...Connecting People With Expertise

      An error msg, and a small sample of test data would have been nice.

      The problem appears to be caused by the combination of the code blocks that the filter embeds in each regex and the variables your regexes contain. Those variables require interpolation, so the regex is compiled at run time, and by default Perl prohibits code blocks in a regex that is built by run-time interpolation. To allow it, we need to add

      use re 'eval';

      to the program under test. I hoped that I could add it to the filter module itself, but that doesn't work. (Obvious why once you've tried it, but...) Anyway, with that line added to the top of the program under test, the filter seems to work fine again, without modification from the version presented above.
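The restriction can be reproduced without the filter. The sketch below is an illustration under assumptions, not the failing program: it puts a code block into a string (the hypothetical $pattern) so that the regex is only compiled at run time, then tries the match with and without the pragma. The pragma is lexically scoped, which would be consistent with it having no effect when placed in the filter module rather than in the program under test.

```perl
use strict;
use warnings;

our $hits = 0;

# The code block reaches the regex through an interpolated string, so the
# pattern can only be compiled at run time.
my $pattern = '(?{ $main::hits++ })a+';

# Without the pragma, Perl refuses to compile the pattern and dies.
my $died_without = !eval { my $m = 'aaa' =~ /$pattern/; 1 };

my $matched_with;
{
    use re 'eval';    # lexically scoped: only affects this block
    $matched_with = 'aaa' =~ /$pattern/ ? 1 : 0;
}

print "died without pragma: ", ( $died_without ? 'yes' : 'no' ), "\n";
print "matched with pragma: ", ( $matched_with ? 'yes' : 'no' ), "\n";
```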

      A quick test prog

      I'd like to suggest using the /x option on your regexes to make them a little more readable, but I tried it and whilst they still work, it has a significant effect upon performance, which is presumably what you're trying to improve.

      One minor improvement to the readability of the output report can be obtained by changing

      $My::Filter::t->start('$_')

      to $My::Filter::t->start('$/$_$/')

      Make sure you make the same change to the stop() line as well.

      I also tried a version of the filter that used a simple numbering scheme for the start/stop labels, which makes the output more readable but makes relating the number in the report back to the individual regex in the code considerably harder. Post a reply if you want a copy of that version.

      I still think that if I could find a way of using the __LINE__ macro as the timer label, it would be a better option than the text of the regex itself, but that doesn't work for obvious reasons.

      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      You could try a less ambitious filter that timed lines containing an "=~", something like this:
          package Filt;
          use Filter::Simple;

          FILTER {
              my @set = split /\n/, $_;
              my $new = '';
              foreach my $line (@set) {
                  if ($line =~ /\=\~/) {
                      $new .= "\$t->start('$line');\n";
                      $new .= $line . "\n";
                      $new .= "\$t->stop();\n";
                  }
                  else {
                      $new .= $line . "\n";
                  }
              }
              $_ = $new;
          };
          1 ;
      I tried it on the following simple case, and it worked.
          use strict;
          use Benchmark::Timer;
          use Filt;
          our $t = Benchmark::Timer->new();
          my $short_line_threshold = 2;
          my $short_line_counter = 1;
          my $long_line_threshold = 7;
          my $tmp = "Abcdef";
          $tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/$1<\/para><para>$2/gs;
          print $t->report();
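To see what a line-based filter like this would generate, the same rewrite can be run as an ordinary function and the result printed instead of executed. This is a sketch rather than the filter above: instrument and $sample are hypothetical names, and the escaping of quotes and backslashes in the label is an addition the original does not have, included on the assumption that a line containing a single quote would otherwise break the generated start() call.

```perl
use strict;
use warnings;

# Same line-by-line rewrite as the filter, as a plain function that returns
# the instrumented source text.
sub instrument {
    my ($source) = @_;
    my $new = '';
    foreach my $line ( split /\n/, $source ) {
        if ( $line =~ /=~/ ) {
            # Assumption, not in the original: escape ' and \ so the label
            # remains a valid single-quoted string.
            ( my $label = $line ) =~ s/(['\\])/\\$1/g;
            $new .= "\$t->start('$label');\n";
            $new .= "$line\n";
            $new .= "\$t->stop();\n";
        }
        else {
            $new .= "$line\n";
        }
    }
    return $new;
}

my $sample = <<'END';
my $tmp = "Abcdef";
$tmp =~ s/b/'B'/;
print $tmp;
END

print instrument($sample);
```

Note that a statement spanning several lines, or an =~ inside an if (...) { line, would have its start and stop calls inserted in the wrong places, so this approach suits code with one regex statement per line.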
