Re: Profiling regular expressions

in reply to Profiling regular expressions

This seems like an ideal application for Filter::Simple. Here's what I came up with.

The filter package

package My::Filter;
use Filter::Simple;
use Benchmark::Timer;

our $t = Benchmark::Timer->new();

FILTER_ONLY
    regex        => sub {
            $_ = "(?{{ \$My::Filter::t->start('$_') }})"
                 . "(?:$_)"
                 . "(?{{ \$My::Filter::t->stop('$_') }})";
        },
;

sub report{ return $t->report() }

1;
[download]

This requires that you add one line at the top of your program, and one at the bottom/end to generate the report. It also requires the installation of Filter::Simple which has a dependancy on Text::Balanced, but they are both pure-Perl modules and install easily. It also uses benchmark::Timer, but I think this is a component of most standard distributions.

The test program

#! perl -slw
use strict;
use My::Filter; #<< Include the module 

my $stuff = 'abcdefghijklmnopqrstuvwxyz';
for (1..1000) {
    if ( $stuff =~ m[pqr] ) {
        $stuff =~ s/(\G(?:.{3})+?)(?<=...)(.)(.)/$1$3$2/g;
    }
    $_ = $stuff;
    my $OK = 1 if m[pqr];
}
print $stuff;

print '=' x 20, 'Timing of regexs in ', $0, '=' x 20;
print My::Filter::report(); # << Generate report
[download]

Some sample output

C:\test>test
abcdefghijklmnopqrstuvwxyz
====================Timing of regexs in C:\test\test.pl===============
+=====
2000 trials of pqr (290.000ms total), 145us/trial

5000 trials of (\G(?:.{3})+?)(?<=...)(.)(.) (820.000ms total), 164us/t
+rial
[download]

Note: This is tested as far as you see it above. I intend to do much more testing and maybe package it up for CPAN if it proves usable and useful, but that may take some time.

A couple of caveats.

Filter::Simple seems to have trouble with s/// if you use two set of identical, balanced delimiters eg. s[....][...]. If your using this style you may have to change your source slightly. Other limitations like this are bound to exists.

The filter actually embeds the timer code at the front and back of the regex itself. Even though the code is embedded using zero-width code assertions, it's quite possible--even likely--that their presence may change the meaning of some regexes. I haven't encountered one yet, but it could. If the output of your code changes, wrap the line suspected in no My::Filter; use My::Filter;. I haven't had occasion to tested this work-around yet.

It's worth pointing out that the code profiled is the regex itself. Not the statement it is a part of, nor even the whole s///. Only the left-hand side of these statements will be profiled. Hopefully, this is the most useful information anyway. It does seem to count and profile each iteration of those regexes using the /g modifier successfully.

Relating the regex back to the source code is currently a manual effort. Unfortunately, when the code is evaluated, the __LINE__ macro is not set:(. I haven't thought of a work-around for this yet. This has the unfortunate side-effect that Lexically identical regexes in different lines of the source get counted and timed as the same thing. This is usually fairly easy to work around, wrapping one of the m[pqr]'s in non-capturing brackets for instance m[(?:pqr)] will in most if not all cases, make no difference to the function of the regex, but allow them to be distinguished in the timings.

I haven't tested this with qr[...] style regexes yet.

If anyone has any suggestions for determining the line numbers at which the regexes appear, I'd be pleased to hear of them. Or anything else for that matter.

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

In Section Seekers of Perl Wisdom