http://www.perlmonks.org?node_id=846761

Hessu has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

While I was playing around with Devel::NYTProf version 4.03, I started to wonder about regexp matching optimization. At least in the tests below, it seems that the /o option offers better performance than a precompiled regexp. There is of course variation between measurements, but the difference between the /o option and the precompiled regexp remains.

  1. Readonly constant with /o option - CORE:regcomp, avg 675ns/call
  2. Regexp match operator with local variable - CORE:regcomp, avg 718ns/call
  3. Use constant in match operator - CORE:regcomp avg 721ns/call
  4. Precompiled regexp - CORE:regcomp, avg 1µs/call
  5. Readonly constant in match operator - CORE:regcomp avg 6µs/call (Updated)

There are two things that I am wondering:

  1. It seems that CORE::regcomp is called every time through the loop when a variable or constant is used in the regexp match operator. The time spent there just varies based on the regexp. From "What is /o really for?" I first assumed that regexp compilation is done only once?
  2. It was also a surprise that the /o option is actually faster, at least in this case, than a precompiled regexp. Is this how it should be, or am I missing something?

The tests have been made with ActiveState Perl version 5.10.1 Binary build 1006 291086 - Aug 24 2009 13:48:26. Hardware was Win7, 4GB, SSD HD, Intel Core7 920 2.6GHz.

Thank You

#!/usr/bin/perl -w
#####################################################################
# Test regexp matching
#
# > perl -MDevel::NYTProf=savesrc=1 optimize_regexp.pl
# > nytprofhtml
#####################################################################
use strict;
use warnings;
use Cwd;
use Readonly;
use Path::Class qw(file dir);
use Date::Calc qw(Today_and_Now);
use Fcntl qw(O_WRONLY O_CREAT O_TRUNC O_RDONLY);

#####################################################################
##
## CONSTANTS
##
#####################################################################
Readonly my $EMPTY => q{};
Readonly my $TOOL_ROOT => getcwd;
Readonly my $TEMP_FILE_NAME => 'temp_file.txt';
Readonly my $TEMP_FILE_SIZE => 1000000;

#####################################################################
##
## MAIN
##
#####################################################################
my $l_line = $EMPTY;
my $l_temp_file = $EMPTY;
my $l_file_h = $EMPTY;

# Create temporary file that is read for tests.
$l_temp_file = file($TOOL_ROOT, $TEMP_FILE_NAME);
create_temp_file($l_temp_file);

#####################################################################
# Readonly constant in match operator - regcomp avg 6µs/call
#####################################################################
Readonly my $REGEXP_READONLY => '999986';
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ m/$REGEXP_READONLY/ ) {
        # 5.78s - 1000001 calls to main::CORE:regcomp, avg 6µs/call
        # 4.88s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # 1.83s - 1000001 calls to Readonly::Scalar::FETCH, avg 2µs/call
        # 909ms - 1000001 calls to main::CORE:match, avg 859ns/call
        chomp $l_line;
        LOG("Regexp 01 - matched line ($l_line)");
    }
}
$l_file_h->close();

#####################################################################
# Use constant in match operator - regcomp avg 721ns/call
#####################################################################
use constant REGEXP_CONSTANT => '999986';
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ m/${\REGEXP_CONSTANT}/ ) {
        # 4.83s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # 745ms - 1000001 calls to main::CORE:match, avg 729ns/call
        # 735ms - 1000001 calls to main::CORE:regcomp, avg 721ns/call
        chomp $l_line;
        LOG("Regexp 02 - matched line ($l_line)");
    }
}
$l_file_h->close();

#####################################################################
# No constant in match operator - no regcomp called
#####################################################################
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ m/999986/ ) {
        # spent 4.78s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # spent 838ms - 1000001 calls to main::CORE:match, avg 838ns/call
        chomp $l_line;
        LOG("Regexp 03 - matched line ($l_line)");
    }
}
$l_file_h->close();

#####################################################################
# Readonly constant with /o option - regcomp, avg 675ns/call
#####################################################################
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ m/$REGEXP_READONLY/o ) {
        # 4.84s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # 754ms - 1000001 calls to main::CORE:match, avg 754ns/call
        # 732ms - 1000001 calls to main::CORE:regcomp, avg 675ns/call
        # 0s - 2 calls to Readonly::Scalar::FETCH, avg 0s/call
        chomp $l_line;
        LOG("Regexp 04 - matched line ($l_line)");
    }
}
$l_file_h->close();

#####################################################################
# Precompiled regexp - regcomp, avg 1µs/call
#####################################################################
my $l_search_r = qr/$REGEXP_READONLY/;
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ $l_search_r ) {
        # 4.77s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # 1.33s - 1000001 calls to main::CORE:regcomp, avg 1µs/call
        # 776ms - 1000001 calls to main::CORE:match, avg 776ns/call
        chomp $l_line;
        LOG("Regexp 05 - matched line ($l_line)");
    }
}
$l_file_h->close();

#####################################################################
# Regexp match operator with local variable - regcomp, avg 718ns/call
#####################################################################
my $l_search = $REGEXP_READONLY;
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ m/$l_search/ ) {
        # 4.73s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # 759ms - 1000001 calls to main::CORE:match, avg 766ns/call
        # 741ms - 1000001 calls to main::CORE:regcomp, avg 718ns/call
        chomp $l_line;
        LOG("Regexp 06 - matched line ($l_line)");
    }
}
$l_file_h->close();

#####################################################################
# Regexp match with variable and /o option - regcomp, avg 690ns/call
#####################################################################
$l_search = $REGEXP_READONLY;
$l_file_h = IO::File->new($l_temp_file, O_RDONLY);
while( $l_line = $l_file_h->getline() ) {
    if( $l_line =~ m/$l_search/o ) {
        # 4.86s - 1000001 calls to IO::Handle::getline, avg 5µs/call
        # 758ms - 1000001 calls to main::CORE:match, avg 758ns/call
        # 690ms - 1000001 calls to main::CORE:regcomp, avg 690ns/call
        chomp $l_line;
        LOG("Regexp 07 - matched line ($l_line)");
    }
}
$l_file_h->close();

exit 0;

#####################################################################
##
## SUBROUTINES
##
#####################################################################
sub create_temp_file{
    my $p_file = shift;
    my $l_file_h = $EMPTY;
    LOG("print file ($p_file)");
    $l_file_h = IO::File->new($p_file, O_WRONLY|O_TRUNC|O_CREAT);
    for( 0 .. $TEMP_FILE_SIZE ) {
        print {$l_file_h} 'Line number is = ' . $_ . "\n";
    }
    $l_file_h->close();
    return;
}

sub LOG{
    my $l_time = [Today_and_Now()];
    my $l_string = sprintf('%d-%02d-%02d %02d:%02d:%02d', @{$l_time});
    $l_string = $l_string . q{ - } . $_[0];
    print sprintf("%s\n", $l_string);
    return;
}

Replies are listed 'Best First'.
Re: Regexp optimization - /o option better than precompiled regexp?
by ikegami (Patriarch) on Jun 27, 2010 at 20:32 UTC

    Patterns without interpolation are compiled when their quoting operator (m//, s///, qr//) is compiled. /o shouldn't matter one bit for those, so I won't discuss them.

    Patterns with interpolation are compiled when their quoting operator (m//, s///, qr//) is executed.

    Perl caches compiled regex patterns to avoid needless recompilation in situations like the following:

    # Compiles each pattern once since m// realises
    # you're using the same pattern twice in a row.
    for my $re (qw( foo bar )) {
        for (1..2) {
            /$re/
        }
    }

    A match or substitution operator can only reuse the last regex it compiled, so the following isn't efficient:

    # Compiles each pattern twice
    for (1..2) {
        for my $re (qw( foo bar )) {
            /$re/
        }
    }
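The "only the last regex" rule can be pictured with a toy model written in plain Perl. This is only a sketch of the behaviour described above, not perl's actual implementation: `make_match_op` and its one-slot cache are made up for illustration, and the real check lives in C inside the match operator.

```perl
use strict;
use warnings;

# Toy model (hypothetical, for illustration only): one match operator
# remembers just the source string of the pattern it compiled last.
sub make_match_op {
    my %op = (last_src => undef, regex => undef, compiles => 0);
    my $match = sub {
        my ($src, $text) = @_;
        if (!defined $op{last_src} || $op{last_src} ne $src) {
            $op{regex}    = qr/$src/;    # stands in for the expensive step
            $op{last_src} = $src;
            $op{compiles}++;
        }
        return $text =~ $op{regex};
    };
    return ($match, sub { $op{compiles} });
}

# Same pattern twice in a row: one compile per pattern.
my ($match_a, $compiles_a) = make_match_op();
for my $re (qw( foo bar )) {
    $match_a->($re, 'foobar') for 1 .. 2;
}

# Patterns alternate: every call evicts the remembered regex.
my ($match_b, $compiles_b) = make_match_op();
for (1 .. 2) {
    $match_b->($_, 'foobar') for qw( foo bar );
}

printf "grouped: %d compiles, alternating: %d compiles\n",
    $compiles_a->(), $compiles_b->();
```

In the model the grouped loop compiles twice and the alternating loop four times, which is the same 1x versus 2x per pattern difference the two snippets above describe.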

    You can use qr// to precompile a pattern.

    # Compiles each pattern once
    my @res = map qr/$_/, qw( foo bar );
    for (1..2) {
        for my $re (@res) {
            /$re/
        }
    }

    Note that Perl currently flattens and recompiles compiled patterns interpolated into another pattern.

    # Doesn't recompile $re if it's a qr//.
    /$re/

    # Stringifies and recompiles $re if it's a qr//,
    # but it should be subject to the caching mentioned above.
    /x$re/
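A small sketch of what "stringifies" means here. The variable names are made up for illustration, and the stringified form is printed rather than stated because its exact spelling depends on the perl version.

```perl
use strict;
use warnings;

my $re = qr/b+/i;

# Used alone, the compiled object is matched directly.
my $alone = 'xBBB' =~ $re ? 1 : 0;

# Interpolated into a larger pattern, it is first turned back into text.
# The flags travel inside a (?...) group wrapped around the pattern.
my $as_text  = "$re";
my $embedded = 'xBBB' =~ /x$re/ ? 1 : 0;

# The /i flag stays scoped to the embedded part: the literal "x" in the
# outer pattern is still case-sensitive, so "XBBB" does not match.
my $outer_case = 'XBBB' =~ /x$re/ ? 1 : 0;

print "stringified: $as_text\n";
print "alone=$alone embedded=$embedded outer_case=$outer_case\n";
```

The second line of output is `alone=1 embedded=1 outer_case=0`. The point is that `/x$re/` builds a new pattern string around the text form of `$re` and compiles that, instead of reusing the already compiled object.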

    This should be optimised in the future.

Re: Regexp optimization - /o option better than precompiled regexp?
by jwkrahn (Abbot) on Jun 27, 2010 at 15:00 UTC
    it seems that option /o offers better performance than precompiled regexp.

    A "precompiled regexp" is one that has no variables and is therefore compiled at compile-time, for example m/999986/ or m/^abc.*def/.    Any regexp that contains a variable has to be compiled at run-time.    The /o option means that the regexp is interpolated and compiled only once for the complete life of the program.    Any other regexp that requires variable interpolation has to be interpolated and compiled each time that regexp is used.    The qr// operator allows you to interpolate and compile a regexp once, at run-time, for example from user input.

      "The /o option means that the regexp is interpolated and compiled only once for the complete life of the program."

      But why then is main::CORE:regcomp still being called on every iteration, just as in the cases without /o?

Re: Regexp optimization - /o option better than precompiled regexp? (analysis)
by tye (Sage) on Jun 28, 2010 at 19:31 UTC

    Wow. Those results were very hard to read and understand.

    First, none of your cases seem to be recompiling a regex each time through the loop. (It appears to me that) the worst case you've included does a string compare to determine that the regex doesn't need to be recompiled (and does this each time through the loop). Clearly, CORE::regcomp() doesn't unconditionally recompile a regex (based on parsing your results, it checks some things to determine if it even needs to do a string compare, then optionally does a string compare, and then only recompiles the regex if the string compare finds a difference).

    Let's look at your results cleaned up so the interesting numbers are much easier to compare:

    Readonly my $REGEXP_READONLY => '999986';
    if( $l_line =~ m/$REGEXP_READONLY/ ) {
        # 5.78s - CORE:regcomp
        # 0.91s - CORE:match
        # 1.83s - Readonly::Scalar::FETCH

    use constant REGEXP_CONSTANT => '999986';
    if( $l_line =~ m/${\REGEXP_CONSTANT}/ ) {
        # 0.74s - CORE:regcomp
        # 0.75s - CORE:match

    if( $l_line =~ m/999986/ ) {
        # 0.84s - CORE:match

    if( $l_line =~ m/$REGEXP_READONLY/o ) {
        # 0.73s - CORE:regcomp
        # 0.75s - CORE:match

    my $l_search_r = qr/$REGEXP_READONLY/;
    if( $l_line =~ $l_search_r ) {
        # 1.33s - CORE:regcomp
        # 0.78s - CORE:match

    my $l_search = $REGEXP_READONLY;
    if( $l_line =~ m/$l_search/ ) {
        # 0.74s - CORE:regcomp
        # 0.76s - CORE:match

    $l_search = $REGEXP_READONLY;
    if( $l_line =~ m/$l_search/o ) {
        # 0.69s - CORE:regcomp
        # 0.76s - CORE:match

    Second, let's take care of the least interesting bit:

    # 0.91s - CORE:match
    if( $l_line =~ m/$REGEXP_READONLY/ ) {
    # 0.75s - CORE:match
    if( $l_line =~ m/${\REGEXP_CONSTANT}/ ) {
    # 0.84s - CORE:match
    if( $l_line =~ m/999986/ ) {
    # 0.75s - CORE:match
    if( $l_line =~ m/$REGEXP_READONLY/o ) { # +/o
    # 0.78s - CORE:match
    if( $l_line =~ $l_search_r ) { # +qr//
    # 0.76s - CORE:match
    if( $l_line =~ m/$l_search/ ) {
    # 0.76s - CORE:match
    if( $l_line =~ m/$l_search/o ) { # +/o

    We can see that the difference in speed of the regex matching is "in the noise". Indeed, I can think of no reason why the speeds would be any different in practice and suspect that the differences reported actually are indeed just noise. You might want to move the order of the cases around and re-run and see how the noise moves with the order of execution and/or just moves randomly. There might be an insignificant difference that isn't noise in one of those cases, but I won't waste time chasing that until I see better evidence of this insignificant difference in speed not being noise.
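One way to re-run the comparison with less noise is to take the file reading out of the picture and let the core Benchmark module do the timing. The following is a rough sketch under that assumption: the in-memory `@lines` array, the shortened search string and the case names are all invented here, and the numbers it prints will differ per machine and per run.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# In-memory stand-in for the temp file, so getline() is not measured.
my @lines    = map { "Line number is = $_\n" } 0 .. 9_999;
my $search   = '9986';
my $search_r = qr/$search/;

# Each case counts matching lines using one style of match operator.
my %cases = (
    literal  => sub { my $n = 0; for (@lines) { $n++ if /9986/ }           $n },
    variable => sub { my $n = 0; for (@lines) { $n++ if /$search/ }        $n },
    o_flag   => sub { my $n = 0; for (@lines) { $n++ if /$search/o }       $n },
    qr_obj   => sub { my $n = 0; for (@lines) { $n++ if $_ =~ $search_r }  $n },
);

# All four must agree before the timings mean anything.
my %hits = map { $_ => $cases{$_}->() } keys %cases;

# Run each case for at least one CPU second and print a comparison table.
cmpthese(-1, \%cases);
```

Since `%cases` is a hash, the cases are not tied to the order they were written in, and running the script several times gives a feel for how much of the reported difference is just noise.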

    Now for the more interesting part:

    # 5.78s - CORE:regcomp
    if( $l_line =~ m/$REGEXP_READONLY/ ) {
    # 0.74s - CORE:regcomp
    if( $l_line =~ m/${\REGEXP_CONSTANT}/ ) {
    # 0.00s
    if( $l_line =~ m/999986/ ) {
    # 0.73s - CORE:regcomp
    if( $l_line =~ m/$REGEXP_READONLY/o ) { # +/o
    # 1.33s - CORE:regcomp
    if( $l_line =~ $l_search_r ) { # +qr//
    # 0.74s - CORE:regcomp
    if( $l_line =~ m/$l_search/ ) {
    # 0.69s - CORE:regcomp
    if( $l_line =~ m/$l_search/o ) { # +/o

    We see that the first case takes about 8x longer when calling regcomp() compared to most of the others. My theory is that, since magic is involved and FETCH() is re-called each time through the loop, a fresh copy of the read-only value gets handed to regcomp(), which is then forced to do the string comparison. It looks to me like none of the other cases even need to compare strings.
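The FETCH() part of that theory is easy to observe without the Readonly module. The sketch below uses a hand-rolled tied scalar (`CountingScalar` is a made-up class, standing in for the tied scalar that shows up as Readonly::Scalar::FETCH in the profile) and counts how often the pattern variable is read with and without /o. It says nothing about what regcomp() does internally.

```perl
use strict;
use warnings;

# Hypothetical minimal tied scalar that counts how often it is read.
package CountingScalar;
sub TIESCALAR { my ($class, $value) = @_; return bless { value => $value, fetches => 0 }, $class }
sub FETCH     { my ($self) = @_; $self->{fetches}++; return $self->{value} }
sub STORE     { die "read-only\n" }

package main;

my @lines = map { "Line number is = $_\n" } 0 .. 999;

tie my $plain_pat, 'CountingScalar', '986';
tie my $once_pat,  'CountingScalar', '986';

my ($plain_hits, $once_hits) = (0, 0);
for my $line (@lines) {
    $plain_hits++ if $line =~ /$plain_pat/;    # variable is read on every pass
    $once_hits++  if $line =~ /$once_pat/o;    # variable is read only at first use
}

my $plain_fetches = tied($plain_pat)->{fetches};
my $once_fetches  = tied($once_pat)->{fetches};

print "without /o: $plain_fetches FETCH calls for ", scalar(@lines), " lines\n";
print "with /o:    $once_fetches FETCH calls for ", scalar(@lines), " lines\n";
```

This lines up with the profile in the question, where the plain Readonly case shows one FETCH per line and the /o case shows only 2 FETCH calls in total.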

    This means that the differences between most of the other cases are so very, very tiny as to be extremely unlikely to be noticed in any real-world situation. They are differences between relatively short paths through some C code. In a Perl script, such minuscule run-times will be completely dwarfed by rather mundane stuff and so won't end up adding up to anything more than a tiny fraction of a real script's over-all run time.

    The m/999986/ is moderately interesting in that it demonstrates that the regex is actually compiled when the Perl code is compiled and Perl can completely avoid checking whether it needs to compile it again.

    The other cases show only differences that are, again, "in the noise".

    So there is no appreciable speed advantage to using /o. There are, however, significant disadvantages with regard to clarity of code and likelihood of introducing bugs.
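The kind of bug that /o invites can be sketched in a few lines. The two helper subs are made up for illustration; the behaviour shown is simply that a pattern with /o is interpolated once and then never looks at its variable again.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern is frozen at the first call because of /o.
sub match_with_o {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/o ? 1 : 0;
}

# Hypothetical helper: qr// builds a regex from whatever is passed in.
sub match_with_qr {
    my ($pattern, $text) = @_;
    my $re = qr/$pattern/;
    return $text =~ $re ? 1 : 0;
}

# Ask both helpers whether 'foo' matches /foo/ and then /bar/.
my @with_o  = map { match_with_o($_,  'foo') } qw( foo bar );
my @with_qr = map { match_with_qr($_, 'foo') } qw( foo bar );

print "with /o:  @with_o\n";     # with /o:  1 1
print "with qr//: @with_qr\n";   # with qr//: 1 0
```

The second /o call silently keeps matching against `foo` even though `bar` was passed in, which is exactly the sort of surprise that is hard to spot in a larger program.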

    It is unfortunate that you have shown that the use of qr// can approximately double the time taken in regcomp(). Of course, this time still adds up to a very tiny amount that is very unlikely to add up to anything that would be noticed in a real-world situation.

    Let's look at the source code (p5git://pp_ctl.c.) and see why. Search for the pp_regcomp function. And there we see the extra work that is required in the case of qr// including a link to why this extra work is unfortunate and will likely go away at some point in the future: http://www.nntp.perl.org/group/perl.perl5.porters/2007/03/msg122415.html.

    But, again, the slight speed penalty is very unlikely to be noticed outside of a benchmark and the benefit to code clarity and maintainability (of using qr//) makes this a very easy call for me to make for myself. I use qr//. I never use /o.

    (Updated first two sentences of 2nd paragraph to not make my theory sound like something I have verified completely.)

    - tye