Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Question about regex performance

by ted.byers (Scribe)
on Jun 17, 2014 at 22:06 UTC ( #1090217=perlquestion: print w/ replies, xml ) Need Help??
ted.byers has asked for the wisdom of the Perl Monks concerning the following question:

Please consider the following two programs (both of which work fine - suggestions for better function names welcome). The first replaces redundant white space characters with a single white space character.

#!/usr/bin/perl -d:NYTProf use warnings; use strict; use String::Util 'trim'; use Benchmark qw(cmpthese timethese); cmpthese( -30, { compress_1 => q|compress_1(' Mary had a little lamb. ' +);|, compress_2 => q|compress_2(' Mary had a little lamb. ' +);|, compress_3 => q|compress_3(' Mary had a little lamb. ' +);|, squash => q|squash(' Mary had a little lamb. ');|, split_join => q|split_join(' Mary had a little lamb. ' +);|, } ); print "'compress_1' => '",compress_1(' Mary had a little la +mb. '),"'\n"; print "'compress_2' => '",compress_2(' Mary had a little la +mb. '),"'\n"; print "'compress_3' => '",compress_3(' Mary had a little la +mb. '),"'\n"; print "'squash' => '",squash(' Mary had a little lamb. ')," +'\n"; print "'split_join' => '",split_join(' Mary had a little la +mb. '),"'\n"; exit; sub compress_1 { my $string = shift; $string =~ s/ +/ /g; return $string; } sub compress_2 { my $string = shift; $string =~ s/\h+/ /g; return $string; } sub compress_3 { my $string = shift; $string =~ s/ {1,}/ /g; return $string; } sub squash { my $string = shift; $string =~ tr/ //s; return $string; } sub split_join { my $string = shift; $string = join ' ', split ' ', $string; return $string; }

The next trims leading and trailing whitespace.

#!/usr/bin/perl -d:NYTProf use warnings; use strict; use String::Util 'trim'; use Benchmark qw(cmpthese timethese); cmpthese( -30, { 'double_star' => q|double_star(' Mary had a little lamb. ');|, 'double_plus' => q|double_plus(' Mary had a little lamb. ');|, 'double_plus2' => q|double_plus(' Mary had a little lamb. Mar +y had a little lamb. Mary had a little lamb. Mary had a little lamb +. Mary had a little lamb. Mary had a little lamb. Mary had a littl +e lamb. Mary had a little lamb. Mary had a little lamb. Mary had a + little lamb. Mary had a little lamb. Mary had a little lamb. Mary + had a little lamb. Mary had a little lamb. ');|, 'replace' => q|replace( ' Mary had a little lamb. ');|, 'for_star' => q|for_star( ' Mary had a little lamb. ');|, 'for_plus' => q|for_plus( ' Mary had a little lamb. ');|, 'regex_or' => q|regex_or( ' Mary had a little lamb. ');|, 'one_liner' => q|one_liner( ' Mary had a little lamb. ');|, 'trim' => q|trim( ' Mary had a little lamb. ');|, } ); print "'trim' => '",trim(' Mary had a little lamb. '),"'\n"; print "'double_star' => '",double_star(' Mary had a little lamb. '),"' +\n"; print "'double_plus' => '",double_plus(' Mary had a little lamb. '),"' +\n"; print "'double_plus2' => '",double_plus(' Mary had a little lamb. Mar +y had a little lamb. Mary had a little lamb. Mary had a little lamb +. Mary had a little lamb. Mary had a little lamb. Mary had a littl +e lamb. Mary had a little lamb. Mary had a little lamb. Mary had a + little lamb. Mary had a little lamb. Mary had a little lamb. Mary + had a little lamb. Mary had a little lamb. '),"'\n"; print "'replace' => '",replace( ' Mary had a little lamb. '),"'\n"; print "'for_star' => '",for_star( ' Mary had a little lamb. '),"'\n"; print "'for_plus' => '",for_plus( ' Mary had a little lamb. '),"'\n"; print "'regex_or' => '",regex_or( ' Mary had a little lamb. '),"'\n"; print "'one_liner' => '",one_liner( ' Mary had a little lamb. '),"'\n" +; exit; sub one_liner { my $string = shift; # $string =~ s/^\ *([A-Z,a-z,0-9]*)\ *$/$1/g; $string =~ s/^\s+|\s+$//g ; return $string; } sub double_star { my $string = shift; $string =~ s/^\s*//; $string =~ s/\s*$//; return $string; } sub double_plus { my $string = shift; $string =~ s/^\s+//; #remove leading spaces $string =~ s/\s+$//; #remove trailing spaces return $string; } sub replace { my $string = shift; $string =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/; return $string; } sub for_star { my $string = shift; for ($string) { s/^\s+//; s/\s+$//; } return $string; } sub for_plus { my $string = shift; for ($string) { s/^\s*//; s/\s*$//; } return $string; } sub regex_or { my $string = shift; $string =~ s/(?:^ +)||(?: +$)//g; return $string; }

And here is what I get when I execute these programs.

ted@linux-jp04:~/Work/Projects/misc.tests> ./compress.multiple.spaces. +to.single.space.pl Rate compress_3 compress_1 compress_2 split_join sq +uash compress_3 135174/s -- -2% -6% -34% +-45% compress_1 137798/s 2% -- -4% -33% +-44% compress_2 143178/s 6% 4% -- -30% +-42% split_join 205421/s 52% 49% 43% -- +-17% squash 247547/s 83% 80% 73% 21% + -- 'compress_1' => ' Mary had a little lamb. ' 'compress_2' => ' Mary had a little lamb. ' 'compress_3' => ' Mary had a little lamb. ' 'squash' => ' Mary had a little lamb. ' 'split_join' => 'Mary had a little lamb.' ted@linux-jp04:~/Work/Projects/misc.tests> ./trim.ws.pl Rate double_plus2 regex_or trim for_plus for_star dou +ble_star one_liner double_plus replace double_plus2 69971/s -- -5% -21% -28% -36% + -37% -43% -46% -46% regex_or 73562/s 5% -- -17% -24% -33% + -34% -40% -43% -44% trim 88942/s 27% 21% -- -8% -19% + -20% -27% -32% -32% for_plus 96591/s 38% 31% 9% -- -12% + -13% -21% -26% -26% for_star 109941/s 57% 49% 24% 14% -- + -1% -10% -16% -16% double_star 111060/s 59% 51% 25% 15% 1% + -- -9% -15% -15% one_liner 122651/s 75% 67% 38% 27% 12% + 10% -- -6% -6% double_plus 130149/s 86% 77% 46% 35% 18% + 17% 6% -- -0% replace 130236/s 86% 77% 46% 35% 18% + 17% 6% 0% -- + 'trim' => 'Mary had a little lamb.' + + 'double_star' => 'Mary had a little lamb.' + + 'double_plus' => 'Mary had a little lamb.' + + 'double_plus2' => 'Mary had a little lamb. Mary had a little lamb. M +ary had a little lamb. Mary had a little lamb. Mary had a little la +mb. Mary had a little lamb. Mary had a little lamb. Mary had a lit +tle lamb. Mary had a little lamb. Mary had a little lamb. Mary had + a little lamb. Mary had a little lamb. Mary had a little lamb. Ma +ry had a little lamb.' + + 'replace' => 'Mary had a little lamb.' + + 'for_star' => 'Mary had a little lamb.' + + 'for_plus' => 'Mary had a little lamb.' + + 'regex_or' => 'Mary had a little lamb.' + + 'one_liner' => 'Mary had a little lamb.' + + ted@linux-jp04:~/Work/Projects/misc.tests> + +

First, I would like to understand the differences in performance among these regular expressions. I realize that the specific numbers will depend on the hardware being used and it's load, but I am interested in the ranking (aside from the obvious that applying the functions to a much longer string will impact these numbers). And, related to this, do they scale differently, or will I get the same ranking of functions regardless of the length of string? Second, I am curious as to why the split/join approach is so much faster than the fastest regular expression. Thirdly, I can see that if I want to both trim leading and trailing white space AND compress sequences of white space characters by a single space, I can use the split/join algorithm, but what about combining the regular expressions? I have included ONLY those regular expressions and algorithms that I found on the web, and one or two I came up with, and tested to work as advertised, but are there other regular expressions and/or functions that will serve one or the other or both functional requirements that would be faster still?

Thanks

ted

Why do I get this ridiculous splitting of my lines of code, so that the code begins at the far left of my screen and stops only a quarter of the way across my screen, and is there a way to stop that?

Comment on Question about regex performance
Select or Download Code
Re: Question about regex performance
by Anonymous Monk on Jun 17, 2014 at 22:13 UTC
Re: Question about regex performance
by choroba (Abbot) on Jun 17, 2014 at 22:20 UTC
    regex_or seems wrong. It uses space, while the other regexes use \s. It also groups for no reason, and || contains an empty regex in the middle. I'd change it to
    s/^\s+|\s+$//g
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thanks.

      Define 'wrong', please. Is it wrong if it gives the right answer? Can you provide a test case that makes it fail? I do like test driven development since the tests define what is right or wrong (as long as they are carefully constructed, and defined by the functional requirements). My usual approach is to make the code right, first, as defined by the suite of unit and integration tests, and then make it fast (hence my search for alternate ways of doing the same thing). In some respects, though, regex_or is one I found, and it remains one I am not entirely sure I understand how or why it works. Regex are not my forte: I am more a quant, working with numeric methods: when working up regular expressions, I still have to have a reference for them open while I try to construct them.

      I do like the approach you say you'd replace regex_or with, but that is one_liner. And that is still about 10% slower than both replace and double_plus, as is double_star. Do you have any thoughts as to why one_liner and double_star are about the same speed, why replace and double_plus are about the same speed, and why the first two are about 10% slower than the latter two? More importantly, is there a way to have code that verifies any conjecture as to why, or does such an explanation have to rest on arguments about how regex work?

      Thanks again

      Ted

        OK, "wrong" was probably a too strong word (I'm not a native speaker of English). But compare the following two expressions, both being "right" from the test driven perspective:
        # 1 $x = 'a' . $y . $z->[1]; # 2 $x = join q(), scalar(chr 97), $y, (@$z)[1], '';
        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Question about regex performance
by Anonymous Monk on Jun 17, 2014 at 22:56 UTC
Re: Question about regex performance (too small)
by tye (Cardinal) on Jun 18, 2014 at 02:19 UTC

    The results mostly show that you are concentrating on the wrong things.

    Even "69971/s" vs. "130236/s", although conceptually "twice as fast", is unlikely to have other than trivial impact on the time required to run a useful script. Things that run that quickly are just hard to make add up to much of the run time of a useful script (in particular because Benchmark.pm "subtracts overhead" which makes such measurements rather distant from practical reality).

    So, you've done quite a bit of work to determine that, if you write a script that does something useful that still mostly consists of trimming whitespace from even tons of strings of about that length, then which method you chose won't have much impact.

    If you re-run the benchmark with much, much longer strings, then you'll likely get results that have more practical meaning (as the "overhead" is relatively less significant so the misguided "subtraction" process is unlikely to drastically change the results).

    This makes sense as the usual way that you notice that a regex is excessively inefficient is because you ran it against particularly long strings, not because you ran it against a huge number of very short strings, again, because the overhead tends to make the inefficiency have a much smaller impact on the total run time. But also because one of the biggest ways to make a regex very inefficient is "bad backtracking" which scales linearly with the number of string but can scale O(n**2) or much worse with the string length.

    But if you aren't going to be running these regexes against really long strings, then none of that actually matters.

    If you are actually trying to make some script run faster, then, instead of picking random small operations and trying dozens of variations hoping to find "the fastest", you should profile the code to see what (if anything) is actually taking up any significant part of the run time. And/or looking at the larger structure, where optimizations actually have a chance of having significant impact on the total run time.

    If you are just hoping to find "the best way" with Benchmark.pm results being a significant part of your grading process, then I also think you are paying too much attention to the wrong things. A construct that is 5% faster to write or understand or get correct is likely to have a much bigger impact than whether the computer can run it 20% faster, IME, especially since such constructs so rarely add up to more than a small part of the total run time.

    Also, lots of times there is no one "fastest" approach. For example, it is pretty common for method X to be faster than method Y when a string has tons of extra whitespace while method Y is faster when the string has no extra whitespace.

    So, if I were bored, I'd have tried a variation that does nothing when nothing needs to be done, like: s/ {2,}/ /g. I'm sure you can come up with at least half a dozen variations on that to add to your quest. :)

    - tye        

      Thanks.

      You are absolutely right. The fact is that if I were working on improving the speed of production code, I would begin with profiling. In THAT activity, there is a balance of costs between how quickly code can be developed versus how fast the code can be. I have seen cases where the fastest code is extremely hard for junior or even intermediate coders to understand. Sometime it is necessary to implement that anyway because the cost in lost time due to slower algorithms is much greater. But sometimes, the simpler, slower code is desired because it can be implemented quickly by your least experienced staff.

      Alas, you missed the point of the exercise. My objective is to understand these regular expressions better. I rarely develop them, and when I do, I must have the manual for regular expressions open, in order to figure out how to develop one that meets my needs. I do not need such assistance when I deal with solving systems of linear equations, numeric quadrature, or statistical analysis. I am, in a sense, pushing myself out of my comfort zone. In this context, the benchmark scripts I show are mere devices to provide one way of evaluating the merits of the different algorithms I found or developed. And, while I do intend to modify them to work with longer strings, what I was especially hoping for is some insight into the reasons for the difference in performance and how to combine the regular expressions that trim leading and trailing white space with those that eliminate redundant white space characters within strings; or if such a combination even makes sense.

      Thanks

      Ted

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1090217]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2014-09-16 01:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (155 votes), past polls