Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

alternation in regexes: to use or to avoid?

by dk (Chaplain)
on Dec 10, 2012 at 14:42 UTC ( #1008105=perlquestion: print w/ replies, xml ) Need Help??
dk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I've hit a performance issue with using | in regexes. It seems that in some (not-so-degenerated, actually) cases it loses significantly to looping over simple regexes i.e. $str =~ /$_// for @rx is much faster than $str =~ /$rx[0]|$rx[1]|$rx[2]/, which is rather counter-intuitive. Basically, for general cases, it would mean that alterations with grouping should be avoided at all, which is a strong statement and I wouldn't like it that way.

Is this a recognized problem? Is it a problem at all? Does it look like it needs to be reported as a bug? I can't decide myself.

Here's the test code:

use strict; use warnings; use Benchmark qw(:all); my $str = 'a' x 100; my @matchwords = qw( aol aachen aaliyah aaron abbas abbasid abbott abby abdul abe abel abel +ard abelson aberdeen abernathy abidjan abigail abilene abner abraham abram abrams +absalom abuja abyssinia abyssinian ac acadia acapulco accra acevedo achaean ); my $q1s = join('|', map { "$_\\s*\\w+" } @matchwords ); my $q1 = qr/$q1s/; my @q2s = map { "$_\\s*\\w+" } @matchwords; my @q2 = map { qr/$_/ } @q2s; my $q3s = join('|', map { "($_)\\s*\\w+" } @matchwords ); my $q3 = qr/$q3s/; my @q4s = map { "($_)\\s*\\w+" } @matchwords; my @q4 = map { qr/$_/ } @q4s; timethese( 100000, { 'alternation, no grouping' => sub { $str =~ /$q1/; }, 'loop, no grouping' => sub { for my $qr ( @q2 ) { $str =~ /$qr/; } }, 'alternation, grouping' => sub { $str =~ /$q3/; }, 'loop, grouping' => sub { for my $qr ( @q4 ) { $str =~ /$qr/; } }, });

Here's the output:

Benchmark: timing 100000 iterations ... alternation, grouping: 12 wallclock secs (11.92 usr + 0.00 sys = 11.9 +2 CPU) @ 8389.26/s (n=100000) alternation, no grouping: 0 wallclock secs ( 0.19 usr + 0.00 sys = +0.19 CPU) @ 526315.79/s (n=100000) (warning: too few iterations for a reliable count) loop, grouping: 2 wallclock secs ( 1.33 usr + 0.00 sys = 1.33 CPU) +@ 75187.97/s (n=100000) loop, no grouping: 1 wallclock secs ( 1.33 usr + 0.00 sys = 1.33 CP +U) @ 75187.97/s (n=100000)

Update: got same results on perls 5.10.1, 5.16.0, and 5.17.6

Comment on alternation in regexes: to use or to avoid?
Select or Download Code
Re: alternation in regexes: to use or to avoid?
by space_monk (Chaplain) on Dec 10, 2012 at 14:55 UTC

    Trying to sound knowledgeable without knowing why or how you got the results you did, I would suspect that this is something that you can't rely on in every version of Perl, and possibly even system to system.

    A Monk aims to give answers to those who have none, and to learn from those who know more.
      Right, forgot that ... I tried it on 5.10, 5.16, and 5.17.6, with almost identical results. As to a system, I really doubt that it matters.
Re: alternation in regexes: to use or to avoid?
by Athanasius (Prior) on Dec 10, 2012 at 15:07 UTC

    Perhaps the following quote from the Camel Book will shed some light on this question:

    Short-circuit alternation is often faster than the corresponding regex. So:

    print if /one-hump/ || /two/;

    is likely to be faster than:

    print if /one-hump|two/;

    at least for certain values of one-hump and two. This is because the optimizer likes to hoist certain simple matching operations up into higher parts of the syntax tree and do very fast matching with a Boyer-Moore algorithm. A complicated pattern tends to defeat this.
    — Tom Christiansen, brian d foy & Larry Wall with Jon Orwant, Programming Perl (4th Edition, 2012), p. 692.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Not really, because it says:

      A complicated pattern tends to defeat this.

      and i'm seeing exactly the opposite. I wish Tom would comment on that :) But thank you for the quote, it helps with understanding why I think that the observed behavior is bad.

        Perhaps read "complicated" as "non-trivial", EG: having alternations
Re: alternation in regexes: to use or to avoid?
by ww (Bishop) on Dec 10, 2012 at 15:27 UTC
    1. "Is this a recognized problem? Yes, see Athanasius', above.
    2. "Is it a problem at all? Yes, but not one that's apt to be resolved other than by careful choice among the alternate approaches.
    3. "Does it look like it needs to be reported as a bug? No; see 1 above
    4. Does this "mean that alterations with grouping should be avoided at all...?" Definitely not; sometimes the difference in speed is too small to make any difference; sometimes the clarity of one approach clearly outweighs any other issues; and sometimes other factors, like personal taste, can be allowed to determine. Just be sure to think carefully about which applies.
Re: alternation in regexes: to use or to avoid?
by RichardK (Priest) on Dec 10, 2012 at 15:31 UTC

    It's not clear to me what you are trying to achieve with your regex.

    The simple grouping that look like this

    aol\s*\w+|aachen\s*\w+|aaliyah\s*\w+|.....

    runs quickly, it's only the one with lots of capture groups that is slow. i.e

    (aol)\s*\w+|(aachen)\s*\w+|(aaliyah)\s*\w+|....

    So maybe there's just a better way to get the result you want, if you'd care to explain what that is?

      (I work with dk.)

      Another question could be: why is the one with the capture groups so slow, since none of the words match the string?

      And in general, why is alternation&capture so much slower than looping&capture + alternation combined?

      The reason for the code is to replace code with 60 or so similarly structured regexes in a library used by a couple of legacy applications with an automatically generated regex generated with info from configuration files, for both (potential) performance gains, allowing different behaviour across applications, and definite maintainability gains. The strings replaced all have the structure \bFOO:\s*bar(\d+) or \bBAZ:\s*(\w+) etc.

      Suggestions like "Well, don't do that" are likely to go unheard :-)

        OK then, If you want to use a non-optimal solution for operational reasons, go right ahead :-)

      Added to balker's response, it's not that we're trying to achieve, we know other means how to get where we want to, but it's about the principle I've long nourished, (see Anastasius's quote above), and now it doesn't hold water. What i'd love to see, an explanation of someone who knows why regex algorithm exhibits behavior that is CONTRARY to perl lore.
Re: alternation in regexes: to use or to avoid?
by Anonymous Monk on Dec 10, 2012 at 16:44 UTC
    The amount of time the regex takes to execute is insignificant in the long run. What matters is how much I/O the program does or doesn't do, and that includes virtual-memory. Process the data in reasonably sized chunks, applying whatever you might know about which test is most likely to succeed first. Make the whole thing easy to maintain. Don't sweat nanoseconds when it's milliseconds that matter.
      Thank you, but your statement is contradicting itself. We're seeing milliseconds wasted in the regex (for a web-app!), which is why we bothered to examine why in the first place.
Re: alternation in regexes: to use or to avoid?
by dave_the_m (Parson) on Dec 10, 2012 at 17:08 UTC
    Your problem is that you are including the \s*\w+ and captures within the alternation. Move them outside and you'll find the alternations are suddenly much faster than the loops. This is because alternations containing just fixed strings can be much better optimised (using tries). With the following changes:

    my $q1s = join('|', @matchwords); my $q1 = qr/$q1s\s*\w+/; my $q3s = join('|', @matchwords); my $q3 = qr/($q3s)\\s*\\w+/;
    and setting the benchmark count 10x larger, I get on 5.17.6:
    alternation, grouping: 1 wallclock secs ( 0.58 usr + 0.00 sys = 0.5 +8 CPU) @ 1724137.93/s (n=1000000) alternation, no grouping: 5 wallclock secs ( 4.85 usr + 0.00 sys = +4.85 CPU) @ 206185.57/s (n=1000000) loop, grouping: 24 wallclock secs (23.33 usr + 0.00 sys = 23.33 CPU) +@ 42863.27/s (n=1000000) loop, no grouping: 23 wallclock secs (22.57 usr + 0.00 sys = 22.57 CP +U) @ 44306.60/s (n=1000000)

    Dave.

      But that doesn't leave room for the various match-words to have different "\w+"'s - e.g. /\bFOO:\s*bar(\d+)/ or /\bBAZ:\s*(\w+)/

      I guess I'm wondering why the trie-optimization isn't used for fixed string prefixes as well (as I'd assumed before I wrote the code).

        Actually I stand corrected: the fixed string prefixes are collected together into a trie where there are (possibly differing) wildcard suffixes; the killer is the individual captures, which disables the trie optimisation.

        So, alternation is the fastest, as long as you put any captures outside the alt.

        Dave.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1008105]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (5)
As of 2014-07-10 05:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (199 votes), past polls