Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

regex alternation

by Anonymous Monk
on Sep 30, 2011 at 10:02 UTC ( #928732=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

In the code below I want both strings in the test_array to match with the regex $test3.

This one works but I wonder if there is a more efficient way of doing it than the one I came up with? I ask because the regex will be run thousands of times and performance is a factor.

my $test3 = qr<^in directory '/directory/tmp(/\..*'|')$>; my @test_array = ( "in directory '/directory/tmp'", "in directory '/directory/tmp/.idBAA5KaaYWfb'" ); for my $line (@test_array) { print "TESTING <$line>\n"; if ($line =~ $test3) { print "MATCHED\n"; } else { print "NO MATCH\n"; } }

Comment on regex alternation
Download Code
Re: regex alternation
by jethro (Monsignor) on Sep 30, 2011 at 10:18 UTC
    One obvious improvement would be to use the regex modifier o to tell perl that the regex won't change.
Re: regex alternation
by moritz (Cardinal) on Sep 30, 2011 at 10:19 UTC

    You can try something along these lines to avoid invoking the regex engine:

    my $prefix = "in directory '/directory/tmp"; my $prefix_len = length $prefix; my $test3 = qr<^$prefix(/\..*'|')$>; for my $line (@test_array) { if (substr($line, 0, $prefix_len) eq $prefix && $line =~ $test3) { print "MATCHED\n"; } else { print "NO MATCH\n"; } }

    But you need to Benchmark it to find out if it's actually faster with your data.

Re: regex alternation
by keszler (Priest) on Sep 30, 2011 at 10:24 UTC

    Why capture the closing single-quote? If you don't need it the alternation is unnecessary overhead. qr<^in directory '/directory/tmp(/\..*)?'$> would be better in that case. Also, if you're not planning to do anything with $1 after the match then there's no need to capture at all: qr<^in directory '/directory/tmp(?:/\..*)?'$>

    For overall regex performance it's hard - for me anyway - to determine without knowing what the non-matching data looks like. In general you want the non-matching tests to fail as quickly as possible. (Your anchoring "in directory" to the start of the string is good for that.)

    update - moritz's prefix method is even better

Re: regex alternation
by AlexTape (Monk) on Sep 30, 2011 at 10:32 UTC
    at first it will be a bit faster if you dont use an alias for $_. perhaps try map to increase the speed.. im intrested too in this case.. probably there is a solution via hashing.. u need exactly match||no match prints? or you "only" need the matched lines? <= if so it can be done faster 4 sure..
    for(@test_array){print "MATCHED\n" if $_ =~ $test3 && next; print "NO MATCH\n";} #untested
    futhermore check:
    How do I optimize a regular expression? and
    Perl regexp matching is slow??

    $perlig =~ s/pec/cep/g if 'errors expected';
Re: regex alternation
by Anonymous Monk on Sep 30, 2011 at 10:50 UTC

    Hooray, make of this what you will, at your own peril

    And see also: Why does global match run faster than none global?, Multiple Regex evaluations or one big one?

    #!/usr/bin/perl -- use strict; use warnings; use Benchmark qw[ cmpthese ]; print "##########\n"; print "perl $] \n"; my @test_array = ( "in directory '/directory/tmp'", "in directory '/directory/tmp/.idBAA5KaaYWfb'", "in directory '/directory/not'", ( "in directory '/directory/notit'" ) x 2, ); cmpthese -3, { a_ => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(/\..*'|')$>; for (@test_array) { $match++ if $_ =~ $test3; } }, a => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(/\..*'|')$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, b => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(?:/\..*'|')$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, c => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(?:'|/\..*')$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, f => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(?:/\..*)?'$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, g => sub { my $match = 0; my $prefix = q<in directory '/directory/tmp>; my $prefix_len = length $prefix; my $test3 = qr<^$prefix(/\..*'|')$>; for my $line (@test_array) { $match++ if substr($line, 0, $prefix_len) eq $prefix and $line =~ $test3 } }, d => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp'$>; my $test4 = qr<^in directory '/directory/tmp/\..*'$>; for my $line (@test_array) { $match++ if $line =~ $test3 or $line =~ $test4; } }, e => sub { my $match = 0; my $test3 = q<in directory '/directory/tmp'>; my $test4 = qr<^in directory '/directory/tmp/\..*'$>; for my $line (@test_array) { $match++ if $line eq $test3 or $line =~ $test4; } }, }; __END__ ########## perl 5.006001 Benchmark: running a, a_, b, c, d, e, f, g, each for at least 3 CPU se +conds... a: 3 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 19 +0674.36/s (n=577934) a_: 4 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 19 +0674.36/s (n=577934) b: 4 wallclock secs ( 3.22 usr + 0.00 sys = 3.22 CPU) @ 20 +6861.45/s (n=665887) c: 3 wallclock secs ( 3.06 usr + 0.00 sys = 3.06 CPU) @ 21 +3192.95/s (n=653010) d: 3 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 15 +6451.74/s (n=486252) e: 4 wallclock secs ( 3.25 usr + 0.00 sys = 3.25 CPU) @ 21 +6385.04/s (n=703035) f: 4 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 19 +4816.72/s (n=605880) g: 3 wallclock secs ( 3.22 usr + 0.00 sys = 3.22 CPU) @ 14 +8028.89/s (n=476505) Rate g d a_ a f b c e g 148029/s -- -5% -22% -22% -24% -28% -31% -32% d 156452/s 6% -- -18% -18% -20% -24% -27% -28% a_ 190674/s 29% 22% -- -0% -2% -8% -11% -12% a 190674/s 29% 22% 0% -- -2% -8% -11% -12% f 194817/s 32% 25% 2% 2% -- -6% -9% -10% b 206861/s 40% 32% 8% 8% 6% -- -3% -4% c 213193/s 44% 36% 12% 12% 9% 3% -- -1% e 216385/s 46% 38% 13% 13% 11% 5% 1% -- ########## perl 5.014001 Rate d a_ a f b c g e d 49781/s -- -31% -31% -34% -35% -38% -38% -46% a_ 71676/s 44% -- -1% -4% -6% -10% -11% -22% a 72090/s 45% 1% -- -4% -6% -10% -10% -21% f 75019/s 51% 5% 4% -- -2% -6% -6% -18% b 76633/s 54% 7% 6% 2% -- -4% -4% -16% c 79672/s 60% 11% 11% 6% 4% -- -1% -13% g 80135/s 61% 12% 11% 7% 5% 1% -- -13% e 91585/s 84% 28% 27% 22% 20% 15% 14% --
      Interesting... Is that a debug build of 5.14? Because those rates are WAAAY slower than you got for 5.6.

      Mike

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://928732]
Approved by Perlbotics
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (17)
As of 2014-12-18 15:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (58 votes), past polls