Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

regex alternation

by Anonymous Monk
on Sep 30, 2011 at 10:02 UTC ( #928732=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

In the code below I want both strings in the test_array to match with the regex $test3.

This one works but I wonder if there is a more efficient way of doing it than the one I came up with? I ask because the regex will be run thousands of times and performance is a factor.

my $test3 = qr<^in directory '/directory/tmp(/\..*'|')$>; my @test_array = ( "in directory '/directory/tmp'", "in directory '/directory/tmp/.idBAA5KaaYWfb'" ); for my $line (@test_array) { print "TESTING <$line>\n"; if ($line =~ $test3) { print "MATCHED\n"; } else { print "NO MATCH\n"; } }

Comment on regex alternation
Download Code
Re: regex alternation
by jethro (Monsignor) on Sep 30, 2011 at 10:18 UTC
    One obvious improvement would be to use the regex modifier o to tell perl that the regex won't change.
Re: regex alternation
by moritz (Cardinal) on Sep 30, 2011 at 10:19 UTC

    You can try something along these lines to avoid invoking the regex engine:

    my $prefix = "in directory '/directory/tmp"; my $prefix_len = length $prefix; my $test3 = qr<^$prefix(/\..*'|')$>; for my $line (@test_array) { if (substr($line, 0, $prefix_len) eq $prefix && $line =~ $test3) { print "MATCHED\n"; } else { print "NO MATCH\n"; } }

    But you need to Benchmark it to find out if it's actually faster with your data.

Re: regex alternation
by keszler (Priest) on Sep 30, 2011 at 10:24 UTC

    Why capture the closing single-quote? If you don't need it the alternation is unnecessary overhead. qr<^in directory '/directory/tmp(/\..*)?'$> would be better in that case. Also, if you're not planning to do anything with $1 after the match then there's no need to capture at all: qr<^in directory '/directory/tmp(?:/\..*)?'$>

    For overall regex performance it's hard - for me anyway - to determine without knowing what the non-matching data looks like. In general you want the non-matching tests to fail as quickly as possible. (Your anchoring "in directory" to the start of the string is good for that.)

    update - moritz's prefix method is even better

Re: regex alternation
by AlexTape (Monk) on Sep 30, 2011 at 10:32 UTC
    at first it will be a bit faster if you dont use an alias for $_. perhaps try map to increase the speed.. im intrested too in this case.. probably there is a solution via hashing.. u need exactly match||no match prints? or you "only" need the matched lines? <= if so it can be done faster 4 sure..
    for(@test_array){print "MATCHED\n" if $_ =~ $test3 && next; print "NO MATCH\n";} #untested
    futhermore check:
    How do I optimize a regular expression? and
    Perl regexp matching is slow??

    $perlig =~ s/pec/cep/g if 'errors expected';
Re: regex alternation
by Anonymous Monk on Sep 30, 2011 at 10:50 UTC

    Hooray, make of this what you will, at your own peril

    And see also: Why does global match run faster than none global?, Multiple Regex evaluations or one big one?

    #!/usr/bin/perl -- use strict; use warnings; use Benchmark qw[ cmpthese ]; print "##########\n"; print "perl $] \n"; my @test_array = ( "in directory '/directory/tmp'", "in directory '/directory/tmp/.idBAA5KaaYWfb'", "in directory '/directory/not'", ( "in directory '/directory/notit'" ) x 2, ); cmpthese -3, { a_ => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(/\..*'|')$>; for (@test_array) { $match++ if $_ =~ $test3; } }, a => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(/\..*'|')$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, b => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(?:/\..*'|')$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, c => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(?:'|/\..*')$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, f => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp(?:/\..*)?'$>; for my $line (@test_array) { $match++ if $line =~ $test3; } }, g => sub { my $match = 0; my $prefix = q<in directory '/directory/tmp>; my $prefix_len = length $prefix; my $test3 = qr<^$prefix(/\..*'|')$>; for my $line (@test_array) { $match++ if substr($line, 0, $prefix_len) eq $prefix and $line =~ $test3 } }, d => sub { my $match = 0; my $test3 = qr<^in directory '/directory/tmp'$>; my $test4 = qr<^in directory '/directory/tmp/\..*'$>; for my $line (@test_array) { $match++ if $line =~ $test3 or $line =~ $test4; } }, e => sub { my $match = 0; my $test3 = q<in directory '/directory/tmp'>; my $test4 = qr<^in directory '/directory/tmp/\..*'$>; for my $line (@test_array) { $match++ if $line eq $test3 or $line =~ $test4; } }, }; __END__ ########## perl 5.006001 Benchmark: running a, a_, b, c, d, e, f, g, each for at least 3 CPU se +conds... a: 3 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 19 +0674.36/s (n=577934) a_: 4 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 19 +0674.36/s (n=577934) b: 4 wallclock secs ( 3.22 usr + 0.00 sys = 3.22 CPU) @ 20 +6861.45/s (n=665887) c: 3 wallclock secs ( 3.06 usr + 0.00 sys = 3.06 CPU) @ 21 +3192.95/s (n=653010) d: 3 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 15 +6451.74/s (n=486252) e: 4 wallclock secs ( 3.25 usr + 0.00 sys = 3.25 CPU) @ 21 +6385.04/s (n=703035) f: 4 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 19 +4816.72/s (n=605880) g: 3 wallclock secs ( 3.22 usr + 0.00 sys = 3.22 CPU) @ 14 +8028.89/s (n=476505) Rate g d a_ a f b c e g 148029/s -- -5% -22% -22% -24% -28% -31% -32% d 156452/s 6% -- -18% -18% -20% -24% -27% -28% a_ 190674/s 29% 22% -- -0% -2% -8% -11% -12% a 190674/s 29% 22% 0% -- -2% -8% -11% -12% f 194817/s 32% 25% 2% 2% -- -6% -9% -10% b 206861/s 40% 32% 8% 8% 6% -- -3% -4% c 213193/s 44% 36% 12% 12% 9% 3% -- -1% e 216385/s 46% 38% 13% 13% 11% 5% 1% -- ########## perl 5.014001 Rate d a_ a f b c g e d 49781/s -- -31% -31% -34% -35% -38% -38% -46% a_ 71676/s 44% -- -1% -4% -6% -10% -11% -22% a 72090/s 45% 1% -- -4% -6% -10% -10% -21% f 75019/s 51% 5% 4% -- -2% -6% -6% -18% b 76633/s 54% 7% 6% 2% -- -4% -4% -16% c 79672/s 60% 11% 11% 6% 4% -- -1% -13% g 80135/s 61% 12% 11% 7% 5% 1% -- -13% e 91585/s 84% 28% 27% 22% 20% 15% 14% --
      Interesting... Is that a debug build of 5.14? Because those rates are WAAAY slower than you got for 5.6.

      Mike

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://928732]
Approved by Perlbotics
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (16)
As of 2014-04-16 14:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (430 votes), past polls