Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Unefficient Regexp 'Matching this or that'

by pelagic (Curate)
on May 19, 2010 at 08:20 UTC ( #840634=perlquestion: print w/ replies, xml ) Need Help??
pelagic has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

I got this case of an unefficient regexp handling when matching strings in a large file:

To look for ONE string takes 2 seconds
while looking for TWO strings takes 79 seconds. Here is the code:

use strict; use Benchmark; my $file = shift || 'no_file'; timethese( 1, { 'one_string' => sub { one_string() }, 'two_string' => sub { two_string() }, } ); sub one_string { my $filter = '00901808'; my $re = qr/$filter/o; my @matched; open (my $FH, "<$file"); while (my $rec = <$FH>) { if ( $rec =~ $re) { push @matched, $rec; } } close $FH; } sub two_string { my $filter = '00901808|87654321'; my $re = qr/$filter/o; my @matched; open (my $FH, "<$file"); while (my $rec = <$FH>) { if ( $rec =~ $re) { push @matched, $rec; } } close $FH; } __END__
The result says:
# perl bench_regexp 100000lines.92MB.file Benchmark: timing 1 iterations of one_string, two_string... one_string: 2 wallclock secs ( 1.68 usr + 0.42 sys = 2.10 CPU) @ 0 +.48/s (n=1) (warning: too few iterations for a reliable count) two_string: 77 wallclock secs (76.13 usr + 0.59 sys = 76.72 CPU) @ 0 +.01/s (n=1) (warning: too few iterations for a reliable count)

pelagic

Comment on Unefficient Regexp 'Matching this or that'
Select or Download Code
Re: Unefficient Regexp 'Matching this or that'
by moritz (Cardinal) on May 19, 2010 at 08:35 UTC

    Try also to run them in reversed order - maybe reading the file is slow the first time, and much faster then second time because the file is then in some cache from the operating system.

    Also please compare the number of matches found in both cases - is it comparable?

    Perl 6 - links to (nearly) everything that is Perl 6.
      I actually tried it in reverse order also by calling the tests:
      01_two_string 02_one_string
      And the number of matches is the same, I just stripped this away to make the code smaller.

      pelagic
        Another idea: The regex engine has a heavy optimization for looking for literal substrings, using the same algorithm that index uses.

        In the case of one literal that can be used, but I don't think it's used in alternations.

        You can test this hypothesis by comparing regexes of two and three alternatives - if I'm right, then both should be roughly equal in speed, and much slower than a single, literal string.

Re: Unefficient Regexp 'Matching this or that'
by bart (Canon) on May 19, 2010 at 08:55 UTC
    Alternatives are slow. Jeffrey Friedl wrote about it in his book about regexes, and your benchmark proves it.

    I expect an improvement if you just add a common lookahead:

    $filter = qr/(?=[08])(?:00901808|87654321)/;
    but I have no way to test it without your data file.

    See also the modules Regex::PreSuf, Regexp::List and Regexp::Assemble for simpler solutions, as they can build a (hopefully) more efficient regex for you.

    p.s. I just found this blog post about just this situation.

    Alternation is usually slow in Perl because the engine has to backtrack when trying each alternative. It's much faster to give perl a character sieve up front, e.g. (?=cdb) and then factor out common prefixes and suffixes. The problem is that when you have a ton of alternatives, doing all this is a pain and it decreases readability to almost zero. Which is why I had avoided it to date...
      Here goes my test for your suggestion:
      Benchmark: timing 3 iterations of 20.two_string, 30.two_string_lookahe +ad... 20.two_string: 236 wallclock secs (234.31 usr + 1.62 sys = 235.93 CPU +) @ 0.01/s (n=3) 30.two_string_lookahead: 178 wallclock secs (176.23 usr + 1.52 sys = +177.75 CPU) @ 0.02/s (n=3)
      So it's better, but still not linear ...

      pelagic
Re: Unefficient Regexp 'Matching this or that'
by BrowserUk (Pope) on May 19, 2010 at 10:11 UTC

    You might consider using separate passes for each string to be found (or a loop over them). The following, with values suited to a large file I had kicking around, ran substantially faster than the alernation regex:

    sub two_pass { my @matched; open (my $FH, "<$file"); while (my $rec = <$FH>) { if ( $rec =~ '500_000' ) { push @matched, $rec; } if ( $rec =~ '500_001' ) { push @matched, $rec; } } print "two pass: " . @matched; close $FH; }

    Actually, it consistently runs faster than the single string sub, even if I force it to run first, last or in the middle. Whatever position, the timings of all three methods remain very consistent I don't have an explanation for that? I've noticed in the past that with the 5.10+ trie optimisation, alternation regexes are sometimes slower than they are under 5.8. There seems to be a "sweet spot" where it comes into its own: neither too few nor too many alternations and it flies, but below or above those limits and it actually takes longer.

    For example, on 5.10, I get these (quite surprising) results consistently:

    c:\test>junk31 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 6 wallclock secs ( 5.69 usr + 0.78 sys = 6.47 CPU) @ 0 +.15/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 15 wallclock secs (14.34 usr + 0.73 sys = 15.07 CPU) @ + 0.07/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 15 wallclock secs (14.76 usr + 0.76 sys = 15.52 CPU) @ + 0.06/s (n=1) (warning: too few iterations for a reliable count)

    But on 5.8 I get these rather more understandable results:

    c:\test>\perl32\bin\perl junk31.pl 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 8 wallclock secs ( 6.60 usr + 0.69 sys = 7.29 CPU) @ 0 +.14/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 53 wallclock secs (52.99 usr + 0.70 sys = 53.70 CPU) @ + 0.02/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 7 wallclock secs ( 5.60 usr + 0.92 sys = 6.52 CPU) @ + 0.15/s (n=1) (warning: too few iterations for a reliable count)

    Maybe it shoudl be possible to disable the trie optimisation for thos cases of alternation where it doesn't benefit.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Maybe it shoudl be possible to disable the trie optimisation for thos cases of alternation where it doesn't benefit.
      Maybe you can disable trie optimizations by setting ${^RE_TRIE_MAXBUF} = 0?
      Perl 6 - links to (nearly) everything that is Perl 6.

        That seems to work. !5 seconds instead of 53 for the alternation is much more reasonable. Still doesn't beat the two pass method, but there is probably a break point somewhere.

        c:\test>junk31 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 7 wallclock secs ( 5.60 usr + 0.83 sys = 6.43 CPU) @ 0 +.16/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 15 wallclock secs (14.35 usr + 0.94 sys = 15.29 CPU) @ + 0.07/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 16 wallclock secs (14.76 usr + 0.73 sys = 15.49 CPU) @ + 0.06/s (n=1) (warning: too few iterations for a reliable count)

        I'm still puzzled why the two pass takes less time than the one string, but the numbers are right and I can't anything major wrong with the benchmark?

Re: Unefficient Regexp 'Matching this or that'
by rovf (Priest) on May 19, 2010 at 10:41 UTC
    If you change in the one_string version the filter from '00901808' to '87654321', does the execution time stay the same?

    Also, if you replace the usage of regexps by a call to the index function (which in your case should be the same), and then compare the one_string with the two_string version, how does the time increase in the second case?

    -- 
    Ronald Fischer <ynnor@mm.st>
Re: Unefficient Regexp 'Matching this or that'
by JavaFan (Canon) on May 19, 2010 at 16:35 UTC
    What version of perl are you using? I think this case got a heavy speed up in 5.10. How does your data look like? How many matches are there in each case? Can it be that the optimizer decides "no match" for the 00901808 case, while it cannot do that for 87654321? How does /00901808/ || /87654321/ perform?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://840634]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (9)
As of 2014-10-22 12:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (117 votes), past polls