Re: More than one pattern match using grep on a file

by choroba (Abbot)
on Apr 04, 2014 at 13:31 UTC (#1081134)


in reply to More than one pattern match using grep on a file

Instead of looping over the patterns, create one large pattern by

my $large_pattern = join '|', @PatternList;

To grep an array for matching members, just do

my @matches = grep /$large_pattern/, @array;
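A minimal, self-contained sketch of the two lines above. The sample data is hypothetical; the quotemeta variant is an assumption about what you might want if your patterns are meant as literal strings rather than regexes:

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the original question.
my @PatternList = ('foo', 'ba[rz]', 'a.c');
my @array       = ('foo fighters', 'bazaar', 'abc', 'a.c', 'nothing here');

# Patterns treated as regexes: join with '|' and compile once with qr//.
# The (?:...) group keeps the alternation self-contained if the regex is
# later embedded in a bigger pattern.
my $large_pattern = join '|', @PatternList;
my $re            = qr/(?:$large_pattern)/;
my @matches       = grep { /$re/ } @array;

# Patterns treated as literal strings: escape metacharacters first.
my $literal         = join '|', map { quotemeta($_) } @PatternList;
my @literal_matches = grep { /$literal/ } @array;

print "regex:   @matches\n";
print "literal: @literal_matches\n";
```

With the regex reading, 'abc' matches because '.' matches any character; with the literal reading, only the string 'a.c' itself does.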
لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re^2: More than one pattern match using grep on a file
by bigj (Monk) on Apr 04, 2014 at 14:05 UTC
    It might also be worth adding the /o modifier to the regexp as a speed optimization: the regexp is then compiled just once, which helps with large @arrays or @pattern_lists.

    Greetings,
    Janek

      IMHO, this is actually bad advice. /o can cause some confusing bugs and is generally a case of premature optimization. If I run the benchmark code:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark 'cmpthese';

      local $" = '|';
      my $target    = join '', map chr(97 + rand 26), 1 .. 100000;
      my @patterns  = map {join '', map chr(97 + rand 26), 1 .. 5 } 1 .. 100;
      my @res       = map qr/$_/, @patterns;
      my $whole_pat = "@patterns";
      my $whole_re  = qr/@patterns/;

      cmpthese(-5, {
          'inline'      => sub {$target =~ /@patterns/},
          'inline-o'    => sub {$target =~ /@patterns/o},
          'grep_str'    => sub {return 1 if grep $target =~ $_, @patterns},
          'grep_RE'     => sub {return 1 if grep $target =~ $_, @res},
          'whole_pat'   => sub {$target =~ /$whole_pat/},
          'whole_pat-o' => sub {$target =~ /$whole_pat/o},
          'whole_re'    => sub {$target =~ $whole_re},
      });
      two sample outputs I get (unstable, given rand) are
                     Rate grep_str grep_RE inline inline-o whole_pat-o whole_pat whole_re
      grep_str     96.6/s       --     -2%   -67%     -67%        -67%      -67%     -67%
      grep_RE      99.1/s       3%      --   -67%     -67%        -67%      -67%     -67%
      inline        296/s     207%    199%     --      -0%         -0%       -0%      -0%
      inline-o      296/s     207%    199%     0%       --         -0%       -0%      -0%
      whole_pat-o   297/s     207%    199%     0%       0%          --       -0%      -0%
      whole_pat     297/s     207%    200%     0%       0%          0%        --       0%
      whole_re      297/s     207%    200%     0%       0%          0%        0%       --

                     Rate grep_str grep_RE inline inline-o whole_re whole_pat whole_pat-o
      grep_str     97.5/s       --     -2%   -94%     -94%     -94%      -94%        -94%
      grep_RE      99.8/s       2%      --   -94%     -94%     -94%      -94%        -94%
      inline       1686/s    1628%   1589%     --      -0%      -1%       -1%         -1%
      inline-o     1688/s    1630%   1591%     0%       --      -1%       -1%         -1%
      whole_re     1707/s    1650%   1610%     1%       1%       --       -0%         -0%
      whole_pat    1707/s    1650%   1610%     1%       1%       0%        --         -0%
      whole_pat-o  1707/s    1650%   1610%     1%       1%       0%        0%          --
      The list lengths were chosen so that the likelihood of actually getting a hit is reasonable (~80%). If we increase the pattern lengths to 10 characters so that failure is almost guaranteed, I get the following:
                     Rate grep_str grep_RE inline inline-o whole_pat-o whole_pat whole_re
      grep_str      169/s       --     -4%   -46%     -46%        -46%      -46%     -46%
      grep_RE       177/s       5%      --   -43%     -43%        -44%      -44%     -44%
      inline        312/s      85%     77%     --      -0%         -0%       -0%      -0%
      inline-o      312/s      85%     77%     0%       --         -0%       -0%      -0%
      whole_pat-o   313/s      85%     77%     0%       0%          --       -0%      -0%
      whole_pat     313/s      86%     77%     0%       0%          0%        --       0%
      whole_re      313/s      86%     77%     0%       0%          0%        0%       --

      You get negligible impact from the optimization, and you break your ability to update your array of patterns (potential for bugs). You also potentially confuse less-sophisticated people who might look at your code. The single combined pattern is a clear improvement over looping with grep, but if benchmarking shows this step is your bottleneck, then you are probably better off either optimizing your pattern or rethinking your filtering.
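      To illustrate the kind of bug meant here, a small sketch. The sub names and sample data are hypothetical. With /o the interpolated pattern is locked in the first time the match is executed, so a later call with a different pattern silently reuses the old one:

```perl
use strict;
use warnings;

# Hypothetical helper: counts matching items, using /o on the match.
sub count_with_o {
    my ($pattern, @items) = @_;
    return scalar grep { /$pattern/o } @items;
}

# Same helper without /o: the pattern is re-interpolated as needed.
sub count_without_o {
    my ($pattern, @items) = @_;
    return scalar grep { /$pattern/ } @items;
}

my @items = qw(apple banana cherry);

my $first_o  = count_with_o('apple',         @items);
my $second_o = count_with_o('banana|cherry', @items);  # still matches /apple/

my $first  = count_without_o('apple',         @items);
my $second = count_without_o('banana|cherry', @items);

print "with /o:    $first_o, $second_o\n";   # 1, 1
print "without /o: $first, $second\n";       # 1, 2
```

      If compiling once is what you are after, building the pattern with qr// and reusing that object gives you the same effect while keeping the moment of compilation explicit.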


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^2: More than one pattern match using grep on a file
by CountZero (Bishop) on Apr 05, 2014 at 09:01 UTC
    For very simple patterns, you can squeeze out some extra speed by using Regexp::Trie. If the patterns are more complicated, Regexp::Assemble or Regexp::Optimizer are worth looking into.
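    To show the idea behind a trie-based pattern, here is a core-Perl sketch. It is not the implementation of Regexp::Trie or any of the modules named above, and the function names are hypothetical; for real use, consult the documentation of those modules. It folds literal words that share prefixes into nested groups, so the regex engine does not retry the same prefix for every alternative:

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex source string from literal words
# by merging their common prefixes into a trie.
sub trie_regex {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        for my $ch (split //, $word) {
            $node->{$ch} //= {};
            $node = $node->{$ch};
        }
        $node->{''} = 1;    # empty key marks "a word ends here"
    }
    return _node_to_re(\%trie);
}

# Recursively turn a trie node into regex source.
sub _node_to_re {
    my ($node) = @_;
    my @alts;
    for my $ch (sort grep { $_ ne '' } keys %$node) {
        push @alts, quotemeta($ch) . _node_to_re($node->{$ch});
    }
    return '' unless @alts;
    my $re = '(?:' . join('|', @alts) . ')';
    $re .= '?' if exists $node->{''};   # a word may end before this branch
    return $re;
}

my $source = trie_regex(qw(foo foobar fob));
my $re     = qr/^$source$/;

print "$source\n";
print "$_: ", ($_ =~ $re ? "match" : "no match"), "\n"
    for qw(foo foobar fob fo foob);
```

    This only handles literal strings, which is why it fits the "very simple patterns" case.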

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
