
but I'm concerned about speed. If it's doing this for every file on a terabyte server, I'm worried about the time consumption. What do you think?

Just the fact that you hide a loop as regexp alternatives doesn't mean it's suddenly orders of magnitude faster. In fact, it may well be that splitting the regexp into smaller chunks is faster, because the optimizer kicks in.

Here's a benchmark:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw /cmpthese/;
                 
our @regexes = (
    '.*\.jpg$',
    '.*\.png$',
    'Perl',
    '\.mozilla/abigail',
);
                     
our @words = `find /home/abigail`;  # 38517 files.
our ($c1, $c2);
                    
cmpthese -60 => {
    single   => 'my $regex = join "|" => @regexes;
                 $c1 = 0;
                 for my $w (@words) {
                     $c1 ++ if $w =~ /$regex/
                 }',
    many     => '$c2 = 0;
                 WORD:
                 for my $w (@words) {
                     for my $r (@regexes) {
                         $c2 ++, next WORD if $w =~ /$r/
                     }
                 }',
};
    
die "Unequal\n" unless $c1 == $c2;
                     
__END__
       s/iter single   many
single   4.86     --   -74%
many     1.28   281%     --

Now, for your particular data set, the results might be different. But don't assume alternatives are necessarily slower.
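Since the thread is about returning the pattern that matched, here is a minimal sketch of the loop variant that does so. It is not part of the benchmark above: the helper name `first_match` and the sample paths are made up for illustration, and the patterns are precompiled with `qr//`, which the benchmark does not do.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Same patterns as in the benchmark, compiled once with qr//.
my @regexes = map { qr/$_/ } (
    '.*\.jpg$',
    '.*\.png$',
    'Perl',
    '\.mozilla/abigail',
);

# Hypothetical helper: returns the first compiled pattern that
# matches $word, or undef (in scalar context) if none does.
sub first_match {
    my ($word, @patterns) = @_;
    for my $re (@patterns) {
        return $re if $word =~ $re;
    }
    return;
}

# Made-up sample paths, standing in for the output of find.
for my $path ('/home/abigail/pics/cat.jpg', '/home/abigail/notes.txt') {
    my $hit = first_match($path, @regexes);
    print "$path: ", (defined $hit ? "matched $hit" : "no match"), "\n";
}
```

With a single joined regexp you would need extra work (capture groups, for instance) to find out which alternative matched; with the loop you get that for free. Whether it is also faster on your files is something only a benchmark on your own data can tell.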

Abigail

In reply to Re: Returning regexp pattern that was used to match by Abigail-II
in thread Returning regexp pattern that was used to match by crabbdean
