Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re^2: multiple substitution

by aaron_baugher (Deacon)
on Aug 25, 2012 at 16:39 UTC ( #989736=note: print w/ replies, xml ) Need Help??


in reply to Re: multiple substitution
in thread multiple substitution

I answered a similar question recently with a loop:

$s =~ s/$_/$h{$_}/g for keys %h;

So I wondered how that would compare to your solution of combining the searches into a single regex. I thought your way might win for a few words, but surely with a lot of words the complexity of the regex would slow it down, right?

Well, so much for that theory. The Perl regex engine continues to amaze me. I gave it a pattern combining 676 strings (all two-letter combinations) with pipes like yours, and it blew the forloop method away (92 times faster). It also beat a regex solution using Regexp::Assemble, but I was using very simple and known search strings, so the hand-made pipe method was safe and simple. With unknown or more complex strings, making it harder to hand-make a safe and efficient search pattern, I think RA would probably come out on top eventually. Anyway, my test and results:

abaugher@bannor> cat 989705.pl #!/usr/bin/env perl use Modern::Perl; use Benchmark qw(:all); use Regexp::Assemble; my %h = map { $_ => uc } ( 'aa' .. 'zz' ); my $s = `cat bigfile`; # 8MB file say "Testing with @{[-s 'bigfile']} byte file and @{[ scalar keys %h ] +} patterns"; cmpthese( 10, { 'forloop' => \&forloop, 'pipes' => \&pipes, 'regexpa' => \&regexpa, }); sub forloop { $s =~ s/$_/$h{$_}/g for keys %h; } sub pipes { my $p = join '|', keys %h; $s =~ s/($p)/$h{$1}/g; } sub regexpa { my $p = Regexp::Assemble->new->add(keys %h)->re; $s =~ s/($p)/$h{$1}/g; } abaugher@bannor> perl 989705.pl Testing with 8560854 byte file and 676 patterns Rate forloop regexpa pipes forloop 9.75e-02/s -- -96% -99% regexpa 2.40/s 2364% -- -74% pipes 9.08/s 9213% 278% --

Aaron B.
Available for small or large Perl jobs; see my home node.


Comment on Re^2: multiple substitution
Select or Download Code
Re^3: multiple substitution
by Corion (Pope) on Aug 25, 2012 at 16:48 UTC

    I only (re)used what the OP had as a regular expression already. But your results mesh well with When Perl Isn't Quite Fast Enough - the less ops you need, and the more you can do within the RE engine, the faster your Perl code is.

Re^3: multiple substitution
by AnomalousMonk (Abbot) on Aug 25, 2012 at 18:11 UTC

    The  pipes() and  regexpa() functions used in the timing loops above both include generation of the matching regexes in each loop execution. I doubt it adds greatly to the overall execution time, but is it proper to include regex generation in the timing of a substitution operation?

    On a more critical note, a substitution is done on the  $s string in each repetition of each timing loop, but will there be anything to be found for substitution after the first pass of whatever timing function happens to be executed first? Are not all subsequent passes in all functions just comparing the time it takes for a regex to find no match in a string? (Maybe take the 8MB file content and  x   it into three identical 200 - 500MB strings and do just one comparison pass of substitutions on each string.)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://989736]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2014-12-25 11:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (160 votes), past polls