
Re^2: multiple substitution

by aaron_baugher (Curate)
on Aug 25, 2012 at 16:39 UTC ( #989736=note )

in reply to Re: multiple substitution
in thread multiple substitution

I answered a similar question recently with a loop:

$s =~ s/$_/$h{$_}/g for keys %h;

So I wondered how that would compare to your solution of combining the searches into a single regex. I thought your way might win for a few words, but surely with a lot of words the complexity of the regex would slow it down, right?
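As a minimal sketch of the two approaches being compared, the snippet below runs both on a tiny made-up input. The hash contents and the sub names `by_loop` and `by_alternation` are illustrative assumptions, not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen so that no replacement contains a key.
my %h = ( cat => 'dog', bird => 'fish' );

# Approach 1: one s///g pass over the whole string per key.
sub by_loop {
    my ($s) = @_;
    $s =~ s/$_/$h{$_}/g for keys %h;
    return $s;
}

# Approach 2: a single pass with all keys joined into one alternation.
sub by_alternation {
    my ($s) = @_;
    my $p = join '|', keys %h;
    $s =~ s/($p)/$h{$1}/g;
    return $s;
}

print by_loop('cat and bird'), "\n";          # dog and fish
print by_alternation('cat and bird'), "\n";   # dog and fish
```

The two only agree when no replacement text can itself be matched by another key; with the loop, an earlier substitution's output may be rewritten by a later one, which the single-pass version avoids.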

Well, so much for that theory. The Perl regex engine continues to amaze me. I gave it a pattern combining 676 strings (all two-letter combinations) with pipes like yours, and it blew the for-loop method away (92 times faster). It also beat a regex solution using Regexp::Assemble, but I was using very simple, known search strings, so the hand-made pipe method was safe and simple. With unknown or more complex strings, where it is harder to hand-make a safe and efficient search pattern, I think Regexp::Assemble would probably come out on top eventually. Anyway, my test and results:

abaugher@bannor> cat
#!/usr/bin/env perl
use Modern::Perl;
use Benchmark qw(:all);
use Regexp::Assemble;

my %h = map { $_ => uc } ( 'aa' .. 'zz' );
my $s = `cat bigfile`; # 8MB file
say "Testing with @{[-s 'bigfile']} byte file and @{[ scalar keys %h ]} patterns";

cmpthese( 10, {
    'forloop' => \&forloop,
    'pipes'   => \&pipes,
    'regexpa' => \&regexpa,
});

sub forloop {
    $s =~ s/$_/$h{$_}/g for keys %h;
}

sub pipes {
    my $p = join '|', keys %h;
    $s =~ s/($p)/$h{$1}/g;
}

sub regexpa {
    my $p = Regexp::Assemble->new->add(keys %h)->re;
    $s =~ s/($p)/$h{$1}/g;
}

abaugher@bannor> perl
Testing with 8560854 byte file and 676 patterns
              Rate forloop regexpa   pipes
forloop 9.75e-02/s      --    -96%    -99%
regexpa    2.40/s    2364%      --    -74%
pipes      9.08/s    9213%    278%      --
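For search strings that are not known in advance, one common way to keep a hand-made pipe pattern safe is to escape each key with `quotemeta` and try longer keys first. The following is only a sketch under those assumptions; the data and the helper names `build_pattern` and `replace_all` are hypothetical and do not appear in the benchmark above.

```perl
use strict;
use warnings;

# Hypothetical data: keys contain regex metacharacters and share a prefix.
my %h = ( 'a.b' => 'X', 'a' => 'Y', 'a+' => 'Z' );

# Build one capturing alternation from the keys of a hash reference.
sub build_pattern {
    my ($map) = @_;
    my $alt = join '|',
        map  { quotemeta }                                   # escape . + etc.
        sort { length($b) <=> length($a) || $a cmp $b }      # longest first
        keys %$map;
    return qr/($alt)/;
}

# Replace every key found in $s with its value, in a single pass.
sub replace_all {
    my ($s, $map) = @_;
    my $re = build_pattern($map);
    $s =~ s/$re/$map->{$1}/g;
    return $s;
}

print replace_all('a.b a+ a axb', \%h), "\n";   # X Z Y Yxb
```

Without the length sort, the alternative `a` could match before `a.b` or `a+` gets a chance, since Perl tries alternatives left to right.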

Aaron B.
Available for small or large Perl jobs; see my home node.

Replies are listed 'Best First'.
Re^3: multiple substitution
by AnomalousMonk (Chancellor) on Aug 25, 2012 at 18:11 UTC

    The pipes() and regexpa() functions used in the timing loops above both include generation of the matching regexes in each loop execution. I doubt it adds greatly to the overall execution time, but is it proper to include regex generation in the timing of a substitution operation?

    On a more critical note, a substitution is done on the $s string in each repetition of each timing loop, but will there be anything to be found for substitution after the first pass of whatever timing function happens to be executed first? Are not all subsequent passes in all functions just comparing the time it takes for a regex to find no match in a string? (Maybe take the 8MB file content and x it into three identical 200 - 500MB strings and do just one comparison pass of substitutions on each string.)
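One possible way to address both points, sketched here under assumptions rather than as a rerun of the original test: compile the pattern once outside the timed subs, and have each sub work on a fresh copy of the input so every iteration has something to replace. The small synthetic string `$orig` stands in for the 8MB file, and the copy itself is included in the timing.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %h = map { $_ => uc $_ } ( 'aa' .. 'zz' );

# Assumption: a small synthetic input instead of the 8MB file.
my $orig = join ' ', ( 'aa' .. 'zz' ) x 20;

# Build and compile the alternation once, outside the timed code.
my $alt = join '|', map { quotemeta } keys %h;
my $re  = qr/($alt)/;

sub forloop {
    my $s = $orig;                      # fresh copy: matches exist every run
    $s =~ s/$_/$h{$_}/g for keys %h;
    return $s;
}

sub pipes {
    my $s = $orig;                      # fresh copy: matches exist every run
    $s =~ s/$re/$h{$1}/g;
    return $s;
}

# Iteration count kept low for illustration; Benchmark may warn about it.
cmpthese( 5, { forloop => \&forloop, pipes => \&pipes } );
```

No timings are shown here because they depend on the machine and input size; the suggestion of one pass over much larger strings would need correspondingly more memory.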

Re^3: multiple substitution
by Corion (Pope) on Aug 25, 2012 at 16:48 UTC

    I only (re)used what the OP had as a regular expression already. But your results mesh well with "When Perl Isn't Quite Fast Enough": the fewer ops you need, and the more you can do within the RE engine, the faster your Perl code is.
