
Re: Craftier

by BatGnat (Scribe)
on Dec 15, 2000 at 03:53 UTC (#46757)

in reply to Craftier

Just on a side note, would it not be better to use s/\s//g; or tr/\s//;?
I just ran a benchmark on the two, and the regex is approximately 3 times quicker than $_ = join ' ', split;
This is the benchmark that I ran.
use Benchmark;

my $junk = 'The quick brown fox Jumped over the lazy dog';

timethese(5000000, {
    'split'  => '$junk = join \' \',split $junk;',
    'regex1' => '$junk =~ tr/\s//;',
    'regex2' => '$junk =~ s/\s//g;',
});
and the results are
Benchmark: timing 5000000 iterations of regex1, regex2, split...
    regex1:  3 wallclock secs ( 4.38 usr +  0.00 sys =  4.38 CPU) @ 1142334.93/s (n=5000000)
    regex2:  3 wallclock secs ( 3.50 usr +  0.00 sys =  3.50 CPU) @ 1426940.64/s (n=5000000)
     split: 15 wallclock secs (15.07 usr +  0.00 sys = 15.07 CPU) @ 331762.99/s (n=5000000)


Re: Re: Craftier
by chipmunk (Parson) on Dec 15, 2000 at 04:30 UTC
    I'm afraid there are serious problems with the Benchmark code that you posted. It is important to make sure all your code snippets do the right thing before you benchmark them, and to make sure the benchmark itself is doing the right thing. I hope it will be instructive if I detail the issues.

    There are two problems with the tr/// solution; \s is not special inside tr///, and /d is required for tr/// to delete characters. There are also two problems with the split solution; you are splitting $_ using $junk as the delimiter, and you are joining with a space instead of a null string.
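The two tr/// points can be checked with a minimal sketch like the following. The sample string and variable names are hypothetical, and the block only illustrates the behaviour described above; it is not the original poster's code.

```perl
use strict;
use warnings;

my $text = "The quick\tbrown fox\n";   # hypothetical sample input

my ($unchanged, $no_class, $stripped);
{
    no warnings 'misc';   # tr/\s// may trigger an "Unrecognized escape" warning

    # No /d and an empty replacement list: tr/// only counts matching
    # characters here, so the string comes back untouched.
    ($unchanged = $text) =~ tr/\s//;

    # Even with /d, \s is not a whitespace class inside tr///,
    # so the spaces, the tab and the newline all survive.
    ($no_class = $text) =~ tr/\s//d;
}

# Spelling out the whitespace characters and adding /d does the job.
($stripped = $text) =~ tr/ \t\r\n//d;

print "unchanged by tr/\\s//: ", ($unchanged eq $text ? "yes" : "no"), "\n";
print "whitespace left after tr/\\s//d: ", ($no_class =~ /\s/ ? "yes" : "no"), "\n";
print "stripped: $stripped\n";
```

The last form is the one used in the improved benchmark in this reply.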

    There are also problems with the benchmark itself. $junk is a lexical, so it is not accessible from the Benchmark module. Since you passed quoted strings, your snippets were compiled in the Benchmark module and were operating on an empty $junk. Once that problem is fixed, since each code snippet modifies $junk in place, only the first execution of the first snippet would have any work to do; all the remaining iterations would be processing a string that had already been stripped of whitespace.
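Both benchmark problems can be reproduced without Benchmark itself. In the sketch below, the package Other and its two helpers are hypothetical stand-ins for a module that evaluates quoted code in its own package; the second half shows why an in-place edit leaves later iterations with nothing to do.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that compiles quoted snippets in
# its own package, where the caller's lexicals are not in scope.
{
    package Other;
    no strict 'vars';
    sub run_string { my ($code) = @_; return eval $code; }
    sub run_sub    { my ($code) = @_; return $code->(); }
}

my $junk = 'The quick brown fox';

# The string is compiled inside Other, so $junk there is not our lexical.
my $from_string = Other::run_string('$junk');

# A code reference is a closure and carries the lexical $junk with it.
my $from_sub = Other::run_sub(sub { $junk });

# Modifying in place: only the first pass finds any whitespace.
my $in_place = 'a b c';
my @in_place_counts;
push @in_place_counts, ($in_place =~ s/\s+//g || 0) for 1 .. 3;

# Copying first gives every pass the same input to work on.
my $source = 'a b c';
my @copy_counts;
for (1 .. 3) {
    my $n = (my $x = $source) =~ s/\s+//g;
    push @copy_counts, $n;
}

print "string snippet saw: ", (defined $from_string ? $from_string : 'undef'), "\n";
print "closure saw: $from_sub\n";
print "in-place substitutions per pass: @in_place_counts\n";
print "copy-first substitutions per pass: @copy_counts\n";
```

This is why the improved benchmark passes anonymous subs and uses the ($x = $junk) =~ ... idiom.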

    Here is an improved benchmark:

    #!perl
    use Benchmark;

    my $junk = 'The quick brown fox Jumped over the lazy dog';

    timethese(-10, {
        'split' => sub { $x = join '', split ' ', $junk; },
        'trans' => sub { ($x = $junk) =~ tr/ \t\r\n//d; },
        'subst' => sub { ($x = $junk) =~ s/\s+//g; },
    });
    and the new results are:
    Benchmark: running split, subst, trans, each for at least 10 CPU seconds...
         split: 43912.46/s
         subst: 66211.19/s
         trans: 197755.00/s
    As you can see, the translation solution is actually the big winner, and the substitution is only 1.5 times as fast as split/join.
      Translation is always the fastest if you can use it; at least, that's my experience.

      One thing to consider is that the suggested alternatives remove all whitespace. The join/split just drops leading and trailing whitespace and squishes any "extra" whitespace in between down to one space (provided you join on ' ' and not ''). You need more than one regex or translation to accomplish the same thing.
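A small sketch of the difference, using a hypothetical sample string: split ' ' with a single literal space is the special mode that skips leading whitespace and splits on runs of whitespace, so joining on ' ' compresses rather than removes.

```perl
use strict;
use warnings;

my $text = "  The quick \t brown   fox \n";   # hypothetical sample input

# join/split: trims both ends and squeezes inner runs to one space.
my $compressed = join ' ', split ' ', $text;

# The substitution suggested earlier removes every whitespace character.
(my $removed = $text) =~ s/\s+//g;

print "[$compressed]\n";   # [The quick brown fox]
print "[$removed]\n";      # [Thequickbrownfox]
```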

      The simplest (to read) equivalent would be

      s/^\s*//; s/\s*$//; s/\s+/ /g;
      The quickest alternative would likely be to use translation to squish all the white space first and then do regexes to strip the (perhaps) remaining single spaces at the beginning and end of the string:
      tr/ \t/ /s; s/^ //; s/ $//;
      (add extra whitespace equivalents in the translation if you want to lose carriage returns and such).
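As a rough check, the readable version and the translation-first version can be run side by side. The sample string below is hypothetical and contains only spaces and tabs, since tr/ \t/ /s as written does not touch newlines or carriage returns.

```perl
use strict;
use warnings;

my $text = "  The quick \t brown   fox  ";   # hypothetical: spaces and tabs only

# Readable version: trim both ends, then squeeze what is left.
my $by_regex = $text;
for ($by_regex) { s/^\s*//; s/\s*$//; s/\s+/ /g; }

# Translation first: map tabs to spaces and squeeze runs (/s),
# then strip the single space that may remain at either end.
my $by_tr = $text;
for ($by_tr) { tr/ \t/ /s; s/^ //; s/ $//; }

print "[$by_regex]\n";   # [The quick brown fox]
print "[$by_tr]\n";      # [The quick brown fox]
```

Both produce the same string for this input; which one is quicker is a separate question for a benchmark.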

      In any event, the point was that the join/split is an interesting alternative, and not all that inefficient. Lots of times when I want to strip extra whitespace, speed isn't that big a deal anyway :) Some people like to use s/^\s*|\s*$//g to strip leading and trailing whitespace, but that's less efficient than doing two substitutions, so it's not always about speed.
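For reference, here is the single-substitution trim written out in full next to the two-substitution form. This sketch (with a hypothetical sample string) only shows that they give the same result; it does not measure which is faster.

```perl
use strict;
use warnings;

my $text = "  The quick brown fox \t ";   # hypothetical sample input

# One substitution with an alternation, applied globally.
(my $one_pass = $text) =~ s/^\s*|\s*$//g;

# Two anchored substitutions, one per end of the string.
(my $two_pass = $text) =~ s/^\s+//;
$two_pass =~ s/\s+$//;

print "[$one_pass]\n";   # [The quick brown fox]
print "[$two_pass]\n";   # [The quick brown fox]
```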

      Perhaps I can persuade japhy to present his benchmarks?

        Ah, right, the original goal was not to remove all whitespace, but to compress whitespace. Here's a new benchmark:
        #!perl
        use Benchmark;

        my $junk = ' The quick brown fox Jumped over the lazy dog ';

        timethese(-10, {
            'split'  => sub { $x = join ' ', split ' ', $junk; },
            'trans'  => sub { ($x = $junk) =~ tr/ \t\r\n/ /s; $x =~ s/^ //; $x =~ s/ $//; },
            'subst'  => sub { ($x = $junk) =~ s/\s+/ /g; $x =~ s/^ //; $x =~ s/ $//; },
            'subst2' => sub { ($x = $junk) =~ s/^\s+//; $x =~ s/\s+$//; $x =~ s/\s+/ /g; },
        });
        And the results:
        Benchmark: running split, subst, subst2, trans, each for at least 10 CPU seconds...
             split: 41225.50/s
             subst: 40796.61/s
            subst2: 38222.28/s
             trans: 72880.42/s
        split fares much better when the task is to compress whitespace, beating either substitution solution, but translate is still the winner. The extra substitutions needed to remove whitespace at the beginning and end of the string slow down the solutions that require them quite a bit.
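Since this thread started with snippets that did not do what was intended, a sanity check along these lines could be run before timing anything. It is only a sketch: the %candidates hash, the sample string with extra spaces and a tab, and the use of lexical copies instead of a shared $x are all assumptions made for the illustration.

```perl
use strict;
use warnings;

my $junk = "  The quick  brown fox \t Jumped over the lazy dog  ";   # hypothetical input

# The four candidates from the benchmark, each returning its result.
my %candidates = (
    split  => sub { join ' ', split ' ', $junk },
    trans  => sub { (my $x = $junk) =~ tr/ \t\r\n/ /s; $x =~ s/^ //; $x =~ s/ $//; $x },
    subst  => sub { (my $x = $junk) =~ s/\s+/ /g; $x =~ s/^ //; $x =~ s/ $//; $x },
    subst2 => sub { (my $x = $junk) =~ s/^\s+//; $x =~ s/\s+$//; $x =~ s/\s+/ /g; $x },
);

# Run each candidate once and collect the distinct outputs.
my (%result, %distinct);
for my $name (sort keys %candidates) {
    $result{$name} = $candidates{$name}->();
    $distinct{ $result{$name} } = 1;
    print "$name: [$result{$name}]\n";
}

my $all_agree = (keys %distinct) == 1;
print $all_agree ? "all candidates agree\n" : "candidates differ\n";
```

Only when every candidate produces the same string do the per-second figures compare like with like.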
      Sorry for the incorrect posting, but either way, you proved my point. As for the space in the split, I copied that directly from his code, $_ = join ' ', split; and modified it. I didn't even look to see if his code was wrong; I should have checked.
      Thanks for the help. I have only started using Benchmark recently.

      Micro$ofts new corporate motto: RESISTENCE IS FUTILE
        Er, 2 things, originally it said:
        (Note: you need to use $x = join ' ', split ' ', $x; if your string isn't in $_.)
        and '\s' isn't special inside of tr///? I didn't know that? How come?

