PerlMonks  

How do I quickly strip blank space from the beginning/end of a string?

by blahblahblah (Priest)
on Jul 21, 2009 at 20:47 UTC (#782077)
blahblahblah has asked for the wisdom of the Perl Monks concerning the following question:

We have a subroutine that we call tens of thousands of times per page in our old, large codebase. It can't fully trust its input (see last 3 words of previous sentence), so it always trims the whitespace from its arguments. It uses the method suggested in perlfaq4:
    How do I strip blank space from the beginning/end of a string?
    (contributed by brian d foy)
    A substitution can do this for you. For a single line, you want to replace all the leading or trailing whitespace with nothing. You can do that with a pair of substitutions.

        s/^\s+//;
        s/\s+$//;

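For reference, the FAQ's pair of substitutions is usually wrapped in a small helper. A minimal sketch, where the name `trim` and the pass-through of undefined input are assumptions of this example and not part of the FAQ:

```perl
use strict;
use warnings;

# Hypothetical helper: applies the two perlfaq4 substitutions to a copy
# of its argument and returns the trimmed copy.
sub trim {
    my ($s) = @_;
    return $s unless defined $s;   # assumption: leave undef alone
    $s =~ s/^\s+//;                # drop leading whitespace
    $s =~ s/\s+$//;                # drop trailing whitespace
    return $s;
}

print '[', trim("  hello world \t\n"), "]\n";   # prints [hello world]
```

Working on a copy costs an extra string copy per call, which matters when the sub runs tens of thousands of times per page.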
Recently I've been alerted to the fact that there are possibly much faster ways to do this. It seems japhy has written much about this, including a good overview with benchmarks at Regexes are slow (or, why I advocate String::Index). That post is pretty old, so I tried some of his benchmarks on perl 5.10. I found similar results. However, I noticed that his benchmarks all have long data, but my situation is that the data length will be varied, and often will be very short. Here are my results, once with long data and once short:

Update: As Ikegami pointed out, my benchmark had little to do with my stated question. See reply below for a new attempt at this.

    use Benchmark 'cmpthese';

    my $str = "alphabet X alphabet" x 100 . "junk at the end" x 10;

    cmpthese(-5, {
        last      => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ },
        capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ },
        rx_rx     => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// },
        sexeger   => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; },
    });

    print "-------------- now a short one --------------------------------\n";

    $str = "a very short one";

    cmpthese(-5, {
        last      => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ },
        capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ },
        rx_rx     => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// },
        sexeger   => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; },
    });

    __END__
                 Rate  last sexeger capt_repl rx_rx
    last       4329/s    --    -78%      -82%  -86%
    sexeger   19386/s  348%      --      -19%  -37%
    capt_repl 23859/s  451%     23%        --  -22%
    rx_rx     30778/s  611%     59%       29%    --
    -------------- now a short one --------------------------------
                  Rate sexeger capt_repl rx_rx last
    sexeger   261969/s      --       -4%   -6% -70%
    capt_repl 271691/s      4%        --   -3% -68%
    rx_rx     279271/s      7%        3%    -- -67%
    last      859023/s    228%      216%  208%   --

You might call this micro-optimization, but my users really would notice a couple-second difference as these scripts are running in a web app. I'm wondering if I can zero in on a particular string length where the "last" method degrades and the "sexeger" method becomes better. I don't have a lot of benchmarking experience, so even if I find that spot I'm not sure I'll trust my data or my methods. Just looking for advice from those who may have pondered this before or have insight that they can lend.

Thanks,
Joe

Update: crossed out non-applicable benchmark. See reply below for my latest attempt.

Re: How do I quickly strip blank space from the beginning/end of a string?
by ikegami (Pope) on Jul 21, 2009 at 21:01 UTC
    Your solutions are very broken.
    • All of your solutions require the last non-space character to be an uppercase letter. You could say you're investigating a subset of the original question, except that your input has no uppercase letters. Use \s or a space instead of [^A-Z]. Use \S instead of [A-Z].

    • Some of your solutions have problems with trailing newlines due to the missing "s" modifier. /./ doesn't match newlines without it.

    • All but the last of your solutions don't work if the input is made entirely of spaces.

    There's also an issue with your benchmark. Benchmarking code that removes trailing spaces when your input never has trailing spaces is odd. Include a case where the input has trailing spaces!
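A rough sketch of how two of the variants might look with those objections addressed. This is one possible reading of the advice, not code from the thread: the sub names are made up, and the empty alternative in the capture version is an assumption added so that all-whitespace input still matches.

```perl
use strict;
use warnings;

# capt_repl with \S in place of [A-Z], the /s modifier so "." matches
# newlines, and an empty alternative so all-whitespace input still matches.
sub rtrim_capture {
    my ($x) = @_;
    $x =~ s/^(.*\S|)\s*\z/$1/s;
    return $x;
}

# sexeger with \s in place of [^A-Z]: reverse, trim the front, reverse back.
sub rtrim_sexeger {
    my ($x) = @_;
    $x = reverse $x;          # scalar assignment, so this reverses the string
    $x =~ s/^\s+//;
    return scalar reverse $x;
}

# Inputs chosen to hit each objection: no trailing space, trailing spaces,
# embedded and trailing newlines, all spaces, and the empty string.
for my $in ("no trailing", "trailing   ", "line one\nline two \n", "    ", "") {
    (my $want = $in) =~ s/\s+\z//;   # reference result
    for my $f (\&rtrim_capture, \&rtrim_sexeger) {
        print $f->($in) eq $want ? "ok\n" : "MISMATCH\n";
    }
}
```

Every line printed should read ok. Checking the variants against a plain `s/\s+\z//` before benchmarking them guards against timing code that does not actually solve the problem.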

      You're right, in my haste to get this question out there I made the jump from my original problem to japhy's post to talking about his benchmarks. One of my coworkers pointed out the same thing to me as I was headed out the door. Obviously I should be benchmarking the exact problem I want to solve, not some generally similar example. After I get the kids to bed I'll write a better benchmark and try your \K suggestion below too. Thanks.

      Also, you made the point that none of my input ends with spaces. I think that's generally true in real life usage too. It's frustrating that we have this pervasive idiom in our code of "strip whitespace just in case", but I think most of the time the input is already just fine. In fact, I think much of the time the input is short and has no spaces at all. I wonder if I should be checking it with index() first to quickly rule out that case.
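The index() idea could be sketched roughly as below. Note the big assumption: this version only treats the literal space character as padding, so tabs and newlines are left alone, which is a change from what `\s` does. The name `trim_spaces_fast` is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical fast path: if the string contains no space character at
# all, return it untouched and never start the regex engine.
# Assumption: only literal spaces need trimming (not tabs or newlines).
sub trim_spaces_fast {
    my ($s) = @_;
    return $s if index($s, ' ') < 0;   # common case: nothing to do
    $s =~ s/^ +//;                     # leading spaces
    $s =~ s/ +$//;                     # trailing spaces
    return $s;
}

print '[', trim_spaces_fast("nospaces"), "]\n";      # prints [nospaces]
print '[', trim_spaces_fast("  two words  "), "]\n"; # prints [two words]
```

Whether the extra index() call pays off depends on how often the input really has no spaces, so it would need measuring against real data.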

      update: added paragraph spacing

        Also, you made the point that none of my input ends with spaces. I think that's generally true in real life usage too.

        Then shouldn't you be benchmarking space detection?

Re: How do I quickly strip blank space from the beginning/end of a string?
by ikegami (Pope) on Jul 21, 2009 at 21:04 UTC

    $x =~ /.*[A-Z]/g and $x =~ s/\G.*//
    can be written as
    $x =~ s/(.*[A-Z]).*/$1/
    and as
    $x =~ s/.*[A-Z]\K.*// (requires 5.10.0)
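To check that the three spellings really agree, something like the following can be run. The sub names are invented for this sketch; the regexes are the ones quoted above.

```perl
use strict;
use warnings;
use 5.010;   # \K needs perl 5.10.0 or later

# Two-step form: the /g match leaves pos($x) after the last capital,
# then \G anchors the substitution at that position.
sub strip_two_step { my ($x) = @_; $x =~ /.*[A-Z]/g and $x =~ s/\G.*//; return $x }

# Capture form: keep everything up to the last capital via $1.
sub strip_capture  { my ($x) = @_; $x =~ s/(.*[A-Z]).*/$1/; return $x }

# \K form: "keep" what was matched so far, replace only the rest.
sub strip_keep     { my ($x) = @_; $x =~ s/.*[A-Z]\K.*//; return $x }

for my $f (\&strip_two_step, \&strip_capture, \&strip_keep) {
    print $f->("alphabet X alphabet"), "\n";   # each line prints: alphabet X
}
```

All three leave a string with no capital letter unchanged, since the initial match fails.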

Re: How do I quickly strip blank space from the beginning/end of a string?
by blahblahblah (Priest) on Jul 22, 2009 at 00:51 UTC
    Here's another attempt, with code that actually solves the problem of stripping trailing whitespace.
        use Benchmark 'cmpthese';

        my $which = $ARGV[0] || 'short';
        my $str = {
            'short'       => "a very short one",
            'long'        => "alphabet X alphabet" x 100 . "junk at the end" x 10,
            'shortspaces' => " asdfasdf fdsdsf a sdfa sdf 3432 324 ",
        }->{$which};

        my $duration = 0;
        my $verify = 0;
        if ($ARGV[1] eq 'verify') {
            $duration = 1;
            $verify = 1;
        }

        cmpthese($duration, {
            last         => sub { my $x = $str; $x =~ s/\s*$//; $method = "last"; print "$x|\n" if $verify; },
            plus         => sub { my $x = $str; $x =~ s/\s+$//; $method = "plus"; print "$x|\n" if $verify; },
            rx_rx        => sub { my $x = $str; $x =~ /.*\S/g and $x =~ s/\G.*//; $method = "rxrx"; print "$x|\n" if $verify; },
            sexeger      => sub { my $x = $str; ($x = reverse $x) =~ s/^\s+//; $x = reverse $x; $method = "sgxr"; print "$x|\n" if $verify; },
            lookbehind   => sub { my $x = $str; $x =~ s/.\K\s+$//; $method = "lkbd"; print "$x|\n" if $verify; },
            detectregex  => sub { my $x = $str; if ($x =~ /\s/){($x = reverse $x) =~ s/^\s+//; $x = reverse $x;} $method = "dtrs"; print "$x|\n" if $verify; },
            detectsubstr => sub { my $x = $str; if (substr($x, -1) =~ /^\s/){($x = reverse $x) =~ s/^\s+//; $x = reverse $x;} $method = "dtss"; print "$x|\n" if $verify; },
        });

        __END__
        ./test long ; ./test short ; ./test shortspaces

                         Rate lookbehind   last   plus detectregex sexeger rx_rx detectsubstr
        lookbehind      750/s         --    -5%   -86%        -97%    -97%  -99%        -100%
        last            792/s         6%     --   -85%        -96%    -97%  -99%        -100%
        plus           5322/s       610%   572%     --        -76%    -77%  -92%         -98%
        detectregex   21914/s      2823%  2666%   312%          --     -6%  -68%         -93%
        sexeger       23287/s      3006%  2839%   338%          6%      --  -66%         -92%
        rx_rx         67557/s      8910%  8426%  1169%        208%    190%    --         -78%
        detectsubstr 303483/s     40373% 38201%  5603%       1285%   1203%  349%           --

                         Rate lookbehind last detectregex rx_rx plus sexeger detectsubstr
        lookbehind    81153/s         --  -1%        -54%  -59% -67%    -71%         -83%
        last          81683/s         1%   --        -54%  -58% -66%    -71%         -83%
        detectregex  178253/s       120% 118%          --   -9% -27%    -37%         -63%
        rx_rx        196768/s       142% 141%         10%    -- -19%    -30%         -59%
        plus         243234/s       200% 198%         36%   24%   --    -14%         -50%
        sexeger      282994/s       249% 246%         59%   44%  16%      --         -41%
        detectsubstr 481782/s       494% 490%        170%  145%  98%     70%           --

                         Rate lookbehind last plus detectregex detectsubstr rx_rx sexeger
        lookbehind    35057/s         --  -9% -72%        -76%         -77%  -80%    -83%
        last          38419/s        10%   -- -69%        -74%         -74%  -78%    -82%
        plus         123667/s       253% 222%   --        -15%         -18%  -30%    -41%
        detectregex  145442/s       315% 279%  18%          --          -3%  -18%    -31%
        detectsubstr 149917/s       328% 290%  21%          3%           --  -15%    -29%
        rx_rx        176362/s       403% 359%  43%         21%          18%    --    -16%
        sexeger      210526/s       501% 448%  70%         45%          40%   19%      --

    Using substr to check the last char really helps for the long strings; the others are not so decisive. In every case, my current method of blindly doing s/\s*$// is at or near the bottom (only lookbehind is slower). Any suggestions for making it faster? Thanks.

      If most of your strings do not have trailing white space, then optimizing this case will be best. A little extra time to handle the few strings with trailing white space will not affect the average much. Do you know what percentage have trailing white space?

      String comparison is faster than a RE for detecting a space character. Is your trailing white space strictly spaces or might there be tabs, newlines or other white space characters to be removed?
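One way to detect trailing whitespace without a regex, as a hedged sketch: look up the last character in a small hash and only run the substitution when it is whitespace. The hash below covers ASCII space, tab, newline, carriage return and form feed only; that list is an assumption, and Unicode whitespace would slip through. The names `%IS_WS` and `rtrim_checked` are invented here.

```perl
use strict;
use warnings;

# Assumption: these five ASCII characters are the only whitespace
# that can appear at the end of the input.
my %IS_WS = map { $_ => 1 } (' ', "\t", "\n", "\r", "\f");

# Hypothetical helper: a hash lookup on the last character decides
# whether the regex needs to run at all.
sub rtrim_checked {
    my ($s) = @_;
    return $s if $s eq '' or !$IS_WS{ substr($s, -1) };   # nothing to trim
    $s =~ s/\s+\z//;                                      # rare case: trim
    return $s;
}

print '[', rtrim_checked("clean"), "]\n";        # prints [clean]
print '[', rtrim_checked("padded \t\n"), "]\n";  # prints [padded]
```

If the trailing whitespace is known to be strictly spaces, the hash lookup could be replaced by `substr($s, -1) eq ' '`.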

      I played with a few alternatives.
