Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Ways to delete start of string

by hsmyers (Canon)
on May 24, 2008 at 16:00 UTC ( [id://688308]=perlquestion: print w/replies, xml ) Need Help??

hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

While refactoring some code I got to wondering about 'what is the fastest way to remove the first character in a string?' So I cobbled together some code to find out. I had been using s/.//; but I am admittedly prone to regex abuse. I found the results to be a little surprising and wondered if anyone else had different entries for the putative horse race?
#!/perl/bin/perl # # bench.pl -- use strict; use warnings; use diagnostics; use Benchmark qw(:all); sub caseone { $_ = '|0|0|0|0|0|0|'; s/.//; } sub casetwo { $_ = '|0|0|0|0|0|0|'; substr($_,0,1) = ''; } sub casethree { $_ = '|0|0|0|0|0|0|'; $_ = reverse $_; chop; $_ = reverse $_; } sub casefour { $_ = '|0|0|0|0|0|0|'; $_ = substr($_,1); } cmpthese(-5,{ 'caseone' => 'caseone', 'casetwo' => 'casetwo', 'casethree' => 'casethree', 'casefour' => 'casefour', });
And the results look like:
Rate caseone casethree casetwo casefour caseone 293156/s -- -12% -21% -32% casethree 331364/s 13% -- -10% -23% casetwo 369954/s 26% 12% -- -14% casefour 430979/s 47% 30% 16% --

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Replies are listed 'Best First'.
Re: Ways to delete start of string
by moritz (Cardinal) on May 24, 2008 at 16:28 UTC
    The speed depends on the length of the string.
    # string length 2 * 10: Rate reverse regex substr_mod susbtr_copy reverse 797484/s -- -29% -30% -41% regex 1125463/s 41% -- -1% -16% substr_mod 1131927/s 42% 1% -- -16% susbtr_copy 1344489/s 69% 19% 19% -- # string length 2 * 100_000: Rate reverse susbtr_copy regex substr_mod reverse 1385/s -- -51% -60% -80% susbtr_copy 2847/s 106% -- -17% -58% regex 3437/s 148% 21% -- -49% substr_mod 6771/s 389% 138% 97% -- # with 2 * 1e7: Rate reverse susbtr_copy regex substr_mod reverse 8.61/s -- -54% -58% -81% susbtr_copy 18.7/s 117% -- -8% -58% regex 20.4/s 137% 9% -- -54% substr_mod 44.6/s 418% 139% 119% --

    The speed doesn't really depend on the perl version (I tried 5.8.8 and 5.10.0).

      'reverse' seems fairly intuitive, longer the string the greater the work (times two in fact). 'regex' looks like it is seeking constant time of sorts. I'm clueless on the flip flop with 'substr_mod' and 'substr_copy'.

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
        If you really want to know, look in pp.c in the perl source, in blead it's in lines 3055 to 3223.

        I skimmed it quickly and still have no clue how it does things - too many macros, too little knowledge from my part.

      The time needed to perform $x =~ s/.//; and substr($x,0,1) = ''; is not related to the length of $x thanks to the "OOK" optimization.

      Instead of

      1. allocating a new buffer,
      2. copying the the string to the new buffer minus the leading char,
      3. assigning the new buffer to the variable and
      4. freeing the old buffer

      those two operations

      1. increment the pointer to the buffer in the variable,
      2. assign 1 to the IV slot* of the variable if the OOK flag is off or
      3. increment the IV slot* of the variable if the OOK flag is on, and
      4. turn on the OOK flag

      The string is never copied. You can see this in effect in the following snippet:

      >perl -MDevel::Peek -e"my $x='abcdef'; Dump($x); substr($x,0,1)=''; Du +mp($x);" SV = PV(0x226104) at 0x2252e8 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK) PV = 0x182ca64 "abcdef"\0 CUR = 6 LEN = 8 SV = PVIV(0x227134) at 0x2252e8 REFCNT = 2 FLAGS = (PADBUSY,PADMY,POK,OOK,pPOK) IV = 1 (OFFSET) PV = 0x182ca65 ( "a" . ) "bcdef"\0 CUR = 5 LEN = 7

      If POK is true and OOK is false, then
      string start = PV
      string length = CUR
      number bytes allocated = LEN
      start of buffer = PV

      If POK is true and OOK is true, then
      string start = PV
      string length = CUR
      number bytes allocated = LEN + IV
      start of buffer = PV - IV

      * — Nicholas Clark recently made a change to Perl so that the chopped bytes are used instead of the IV slot. That hasn't appeared in any Perl release yet.

      Update: Changed substr($x,0,1,''); to substr($x,0,1)=''; since the rest of the thread used the latter.

Re: Ways to delete start of string
by chromatic (Archbishop) on May 24, 2008 at 16:28 UTC

    As written:

    Rate casethree casetwo caseone casefour casethree 890174/s -- -3% -26% -34% casetwo 919255/s 3% -- -24% -32% caseone 1205935/s 35% 31% -- -11% casefour 1357009/s 52% 48% 13% --

    Moving casefour to the start of the file (as the first declared subroutine):

    Rate casethree casetwo casefour caseone casethree 836916/s -- -8% -31% -33% casetwo 908163/s 9% -- -25% -28% casefour 1215803/s 45% 34% -- -3% caseone 1255655/s 50% 38% 3% --

    Moving casetwo to the start of the file:

    Rate casethree casetwo caseone casefour casethree 870542/s -- -4% -27% -36% casetwo 908848/s 4% -- -23% -33% caseone 1187190/s 36% 31% -- -13% casefour 1362630/s 57% 50% 15% --

    This is Perl 5.8.8.

      If you just rearrange the hash to reverse order from the original you also get different answers:
      Rate caseone casethree casetwo casefour caseone 304004/s -- -10% -20% -34% casethree 337272/s 11% -- -11% -26% casetwo 379359/s 25% 12% -- -17% casefour 458814/s 51% 36% 21% --

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Ways to delete start of string
by mwah (Hermit) on May 24, 2008 at 22:38 UTC

    After reading the other remarks here and (wildly) guessing that your string invocation and copying may look not look realistic for tasks usually done with character substitutions, I tried to make the test more illustrative - (re-ordered your code and added some meaningful names ;-)

    The output is subsequently generated for different string sizes, from 2x10^1 to 2x10^4 bytes of length:

    use strict; # Purpose: In each benchmark invocatio +n, have one use warnings; # (constant) string copied to another- + which is then # modified (shortened by the first cha +racter) and for my $n (1..4) { # touched again (length determined and + compared) use Benchmark qw(cmpthese); my $org_str = '|0' x 10**$n; # generate the string in local scope my $mod_str = $org_str; # do some allocation on the other stri +ng's PV print "string length: " . length($org_str) . "\n"; cmpthese( -3, { regexsubst => sub { # copy and modify ($mod_str = $org_str) =~ s/.//; die unless length($mod_str)+1 == length($org_str +) }, substr_rhs => sub { # there's no point full string copy, simply c +opy what's needed $mod_str = substr($org_str, 1); die unless length($mod_str)+1 == length($org_str +) }, substr_lhs => sub { # copy and modify substr($mod_str = $org_str, 0, 1) = ''; die unless length($mod_str)+1 == length($org_str +) }, reversestr => sub { # reverse, copy, modify, reverse chop($mod_str = reverse($org_str)); $mod_str = reverse $mod_str; die unless length($mod_str)+1 == length($org_str +) } } ); print '- ' x 30, "\n" }

    On my machine (5.10), the right-side substr() wins almost always (if the string in question is not longer than some KB), as the reverse-chop-reverse looses. The regex-subst performance approaches the substr speed asymptotically as the string gets longer - but seems to be slower as the left-side-substr() on the shortest string tested.

    The funny part is: the left-side substr() will beat the right side substr() if the string exceeds some (larger) size. I wouldn't have thought of this one!

    Results:

    Update:

    After studying Moritz' Code (there is one - but hidden behind readmore-tags ;-), the results are compatible (he tested long strings too), but Moritz wasn't curious about it - so it might come as expected.

    Apparently, there seems to be no 'flip-flop', as hsmyers suggested - the left-side substr() simply takes over on longer strings (above several KB).

    my € 0.02

    mwa

      Both the 4-arg substr (I find left-side a confusing term) and s/// avoid copying, so should be more or less constant. They do this by just adjusting the beginning pointer into the string buffer and noting the offset used in the slot usually used for integer value. This is called the OOK hack (OOK being the flag set to indicate the integer slot is storing an offset).
        This is called the OOK hack ...

        Thanks for hinting on this, after your explanation I can see the pattern now.

        I find left-side a confusing term

        I was under the impression that 'left hand side term' vs. 'right hand side ...' would be a terminus technicus here (?)

        Thanks & Regards

        mwa

      I'm just looking at the results; and there is a flip-flop from first invocation to second to third. Or am I blind?

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
        and there is a flip-flop from first invocation to second to third.

        Maybe that's a semantical thing missed by me. Flip-flop implies, imho, two changes of positions - so I thought you spoke about these "two changes" - my mistake.

        Regards

        mwa

Re: Ways to delete start of string (The /s modifier)
by lodin (Hermit) on May 25, 2008 at 19:40 UTC

    A nit-pick, but the substitution should really be

    s/.//s
    to be equivalent with the others. Note the s modifier; it makes . match leading newlines as well. Otherwise you remove the first non-newline character from the string.
    my %ex = ( "\n" . 'regexp' => sub { s/.// }, "\n" . 'regexp/s' => sub { s/.//s }, "\n" . 'chop' => sub { $_ = reverse; chop; $_ = reverse; }, "\n" . 'substr1' => sub { substr($_,0,1) = '' }, "\n" . 'substr2' => sub { $_ = substr($_,1) }, ); for (sort keys %ex) { $ex{$_}->(); print "[$_]\n"; } __END__ [chop] [ egexp] [regexp/s] [substr1] [substr2]

    Update: Added example.

    lodin

      Color me confused. The goal is to eliminate the first character. Given that removing 'the first non-newline character' is precisely what is desired. Please explain what you mean. Unless of course you are thinking that a newline might be the first character in the general case. Then that would be correct. Admittedly that never occurred to me. For no particular reason I tend to thing of newlines occurring at the end of a string and usually removed by chop/chomp at that. Of course there are times when I'm dealing with an entire as is buffer in which case there would be an expectation of embedded newlines but even at that it seems unlikely that they would wind up in a string that I would want to do a reverse chop on. Like I said please explain...

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Ways to delete start of string
by ikegami (Patriarch) on May 27, 2008 at 02:28 UTC

    I just found out the time taken by a string assignment is not constant for a given argument. It's dependent on the previous state of the variable to which the string is assigned.

    In your test, the time taken by $_ = '|0|0|0|0|0|0|'; is not constant because the previous state of $_ isn't constant. That means you aren't testing what you think you are testing. Using a lexical instead of $_ solves that problem.

    Your tests really shouldn't be in subs either. They add a serious overhead, especially since your data is so small.

    use strict; use warnings; use Benchmark qw( cmpthese ); my %tests = ( subst => '$x =~ s/.//;', substr_lval => 'substr($x,0,1) = "";', substr_mod => 'substr($x,0,1,"");', reverse => '$x = reverse $x; chop($x); $x = reverse($x);', substr_copy => '$x = substr($x,1);', ); for (values %tests) { $_ = 'use strict; use warnings; my $x = "|0|0|0|0|0|0|"; ' . $_; } cmpthese(-5, \%tests);

      You are still testing subroutine call speed rather than the snippets you purport to be testing.

      Whatever code snippets you supply to benchmark, get wrapped internally into subs (See Benchmark::runloop.)

      By using strings instead of subs, you have removed one layer of indirection, but you are still swamping the time taken for the code under test, by the time taken to invoke the subroutine that gets wrapped around it.

      The only way to get anything like an accurate measurement for this type of micro-benchmark, is to add a multiplier loop inside the subroutine Benchmark constructs, so as to amortise the costs of calling that sub over a large number of iterations, to give a+(k/1e4) ~= b/(k/1e4). (* or whatever multiplier is appropriate.)

      Also, I'm not sure what the cost of use strict; and use warnings is, when they have already been loaded, but there must be some if only to discover they are already loaded plus the calling of (or attempted call of) their import subs.

      As Benchmark already adds use strict to the subs it constructs, that's pure duplication. And as it already has use warnings in force internally, when it eval's the subs into existance, I don't think you are gaining anything by adding it to the code that gets eval'd. You are simply mudding the waters further by adding another fixed cost to the tests.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Whatever code snippets you supply to benchmark, get wrapped internally into subs

        That's true. That's why I usually do

        $_ = "use strict; use warnings; for (1..10_000) { my \$x = '|0|0|0|0|0 +|0|'; $_ }";

        to minimize the cost of that sub call.

        Also, I'm not sure what the cost of use strict; and use warnings is, when they have already been loaded,

        Zero. use is executed once, at compile-time. It doesn't generate any code in the tree.

        >perl -MO=Concise -e"use strict; print 'a'" 6 <@> leave[1 ref] vKP/REFC ->(end) 1 <0> enter ->2 2 <;> nextstate(main 2 -e:1) v/2 ->3 5 <@> print vK ->6 3 <0> pushmark s ->4 4 <$> const[PV "a"] s ->5 -e syntax OK
        'purport'! You are suggesting that I'm not testing? Or is this one of those pond translation errors? Must be since I'm clearly posting results. Good point about strict and warnings They are an artifact of a 'new' file in my editor. Could you give an example of 'add a multiplier loop inside the subroutine'? While I still insist that I'm looking for better ways to pre-chop, learning about benchmarking is both fascinating and useful.

        --hsm

        "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
      The results of this formulation are:
      Rate substr_lval subst reverse substr_mod s +ubstr_copy substr_lval 489949/s -- -9% -37% -67% + -69% subst 539218/s 10% -- -31% -63% + -66% reverse 776376/s 58% 44% -- -47% + -52% substr_mod 1473954/s 201% 173% 90% -- + -8% substr_copy 1606272/s 228% 198% 107% 9% + --
      Not hard to see the problem with $_. That said I'm not sure that what you say about how things shouldn't be in a sub is correct. Shouldn't it factor out since it would be true for all cases?

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."

        Shouldn't it factor out since it would be true for all cases?

        No. (a+k)/(b+k) is not equal to a/b. And it diminishes the value of the absolute numbers.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://688308]
Approved by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (6)
As of 2024-04-26 09:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found