Ways to delete start of string

hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Ways to delete start of string by moritz (Cardinal) on May 24, 2008 at 16:28 UTC
The speed depends on the length of the string.

    # string length 2 * 10:
                     Rate     reverse       regex  substr_mod susbtr_copy
    reverse      797484/s          --        -29%        -30%        -41%
    regex       1125463/s         41%          --         -1%        -16%
    substr_mod  1131927/s         42%          1%          --        -16%
    susbtr_copy 1344489/s         69%         19%         19%          --

    # string length 2 * 100_000:
                  Rate     reverse susbtr_copy       regex  substr_mod
    reverse     1385/s          --        -51%        -60%        -80%
    susbtr_copy 2847/s        106%          --        -17%        -58%
    regex       3437/s        148%         21%          --        -49%
    substr_mod  6771/s        389%        138%         97%          --

    # with 2 * 1e7:
                  Rate     reverse susbtr_copy       regex  substr_mod
    reverse     8.61/s          --        -54%        -58%        -81%
    susbtr_copy 18.7/s        117%          --         -8%        -58%
    regex       20.4/s        137%          9%          --        -54%
    substr_mod  44.6/s        418%        139%        119%          --

The speed doesn't really depend on the perl version (I tried 5.8.8 and 5.10.0).
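A benchmark along these lines could look like the sketch below. The labels mirror the ones in the results above, but the sub bodies, the `%strip` hash and the default lengths are assumptions, not the code that produced those numbers. Each candidate copies its argument first, so the copy cost is included in every measurement.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Four ways to drop the first character. Each takes a string and
# returns the shortened copy. (Bodies are assumptions for illustration.)
my %strip = (
    regex       => sub { my $s = shift; $s =~ s/.//s; $s },
    substr_mod  => sub { my $s = shift; substr($s, 0, 1, ''); $s },
    substr_copy => sub { my $s = shift; $s = substr($s, 1); $s },
    reverse     => sub { my $s = reverse shift; chop $s; scalar reverse $s },
);

# String lengths may be given on the command line; otherwise use two sizes.
my @lengths = @ARGV ? @ARGV : (20, 200_000);

for my $len (@lengths) {
    my $orig = 'ab' x ($len / 2);
    print "string length $len:\n";

    # Wrap every candidate so that it always sees the same input string.
    my %tests;
    for my $name (keys %strip) {
        my $code = $strip{$name};
        $tests{$name} = sub { $code->($orig) };
    }

    cmpthese(-1, \%tests);   # at least 1 CPU second per candidate
}
```

Running it with a much larger length as argument should show whether the ranking changes on your own perl.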
Re^2: Ways to delete start of string by hsmyers (Canon) on May 24, 2008 at 20:25 UTC
'reverse' seems fairly intuitive: the longer the string, the greater the work (times two, in fact). 'regex' looks like it is seeking constant time of sorts. I'm clueless on the flip flop with 'substr_mod' and 'substr_copy'.

--hsm "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re^3: Ways to delete start of string by moritz (Cardinal) on May 24, 2008 at 20:39 UTC
If you really want to know, look in `pp.c` in the perl source; in blead it's in lines 3055 to 3223. I skimmed it quickly and still have no clue how it does things: too many macros, too little knowledge on my part.
Re^4: Ways to delete start of string by hsmyers (Canon) on May 24, 2008 at 22:33 UTC
Re^2: Ways to delete start of string (OOK) by ikegami (Patriarch) on May 27, 2008 at 03:14 UTC
The time needed to perform `$x =~ s/.//;` and `substr($x,0,1) = '';` is not related to the length of $x, thanks to the "OOK" optimization.

Instead of allocating a new buffer, copying the string to the new buffer minus the leading char, assigning the new buffer to the variable and freeing the old buffer, those two operations increment the pointer to the buffer in the variable, assign 1 to the IV slot* of the variable if the OOK flag is off or increment the IV slot* of the variable if the OOK flag is on, and turn on the OOK flag.

The string is never copied. You can see this in effect in the following snippet:

    >perl -MDevel::Peek -e"my $x='abcdef'; Dump($x); substr($x,0,1)=''; Dump($x);"
    SV = PV(0x226104) at 0x2252e8
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x182ca64 "abcdef"\0
      CUR = 6
      LEN = 8
    SV = PVIV(0x227134) at 0x2252e8
      REFCNT = 2
      FLAGS = (PADBUSY,PADMY,POK,OOK,pPOK)
      IV = 1  (OFFSET)
      PV = 0x182ca65 ( "a" . ) "bcdef"\0
      CUR = 5
      LEN = 7

If POK is true and OOK is false, then

    string start              = PV
    string length             = CUR
    number of bytes allocated = LEN
    start of buffer           = PV

If POK is true and OOK is true, then

    string start              = PV
    string length             = CUR
    number of bytes allocated = LEN + IV
    start of buffer           = PV - IV

* Nicholas Clark recently made a change to Perl so that the chopped bytes are used instead of the IV slot. That hasn't appeared in any Perl release yet.

Update: Changed `substr($x,0,1,'');` to `substr($x,0,1)='';` since the rest of the thread used the latter.
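As a quick check of the two rules, the values from the second Dump can be plugged in by hand. The `buffer_info` helper and the `%sv` hash are hypothetical names used only for this sketch; the numbers themselves are copied from the output above.

```perl
use strict;
use warnings;

# Field values copied from the second Dump (after the leading char was removed).
my %sv = (
    PV  => 0x182ca65,   # current string start
    CUR => 5,           # string length
    LEN => 7,           # bytes available from PV onward
    IV  => 1,           # offset recorded in the IV slot
    OOK => 1,           # flag saying the IV slot holds an offset
);

# Hypothetical helper applying the POK/OOK rules listed above.
sub buffer_info {
    my (%f) = @_;
    my $offset = $f{OOK} ? $f{IV} : 0;
    return (
        start     => $f{PV} - $offset,    # start of buffer
        allocated => $f{LEN} + $offset,   # number of bytes allocated
    );
}

my %buf = buffer_info(%sv);
printf "start of buffer = 0x%x, bytes allocated = %d\n",
    $buf{start}, $buf{allocated};
# prints "start of buffer = 0x182ca64, bytes allocated = 8"
```

Both results match the first Dump (PV = 0x182ca64, LEN = 8), which is what you would expect if the buffer was never reallocated.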
Re: Ways to delete start of string by chromatic (Archbishop) on May 24, 2008 at 16:28 UTC
As written:

                   Rate casethree   casetwo   caseone  casefour
    casethree  890174/s        --       -3%      -26%      -34%
    casetwo    919255/s        3%        --      -24%      -32%
    caseone   1205935/s       35%       31%        --      -11%
    casefour  1357009/s       52%       48%       13%        --

Moving `casefour` to the start of the file (as the first declared subroutine):

                   Rate casethree   casetwo  casefour   caseone
    casethree  836916/s        --       -8%      -31%      -33%
    casetwo    908163/s        9%        --      -25%      -28%
    casefour  1215803/s       45%       34%        --       -3%
    caseone   1255655/s       50%       38%        3%        --

Moving `casetwo` to the start of the file:

                   Rate casethree   casetwo   caseone  casefour
    casethree  870542/s        --       -4%      -27%      -36%
    casetwo    908848/s        4%        --      -23%      -33%
    caseone   1187190/s       36%       31%        --      -13%
    casefour  1362630/s       57%       50%       15%        --

This is Perl 5.8.8.
Re^2: Ways to delete start of string by hsmyers (Canon) on May 24, 2008 at 20:22 UTC
If you just rearrange the hash to reverse order from the original you also get different answers:

                  Rate   caseone casethree   casetwo  casefour
    caseone   304004/s        --      -10%      -20%      -34%
    casethree 337272/s       11%        --      -11%      -26%
    casetwo   379359/s       25%       12%        --      -17%
    casefour  458814/s       51%       36%       21%        --
Re: Ways to delete start of string by mwah (Hermit) on May 24, 2008 at 22:38 UTC
After reading the other remarks here and (wildly) guessing that your string invocation and copying may not look realistic for tasks usually done with character substitutions, I tried to make the test more illustrative (re-ordered your code and added some meaningful names ;-)

The output is subsequently generated for different string sizes, from 2x10^1 to 2x10^4 bytes of length:

    use strict;                        # Purpose: In each benchmark invocation, have one
    use warnings;                      # (constant) string copied to another, which is then
                                       # modified (shortened by the first character) and
    for my $n (1..4) {                 # touched again (length determined and compared)
       use Benchmark qw(cmpthese);

       my $org_str = '|0' x 10**$n;    # generate the string in local scope
       my $mod_str = $org_str;         # do some allocation on the other string's PV

       print "string length: " . length($org_str) . "\n";

       cmpthese( -3, {
           regexsubst => sub {  # copy and modify
                   ($mod_str = $org_str) =~ s/.//;
                   die unless length($mod_str)+1 == length($org_str)
               },
           substr_rhs => sub {  # no point in a full string copy, simply copy what's needed
                   $mod_str = substr($org_str, 1);
                   die unless length($mod_str)+1 == length($org_str)
               },
           substr_lhs => sub {  # copy and modify
                   substr($mod_str = $org_str, 0, 1) = '';
                   die unless length($mod_str)+1 == length($org_str)
               },
           reversestr => sub {  # reverse, copy, modify, reverse
                   chop($mod_str = reverse($org_str));
                   $mod_str = reverse $mod_str;
                   die unless length($mod_str)+1 == length($org_str)
               }
         }
       );
       print '- ' x 30, "\n"
    }

On my machine (5.10), the right-side substr() wins almost always (if the string in question is not longer than some KB), while the reverse-chop-reverse loses. The regex-subst performance approaches the substr speed asymptotically as the string gets longer, but seems to be slower than the left-side substr() on the shortest string tested.

The funny part is: the left-side substr() will beat the right-side substr() if the string exceeds some (larger) size. I wouldn't have thought of this one!

Update: After studying Moritz' code (there is one, but hidden behind readmore-tags ;-), the results are compatible (he tested long strings too), but Moritz wasn't curious about it, so it might come as expected. Apparently, there seems to be no 'flip-flop', as hsmyers suggested: the left-side substr() simply takes over on longer strings (above several KB).

my € 0.02
mwa
Re^2: Ways to delete start of string by ysth (Canon) on May 25, 2008 at 06:13 UTC
Both the 4-arg substr (I find left-side a confusing term) and s/// avoid copying, so they should take more or less constant time. They do this by just adjusting the beginning pointer into the string buffer and noting the offset used in the slot usually used for the integer value. This is called the OOK hack (OOK being the flag set to indicate the integer slot is storing an offset).
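One way to watch this happen is to dump the scalar after each operation. This is a minimal sketch using the core Devel::Peek module; the test string is arbitrary, and whether OOK shows up in FLAGS (and how the offset is displayed) may differ between perl versions and builds.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $x = 'abcdefghij';

substr($x, 0, 1, '');   # 4-arg substr: replace the first char with nothing
Dump($x);               # written to STDERR; look for OOK in FLAGS

$x =~ s/.//s;           # substitution; /s so a leading newline would match too
Dump($x);

print "$x\n";           # prints "cdefghij"
```

If the PV address simply moves forward by one between the two dumps, no copy took place.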
Re^3: Ways to delete start of string by mwah (Hermit) on May 25, 2008 at 18:12 UTC
"This is called the OOK hack ..."

Thanks for the hint; after your explanation I can see the pattern now.

"I find left-side a confusing term"

I was under the impression that 'left hand side term' vs. 'right hand side ...' would be a terminus technicus here (?)

Thanks & Regards
mwa
Re^4: Ways to delete start of string by ikegami (Patriarch) on May 27, 2008 at 03:38 UTC
Re^2: Ways to delete start of string by hsmyers (Canon) on May 24, 2008 at 23:45 UTC
I'm just looking at the results; and there is a flip-flop from first invocation to second to third. Or am I blind?
Re^3: Ways to delete start of string by mwah (Hermit) on May 25, 2008 at 18:04 UTC
"and there is a flip-flop from first invocation to second to third."

Maybe that's a semantic point I missed. Flip-flop implies, imho, two changes of position, so I thought you spoke about these "two changes". My mistake.

Regards
mwa
Re: Ways to delete start of string (The /s modifier) by lodin (Hermit) on May 25, 2008 at 19:40 UTC
A nit-pick, but the substitution should really be

    s/.//s

to be equivalent with the others. Note the s modifier; it makes `.` match leading newlines as well. Otherwise you remove the first non-newline character from the string.

    my %ex = (
        "\n" . 'regexp'   => sub { s/.// },
        "\n" . 'regexp/s' => sub { s/.//s },
        "\n" . 'chop'     => sub { $_ = reverse; chop; $_ = reverse; },
        "\n" . 'substr1'  => sub { substr($_,0,1) = '' },
        "\n" . 'substr2'  => sub { $_ = substr($_,1) },
    );

    for (sort keys %ex) {
        $ex{$_}->();
        print "[$_]\n";
    }

    __END__
    [chop]
    [
    egexp]
    [regexp/s]
    [substr1]
    [substr2]

Update: Added example.

lodin
Re^2: Ways to delete start of string (The /s modifier) by hsmyers (Canon) on May 26, 2008 at 00:23 UTC
Color me confused. The goal is to eliminate the first character. Given that, removing 'the first non-newline character' is precisely what is desired. Please explain what you mean. Unless of course you are thinking that a newline might be the first character in the general case. Then that would be correct. Admittedly that never occurred to me. For no particular reason I tend to think of newlines occurring at the end of a string, and usually removed by `chop/chomp` at that. Of course there are times when I'm dealing with an entire as-is buffer, in which case there would be an expectation of embedded newlines, but even then it seems unlikely that they would wind up in a string that I would want to do a reverse chop on. Like I said, please explain...
Re: Ways to delete start of string by ikegami (Patriarch) on May 27, 2008 at 02:28 UTC
I just found out the time taken by a string assignment is not constant for a given argument. It's dependent on the previous state of the variable to which the string is assigned.

In your test, the time taken by `$_ = '|0|0|0|0|0|0|';` is not constant because the previous state of `$_` isn't constant. That means you aren't testing what you think you are testing. Using a lexical instead of `$_` solves that problem.

Your tests really shouldn't be in subs either. They add a serious overhead, especially since your data is so small.

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my %tests = (
        subst       => '$x =~ s/.//;',
        substr_lval => 'substr($x,0,1) = "";',
        substr_mod  => 'substr($x,0,1,"");',
        reverse     => '$x = reverse $x; chop($x); $x = reverse($x);',
        substr_copy => '$x = substr($x,1);',
    );

    for (values %tests) {
        $_ = 'use strict; use warnings; my $x = "|0|0|0|0|0|0|"; ' . $_;
    }

    cmpthese(-5, \%tests);
Re^2: Ways to delete start of string by BrowserUk (Patriarch) on May 27, 2008 at 10:13 UTC
You are still testing subroutine call speed rather than the snippets you purport to be testing.

Whatever code snippets you supply to Benchmark get wrapped internally into subs (see `Benchmark::runloop`). By using strings instead of subs, you have removed one layer of indirection, but you are still swamping the time taken by the code under test with the time taken to invoke the subroutine that gets wrapped around it.

The only way to get anything like an accurate measurement for this type of micro-benchmark is to add a multiplier loop inside the subroutine Benchmark constructs, so as to amortise the cost of calling that sub over a large number of iterations, to give `(a+(k/1e4))/(b+(k/1e4))` ~= `a/b`. (Use 1e4 or whatever multiplier is appropriate.)

Also, I'm not sure what the cost of `use strict;` and `use warnings` is when they have already been loaded, but there must be some, if only to discover they are already loaded, plus the calling of (or attempted call of) their `import` subs.

As Benchmark already adds `use strict` to the subs it constructs, that's pure duplication. And as it already has use warnings in force internally when it evals the subs into existence, I don't think you are gaining anything by adding it to the code that gets eval'd. You are simply muddying the waters further by adding another fixed cost to the tests.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
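A multiplier loop of the kind described above might be sketched as follows. The `$REPEAT` value, the `%snippets` names and the ten-character test string are assumptions chosen for illustration; the snippets are passed to Benchmark as strings, so the loop ends up inside the sub that Benchmark builds around them.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $REPEAT = 10_000;   # hypothetical multiplier; pick whatever suits the data

# Code under test, as strings operating on a lexical $x.
my %snippets = (
    subst       => '$x =~ s/.//s;',
    substr_mod  => 'substr($x, 0, 1, "");',
    substr_copy => '$x = substr($x, 1);',
);

# Wrap each snippet in a loop so one call of the generated sub
# performs $REPEAT removals instead of a single one.
my %tests;
for my $name (keys %snippets) {
    $tests{$name} =
        "for (1 .. $REPEAT) { my \$x = 'abcdefghij'; $snippets{$name} }";
}

cmpthese(-1, \%tests);
```

The reported rates then count whole loops, so multiply by `$REPEAT` to get removals per second. The string assignment inside the loop is still a fixed cost shared by all candidates.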
Re^3: Ways to delete start of string by ikegami (Patriarch) on May 27, 2008 at 19:23 UTC
"Whatever code snippets you supply to benchmark, get wrapped internally into subs"

That's true. That's why I usually do

    $_ = "use strict; use warnings; for (1..10_000) { my \$x = '|0|0|0|0|0|0|'; $_ }";

to minimize the cost of that sub call.

"Also, I'm not sure what the cost of use strict; and use warnings is, when they have already been loaded,"

Zero. `use` is executed once, at compile-time. It doesn't generate any code in the tree.

    >perl -MO=Concise -e"use strict; print 'a'"
    6  <@> leave[1 ref] vKP/REFC ->(end)
    1     <0> enter ->2
    2     <;> nextstate(main 2 -e:1) v/2 ->3
    5     <@> print vK ->6
    3        <0> pushmark s ->4
    4        <$> const[PV "a"] s ->5
    -e syntax OK
Re^3: Ways to delete start of string by hsmyers (Canon) on May 27, 2008 at 17:12 UTC
'purport'! You are suggesting that I'm not testing? Or is this one of those pond translation errors? Must be, since I'm clearly posting results. Good point about `strict` and `warnings`. They are an artifact of a 'new' file in my editor. Could you give an example of 'add a multiplier loop inside the subroutine'? While I still insist that I'm looking for better ways to pre-chop, learning about benchmarking is both fascinating and useful.
Re^4: Ways to delete start of string by BrowserUk (Patriarch) on May 27, 2008 at 18:39 UTC
Re^5: Ways to delete start of string by hsmyers (Canon) on May 28, 2008 at 02:03 UTC
Re^4: Ways to delete start of string by kyle (Abbot) on May 27, 2008 at 17:36 UTC
Re^5: Ways to delete start of string (micro ops) by tye (Sage) on May 27, 2008 at 18:00 UTC
Re^2: Ways to delete start of string by hsmyers (Canon) on May 27, 2008 at 06:47 UTC
The results of this formulation are:

                     Rate substr_lval  subst reverse substr_mod substr_copy
    substr_lval  489949/s          --    -9%    -37%       -67%        -69%
    subst        539218/s         10%     --    -31%       -63%        -66%
    reverse      776376/s         58%    44%      --       -47%        -52%
    substr_mod  1473954/s        201%   173%     90%         --         -8%
    substr_copy 1606272/s        228%   198%    107%         9%          --

Not hard to see the problem with $_. That said, I'm not sure that what you say about how things shouldn't be in a sub is correct. Shouldn't it factor out since it would be true for all cases?
Re^3: Ways to delete start of string by ikegami (Patriarch) on May 27, 2008 at 06:54 UTC
"Shouldn't it factor out since it would be true for all cases?"

No. (a+k)/(b+k) is not equal to a/b. And it diminishes the value of the absolute numbers.
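A small numeric illustration of that inequality, with made-up costs: `measured_ratio` is a hypothetical helper, and the values 1, 2, 10 and 0.01 are chosen only to make the effect visible.

```perl
use strict;
use warnings;

# Hypothetical helper: the ratio a benchmark reports when both
# candidates carry the same fixed overhead $k per iteration.
sub measured_ratio {
    my ($t_a, $t_b, $k) = @_;
    return ($t_a + $k) / ($t_b + $k);
}

my ($t_a, $t_b) = (1, 2);   # made-up costs: a takes half as long as b

printf "no overhead:       %.3f\n", measured_ratio($t_a, $t_b, 0);     # 0.500
printf "overhead k = 10:   %.3f\n", measured_ratio($t_a, $t_b, 10);    # 0.917
printf "overhead k = 0.01: %.3f\n", measured_ratio($t_a, $t_b, 0.01);  # 0.502
```

With a large shared overhead the two candidates look nearly identical, even though one is twice as fast; only when k is small relative to a and b does the measured ratio approach a/b.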


Just another Perl shrine
	PerlMonks