Ovid has asked for the wisdom of the Perl Monks concerning the following question:
So I'm playing around with recursive subroutines and have results that are a bit confusing. Consider a normal recursive function:
sub factorial {
my $num = shift;
return 1 if 1 == $num;
return $num * factorial($num = 1);
}
Ordinarily, we expect this to have quite a bit of overhead. Part of the reason for this is that we have deferred part of the calculation. For example, here's what this looks like if we try to calculate 5!
1: factorial(5)
2: (5 * factorial(4))
3: (5 * (4 * factorial(3)))
4: (5 * (4 * (3 * factorial(2))))
5: (5 * (4 * (3 * (2 * factorial(1)))))
6: (5 * (4 * (3 * (2 * 1))))
7: (5 * (4 * (3 * 2)))
8: (5 * (4 * 6))
9: (5 * 24)
10: (120)
The problem with this is that we must defer part of the calculation until we get to the terminal condition (1 == $num, or step 5) of the recursive function. Then we go back through the call stack and finish the calculations. In theory, we get around this by building our our recursive function in such a way so that we don't need to walk back through the call stack, but instead have the value calculated every step of the way and return directly from the last call, which could allow us to skip steps 6 through 9 above. We can do this by rewriting our factorial function to keep track of the current value of the factorial and using the magical goto &sub notation.
sub factorial {
my ($num) = @_;
@_ = (1, 1, $num);
goto &_factorial;
}
+
sub _factorial {
my ($result, $counter, $max) = @_;
return $result if $counter > $max;
@_ = (($result * $counter), ($counter + 1), $max);
goto &_factorial;
}
After playing with this, I realized that I could do even better by using a lexically scoped subroutine and skipping the symbol table lookup, though it winds up being more cumbersome to program.
my $_factorial;
$_factorial = sub {
my ($result, $counter, $max) = @_;
return $result if $counter > $max;
@_ = (($result * $counter), ($counter + 1), $max);
goto $_factorial;
};
sub factorial {
my ($num) = @_;
@_ = (1, 1, $num);
goto $_factorial;
}
While my benchmarks show that the lexically scoped subroutine with goto is slightly faster than the first version using goto, it's considerably slower than the straight recursive function. pdcawley noted in Pure Perl tail call optimization that using this form of goto is not very fast; I'm curious to know why it's so ridiculously slow. I've tested this on 5.8.0 and 5.9.1 (I mention 5.9.1 to make it clear that the memory leak associated with assignment to @_ and calling goto is not the culprit). Benchmark code and results below. Did I do something stupid?
I might add that I know an iterative solution is faster. I'm just trying to figure out why the goto is slow.
#!/usr/local/bin/perl
use warnings;
use strict;
use Benchmark;
use Test::More tests => 3;
my $_fact3;
$_fact3 = sub {
my ($result, $counter, $max) = @_;
return $result if $counter > $max;
@_ = (($result * $counter), ($counter + 1), $max);
goto $_fact3;
};
my $number = shift or die "No number supplied";
validate_factorial_routines($number);
print "Timing factorial of $number\n";
timethese(10000, {
'recursion' => sub { fact1($number) },
'typeglob' => sub { fact2($number) },
'lexical' => sub { fact3($number) },
});
sub validate_factorial_routines {
my $num = shift;
my @result = (fact1($num), fact2($num), fact3($num));
is(fact1(5), 120, 'fact1 should return the correct amount');
is(fact2(5), 120, 'fact2 should return the correct amount');
is(fact3(5), 120, 'fact3 should return the correct amount');
}
sub fact1 {
my ($num) = @_;
return _fact1(1,1,$num);
}
sub _fact1 {
my ($result, $counter, $max) = @_;
return $result if $counter > $max;
return _fact1(($result * $counter), ($counter + 1), $max);
}
sub fact2 {
my ($num) = @_;
@_ = (1, 1, $num);
goto &_fact2;
}
sub _fact2 {
my ($result, $counter, $max) = @_;
return $result if $counter > $max;
@_ = (($result * $counter), ($counter + 1), $max);
goto &_fact2;
}
sub fact3 {
my ($num) = @_;
@_ = (1, 1, $num);
goto $_fact3;
}
__END__
1..3
ok 1 - fact1 should return the correct amount
ok 2 - fact2 should return the correct amount
ok 3 - fact3 should return the correct amount
Testing factorial of 30
120
Benchmark: timing 10000 iterations of lexical, recursion, typeglob...
lexical: 4 wallclock secs ( 4.05 usr + 0.00 sys = 4.05 CPU) @ 24
+69.14/s (n=10000)
recursion: 1 wallclock secs ( 1.45 usr + 0.00 sys = 1.45 CPU) @ 68
+96.55/s (n=10000)
typeglob: 5 wallclock secs ( 4.28 usr + 0.00 sys = 4.28 CPU) @ 23
+36.45/s (n=10000)
Update: I realized that I did one thing different in the tests. If I (uselessly) duplicate the assignment to @_ in &_fact1, it's not much faster, but it's still faster.
1..3
ok 1 - fact1 should return the correct amount
ok 2 - fact2 should return the correct amount
ok 3 - fact3 should return the correct amount
Timing factorial of 30
Benchmark: timing 10000 iterations of lexical, recursion, typeglob...
lexical: 4 wallclock secs ( 4.05 usr + 0.00 sys = 4.05 CPU) @ 24
+69.14/s (n=10000)
recursion: 4 wallclock secs ( 3.87 usr + 0.00 sys = 3.87 CPU) @ 25
+83.98/s (n=10000)
typeglob: 4 wallclock secs ( 4.28 usr + 0.00 sys = 4.28 CPU) @ 23
+36.45/s (n=10000)
Re: Why is goto &sub slow?
by broquaint (Abbot) on Mar 29, 2004 at 01:30 UTC
|
When you perform a goto &sub you're not just calling the function, but you're also subverting the stack by destroying the current frame and any changes made by local within that frame. That's probably not quite the whole story, but I'd say that'd be the major cause of the discrepancy. Also you're tests aren't quite accurate in that the standard subroutine calls don't populate @_ whereas the goto &sub subroutines do. If you modify the tests you'll find the discrepancy will decrease1.
1 you'll find the test modified benchmark and results in the readmore below ...
| [reply] [Watch: Dir/Any] [d/l] |
|
return _fact0(($result * $counter), ($counter + 1), $max);
That makes recursion 1 twice as fast as recursion 2
And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
(Terry Pratchett, Small Gods)
| [reply] [Watch: Dir/Any] [d/l] |
Re: Why is goto &sub slow?
by kvale (Monsignor) on Mar 29, 2004 at 01:37 UTC
|
tilly posed the same question on p5p. That goto &sub is slow is a known issue at perl5porters. And in a subsequent message , it was claimed that "the C<goto &NAME> construct is not for tail recursion, but is for fooling the C<caller> function. ". I think someone also claimed that goto LABEL does not have the speed penalty of goto &sub. A further message notes that some memory leaks associated with goto &sub are cleaned up in 5.8.1.
This does not answer your question: why is
goto &sub slow? I don't know a definitive answer, but I gather it takes time to rearrange the stack and clean up the lexical pad before jumping to the new routine.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
From L-R's updated benchmark and from your benchmark, recursion is only 2-10 percent faster than goto &sub. As you say above, both recursion and goto &sub create and destroy two frames. So the results seem consistent to me. Unless you are refering to that 10 percent slowdown for goto &sub? I don't know what is happening there.
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Why is goto &sub slow?
by dragonchild (Archbishop) on Mar 29, 2004 at 01:36 UTC
|
{
my ($end, $curr, $result);
my $_factorial;
$_factorial = sub {
return $result if $curr > $end;
$result *= $curr++;
$_factorial->();
};
sub fact4 {
($end, $curr, $result) = (shift, 1, 1);
$_factorial->();
}
}
My benchmarking indicates that my version is 30% faster than the standard recursion and 6 times faster than goto.
dragonchild: 4 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 1
+22695.46/s (n=392012)
lexical: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 21
+976.48/s (n=68215)
recursion: 3 wallclock secs ( 3.08 usr + 0.00 sys = 3.08 CPU) @ 91
+202.59/s (n=281360)
typeglob: 4 wallclock secs ( 3.14 usr + 0.00 sys = 3.14 CPU) @ 21
+142.77/s (n=66494)
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
But that's still full recursion, not tail call elimination, which is what goto &sub is.
jdporter The 6th Rule of Perl Club is -- There is no Rule #6.
| [reply] [Watch: Dir/Any] [d/l] |
|
{
my ($end, $curr, $result);
my $_factorial;
$_factorial = sub {
return $result if $curr > $end;
$result *= $curr++;
goto &$_factorial;
};
sub fact5 {
($end, $curr, $result) = (shift, 1, 1);
$_factorial->();
}
}
And, the benchmarking goes something like:
Timing factorial of 5
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1: 3 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @
+106369.76/s (n=339745)
dragonchild2: 3 wallclock secs ( 3.05 usr + 0.00 sys = 3.05 CPU) @
+89930.56/s (n=274558)
lexical: 3 wallclock secs ( 3.06 usr + 0.00 sys = 3.06 CPU) @ 29
+376.63/s (n=90010)
recursion: 4 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 68
+035.79/s (n=216694)
typeglob: 3 wallclock secs ( 3.01 usr + 0.00 sys = 3.01 CPU) @ 27
+140.10/s (n=81556)
Rate typeglob lexical recursion dragonchild2 dr
+agonchild1
typeglob 27140/s -- -8% -60% -70%
+ -74%
lexical 29377/s 8% -- -57% -67%
+ -72%
recursion 68036/s 151% 132% -- -24%
+ -36%
dragonchild2 89931/s 231% 206% 32% --
+ -15%
dragonchild1 106370/s 292% 262% 56% 18%
+ --
In other words, my tail-call is 90k/second, straight recursion is 70k/second, but my straight recursion is 105k/second. The regular tail-call are in the 25-30k/second. I lose about 10% going from recursion to tail-call.
Now, when I shift from factorial(5) to factorial(15) and factorial(25), the benchmarking looks something like:
Timing factorial of 15
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1: 3 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @
+47052.93/s (n=141347)
dragonchild2: 3 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @
+40502.36/s (n=128595)
lexical: 4 wallclock secs ( 3.24 usr + 0.00 sys = 3.24 CPU) @ 11
+222.80/s (n=36418)
recursion: 3 wallclock secs ( 3.12 usr + 0.00 sys = 3.12 CPU) @ 26
+999.68/s (n=84104)
typeglob: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 10
+381.32/s (n=32234)
Rate typeglob lexical recursion dragonchild2 dr
+agonchild1
typeglob 10381/s -- -7% -62% -74%
+ -78%
lexical 11223/s 8% -- -58% -72%
+ -76%
recursion 27000/s 160% 141% -- -33%
+ -43%
dragonchild2 40502/s 290% 261% 50% --
+ -14%
dragonchild1 47053/s 353% 319% 74% 16%
+ --
------------------------------------
Timing factorial of 25
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1: 3 wallclock secs ( 3.01 usr + 0.00 sys = 3.01 CPU) @
+29099.20/s (n=87705)
dragonchild2: 3 wallclock secs ( 3.12 usr + 0.00 sys = 3.12 CPU) @
+25319.04/s (n=79122)
lexical: 4 wallclock secs ( 3.30 usr + 0.00 sys = 3.30 CPU) @ 68
+28.22/s (n=22499)
recursion: 4 wallclock secs ( 3.23 usr + 0.00 sys = 3.23 CPU) @ 16
+312.62/s (n=52755)
typeglob: 3 wallclock secs ( 3.23 usr + 0.00 sys = 3.23 CPU) @ 61
+14.10/s (n=19773)
Rate typeglob lexical recursion dragonchild2 dr
+agonchild1
typeglob 6114/s -- -10% -63% -76%
+ -79%
lexical 6828/s 12% -- -58% -73%
+ -77%
recursion 16313/s 167% 139% -- -36%
+ -44%
dragonchild2 25319/s 314% 271% 55% --
+ -13%
dragonchild1 29099/s 376% 326% 78% 15%
+ --
I have no idea why my recursion is still beating my tail-call. The ratio is mostly staying the same, at least between my versions. At 80, this is what happens:
Timing factorial of 80
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1: 4 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @
+8834.80/s (n=28130)
dragonchild2: 4 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @
+7694.39/s (n=24422)
lexical: 3 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 20
+67.44/s (n=6438)
recursion: 3 wallclock secs ( 3.25 usr + 0.00 sys = 3.25 CPU) @ 50
+47.46/s (n=16379)
typeglob: 3 wallclock secs ( 3.32 usr + 0.00 sys = 3.32 CPU) @ 18
+79.40/s (n=6249)
Rate typeglob lexical recursion dragonchild2 dr
+agonchild1
typeglob 1879/s -- -9% -63% -76%
+ -79%
lexical 2067/s 10% -- -59% -73%
+ -77%
recursion 5047/s 169% 144% -- -34%
+ -43%
dragonchild2 7694/s 309% 272% 52% --
+ -13%
dragonchild1 8835/s 370% 327% 75% 15%
+ --
I'm not sure what all those numbers really mean, but there they are. :-)
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Why is goto &sub slow?
by biosysadmin (Deacon) on Mar 29, 2004 at 04:06 UTC
|
In the spirit of TIMTOWTDI, here's another option for people who are interested:
It may not be exactly what you're talking about, but your thread made me think of it. :)
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
Re: Why is goto &sub slow?
by zentara (Archbishop) on Mar 29, 2004 at 17:40 UTC
|
Well it may be getting a little off-topic for your question, but how about how parrot and perl6 will do it?
#factorial.pasm
print "The first 15 factorials are:\n"
set I1, 0
set I2, 15
set I3, 1
REDO: inc I1
mul I3, I3, I1
print I3
print "\n"
lt I1, I2, REDO
DONE: end
To run it,
parrot -o factorial.pbc factorial.pasm
then
parrot factorial.pbc
The whole parrot assembly register setup is so cool,
I think alot of Perl6 will be "Inline::Pasm", :-)
I'm not really a human, but I play one on earth.
flash japh
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Why is goto &sub slow?
by bunnyman (Hermit) on Mar 29, 2004 at 16:48 UTC
|
| [reply] [Watch: Dir/Any] |
|
|