Just another Perl shrine PerlMonks

### Why is goto &sub slow?

by Ovid (Cardinal)
 on Mar 29, 2004 at 01:01 UTC Need Help??
Ovid has asked for the wisdom of the Perl Monks concerning the following question:

So I'm playing around with recursive subroutines and have results that are a bit confusing. Consider a normal recursive function:

```sub factorial {
my \$num = shift;
return 1 if 1 == \$num;
return \$num * factorial(\$num = 1);
}```

Ordinarily, we expect this to have quite a bit of overhead. Part of the reason for this is that we have deferred part of the calculation. For example, here's what this looks like if we try to calculate 5!

```
1:  factorial(5)
2:  (5 * factorial(4))
3:  (5 * (4 * factorial(3)))
4:  (5 * (4 * (3 * factorial(2))))
5:  (5 * (4 * (3 * (2 * factorial(1)))))
6:  (5 * (4 * (3 * (2 * 1))))
7:  (5 * (4 * (3 * 2)))
8:  (5 * (4 * 6))
9:  (5 * 24)
10: (120)```

The problem with this is that we must defer part of the calculation until we get to the terminal condition (1 == \$num, or step 5) of the recursive function. Then we go back through the call stack and finish the calculations. In theory, we get around this by building our our recursive function in such a way so that we don't need to walk back through the call stack, but instead have the value calculated every step of the way and return directly from the last call, which could allow us to skip steps 6 through 9 above. We can do this by rewriting our factorial function to keep track of the current value of the factorial and using the magical goto &sub notation.

```sub factorial {
my (\$num) = @_;
@_ = (1, 1, \$num);
goto &_factorial;
}

+
sub _factorial {
my (\$result, \$counter, \$max) = @_;
return \$result if \$counter > \$max;
@_ = ((\$result * \$counter), (\$counter + 1), \$max);
goto &_factorial;
}

After playing with this, I realized that I could do even better by using a lexically scoped subroutine and skipping the symbol table lookup, though it winds up being more cumbersome to program.

```my \$_factorial;
\$_factorial = sub {
my (\$result, \$counter, \$max) = @_;
return \$result if \$counter > \$max;
@_ = ((\$result * \$counter), (\$counter + 1), \$max);
goto \$_factorial;
};

sub factorial {
my (\$num) = @_;
@_ = (1, 1, \$num);
goto \$_factorial;
}

While my benchmarks show that the lexically scoped subroutine with goto is slightly faster than the first version using goto, it's considerably slower than the straight recursive function. pdcawley noted in Pure Perl tail call optimization that using this form of goto is not very fast; I'm curious to know why it's so ridiculously slow. I've tested this on 5.8.0 and 5.9.1 (I mention 5.9.1 to make it clear that the memory leak associated with assignment to @_ and calling goto is not the culprit). Benchmark code and results below. Did I do something stupid?

I might add that I know an iterative solution is faster. I'm just trying to figure out why the goto is slow.

```#!/usr/local/bin/perl
use warnings;
use strict;
use Benchmark;
use Test::More tests => 3;

my \$_fact3;
\$_fact3 = sub {
my (\$result, \$counter, \$max) = @_;
return \$result if \$counter > \$max;
@_ = ((\$result * \$counter), (\$counter + 1), \$max);
goto \$_fact3;
};

my \$number = shift or die "No number supplied";

validate_factorial_routines(\$number);

print "Timing factorial of \$number\n";

timethese(10000, {
'recursion' => sub { fact1(\$number) },
'typeglob'  => sub { fact2(\$number) },
'lexical'   => sub { fact3(\$number) },
});

sub validate_factorial_routines {
my \$num = shift;
my @result = (fact1(\$num), fact2(\$num), fact3(\$num));
is(fact1(5), 120, 'fact1 should return the correct amount');
is(fact2(5), 120, 'fact2 should return the correct amount');
is(fact3(5), 120, 'fact3 should return the correct amount');
}

sub fact1 {
my (\$num) = @_;
return _fact1(1,1,\$num);
}

sub _fact1 {
my (\$result, \$counter, \$max) = @_;
return \$result if \$counter > \$max;
return _fact1((\$result * \$counter), (\$counter + 1), \$max);
}

sub fact2 {
my (\$num) = @_;
@_ = (1, 1, \$num);
goto &_fact2;
}

sub _fact2 {
my (\$result, \$counter, \$max) = @_;
return \$result if \$counter > \$max;
@_ = ((\$result * \$counter), (\$counter + 1), \$max);
goto &_fact2;
}

sub fact3 {
my (\$num) = @_;
@_ = (1, 1, \$num);
goto \$_fact3;
}

__END__
1..3
ok 1 - fact1 should return the correct amount
ok 2 - fact2 should return the correct amount
ok 3 - fact3 should return the correct amount
Testing factorial of 30
120
Benchmark: timing 10000 iterations of lexical, recursion, typeglob...
lexical:  4 wallclock secs ( 4.05 usr +  0.00 sys =  4.05 CPU) @ 24
+69.14/s (n=10000)
recursion:  1 wallclock secs ( 1.45 usr +  0.00 sys =  1.45 CPU) @ 68
+96.55/s (n=10000)
typeglob:  5 wallclock secs ( 4.28 usr +  0.00 sys =  4.28 CPU) @ 23
+36.45/s (n=10000)

Update: I realized that I did one thing different in the tests. If I (uselessly) duplicate the assignment to @_ in &_fact1, it's not much faster, but it's still faster.

```1..3
ok 1 - fact1 should return the correct amount
ok 2 - fact2 should return the correct amount
ok 3 - fact3 should return the correct amount
Timing factorial of 30
Benchmark: timing 10000 iterations of lexical, recursion, typeglob...
lexical:  4 wallclock secs ( 4.05 usr +  0.00 sys =  4.05 CPU) @ 24
+69.14/s (n=10000)
recursion:  4 wallclock secs ( 3.87 usr +  0.00 sys =  3.87 CPU) @ 25
+83.98/s (n=10000)
typeglob:  4 wallclock secs ( 4.28 usr +  0.00 sys =  4.28 CPU) @ 23
+36.45/s (n=10000)

Cheers,
Ovid

New address of my CGI Course.

Replies are listed 'Best First'.
Re: Why is goto &sub slow?
by broquaint (Abbot) on Mar 29, 2004 at 01:30 UTC
When you perform a goto &sub you're not just calling the function, but you're also subverting the stack by destroying the current frame and any changes made by local within that frame. That's probably not quite the whole story, but I'd say that'd be the major cause of the discrepancy. Also you're tests aren't quite accurate in that the standard subroutine calls don't populate @_ whereas the goto &sub subroutines do. If you modify the tests you'll find the discrepancy will decrease1.
HTH

_________
broquaint

1 you'll find the test modified benchmark and results in the readmore below ...
First, thanks Ovid and broquaint. This is a well done example of Perl, benchmarking and test in a different context than make test.

broquaint, if I understand Your example, there is a typo in sub _fact0, it should return

```    return _fact0((\$result * \$counter), (\$counter + 1), \$max);
That makes recursion 1 twice as fast as recursion 2

And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
(Terry Pratchett, Small Gods)

Re: Why is goto &sub slow?
by kvale (Monsignor) on Mar 29, 2004 at 01:37 UTC
tilly posed the same question on p5p. That  goto &sub is slow is a known issue at perl5porters. And in a subsequent message , it was claimed that "the C<goto &NAME> construct is not for tail recursion, but is for fooling the C<caller> function. ". I think someone also claimed that goto LABEL does not have the speed penalty of  goto &sub. A further message notes that some memory leaks associated with  goto &sub are cleaned up in 5.8.1.

This does not answer your question: why is  goto &sub slow? I don't know a definitive answer, but I gather it takes time to rearrange the stack and clean up the lexical pad before jumping to the new routine.

-Mark

That doesn't gibe. Here's the two sequences, as I understand them.

Standard recursion:

1. First call to func(). (Create first frame on the stack)
2. Second call to func(). (Create second frame on the stack)
3. Return from second call. (Destroy second frame on the stack)
4. Return from first call. (Destory first frame on the stack)

goto &sub;

1. First call to func(). (Create first frame on the stack)
2. goto &sub; (Destory first frame, create first frame)
3. Return from goto call. (Destory first frame)

I count two creations and two destructions of a stack frame. What am I missing? Plus, since goto &sub; doesn't have such a large stack, it takes less memory and lookups take less time. Right?

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

From L-R's updated benchmark and from your benchmark, recursion is only 2-10 percent faster than goto &sub. As you say above, both recursion and goto &sub create and destroy two frames. So the results seem consistent to me. Unless you are refering to that 10 percent slowdown for goto &sub? I don't know what is happening there.

-Mark

Re: Why is goto &sub slow?
by dragonchild (Archbishop) on Mar 29, 2004 at 01:36 UTC
```{
my (\$end, \$curr, \$result);
my \$_factorial;
\$_factorial = sub {
return \$result if \$curr > \$end;
\$result *= \$curr++;
\$_factorial->();
};

sub fact4 {
(\$end, \$curr, \$result) = (shift, 1, 1);
\$_factorial->();
}
}

My benchmarking indicates that my version is 30% faster than the standard recursion and 6 times faster than goto.

```dragonchild:  4 wallclock secs ( 3.19 usr +  0.00 sys =  3.19 CPU) @ 1
+22695.46/s (n=392012)
lexical:  3 wallclock secs ( 3.10 usr +  0.00 sys =  3.10 CPU) @ 21
+976.48/s (n=68215)
recursion:  3 wallclock secs ( 3.08 usr +  0.00 sys =  3.08 CPU) @ 91
+202.59/s (n=281360)
typeglob:  4 wallclock secs ( 3.14 usr +  0.00 sys =  3.14 CPU) @ 21
+142.77/s (n=66494)

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

But that's still full recursion, not tail call elimination, which is what goto &sub is.

jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.

Ok. fine.
```{
my (\$end, \$curr, \$result);
my \$_factorial;
\$_factorial = sub {
return \$result if \$curr > \$end;
\$result *= \$curr++;
goto &\$_factorial;
};

sub fact5 {
(\$end, \$curr, \$result) = (shift, 1, 1);
\$_factorial->();
}
}

And, the benchmarking goes something like:

```Timing factorial of 5
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1:  3 wallclock secs ( 3.19 usr +  0.00 sys =  3.19 CPU) @
+106369.76/s (n=339745)
dragonchild2:  3 wallclock secs ( 3.05 usr +  0.00 sys =  3.05 CPU) @
+89930.56/s (n=274558)
lexical:  3 wallclock secs ( 3.06 usr +  0.00 sys =  3.06 CPU) @ 29
+376.63/s (n=90010)
recursion:  4 wallclock secs ( 3.19 usr +  0.00 sys =  3.19 CPU) @ 68
+035.79/s (n=216694)
typeglob:  3 wallclock secs ( 3.01 usr +  0.00 sys =  3.01 CPU) @ 27
+140.10/s (n=81556)
Rate   typeglob    lexical  recursion dragonchild2 dr
+agonchild1
typeglob      27140/s         --        -8%       -60%         -70%
+      -74%
lexical       29377/s         8%         --       -57%         -67%
+      -72%
recursion     68036/s       151%       132%         --         -24%
+      -36%
dragonchild2  89931/s       231%       206%        32%           --
+      -15%
dragonchild1 106370/s       292%       262%        56%          18%
+        --

In other words, my tail-call is 90k/second, straight recursion is 70k/second, but my straight recursion is 105k/second. The regular tail-call are in the 25-30k/second. I lose about 10% going from recursion to tail-call.

Now, when I shift from factorial(5) to factorial(15) and factorial(25), the benchmarking looks something like:

```Timing factorial of 15
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1:  3 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @
+47052.93/s (n=141347)
dragonchild2:  3 wallclock secs ( 3.18 usr +  0.00 sys =  3.18 CPU) @
+40502.36/s (n=128595)
lexical:  4 wallclock secs ( 3.24 usr +  0.00 sys =  3.24 CPU) @ 11
+222.80/s (n=36418)
recursion:  3 wallclock secs ( 3.12 usr +  0.00 sys =  3.12 CPU) @ 26
+999.68/s (n=84104)
typeglob:  3 wallclock secs ( 3.10 usr +  0.00 sys =  3.10 CPU) @ 10
+381.32/s (n=32234)
Rate   typeglob     lexical  recursion dragonchild2 dr
+agonchild1
typeglob     10381/s         --         -7%       -62%         -74%
+      -78%
lexical      11223/s         8%          --       -58%         -72%
+      -76%
recursion    27000/s       160%        141%         --         -33%
+      -43%
dragonchild2 40502/s       290%        261%        50%           --
+      -14%
dragonchild1 47053/s       353%        319%        74%          16%
+        --
------------------------------------
Timing factorial of 25
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1:  3 wallclock secs ( 3.01 usr +  0.00 sys =  3.01 CPU) @
+29099.20/s (n=87705)
dragonchild2:  3 wallclock secs ( 3.12 usr +  0.00 sys =  3.12 CPU) @
+25319.04/s (n=79122)
lexical:  4 wallclock secs ( 3.30 usr +  0.00 sys =  3.30 CPU) @ 68
+28.22/s (n=22499)
recursion:  4 wallclock secs ( 3.23 usr +  0.00 sys =  3.23 CPU) @ 16
+312.62/s (n=52755)
typeglob:  3 wallclock secs ( 3.23 usr +  0.00 sys =  3.23 CPU) @ 61
+14.10/s (n=19773)
Rate   typeglob     lexical  recursion dragonchild2 dr
+agonchild1
typeglob      6114/s         --        -10%       -63%         -76%
+      -79%
lexical       6828/s        12%          --       -58%         -73%
+      -77%
recursion    16313/s       167%        139%         --         -36%
+      -44%
dragonchild2 25319/s       314%        271%        55%           --
+      -13%
dragonchild1 29099/s       376%        326%        78%          15%
+        --

I have no idea why my recursion is still beating my tail-call. The ratio is mostly staying the same, at least between my versions. At 80, this is what happens:

```Timing factorial of 80
Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ
+eglob for at least 3 CPU seconds...
dragonchild1:  4 wallclock secs ( 3.18 usr +  0.00 sys =  3.18 CPU) @
+8834.80/s (n=28130)
dragonchild2:  4 wallclock secs ( 3.17 usr +  0.00 sys =  3.17 CPU) @
+7694.39/s (n=24422)
lexical:  3 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 20
+67.44/s (n=6438)
recursion:  3 wallclock secs ( 3.25 usr +  0.00 sys =  3.25 CPU) @ 50
+47.46/s (n=16379)
typeglob:  3 wallclock secs ( 3.32 usr +  0.00 sys =  3.32 CPU) @ 18
+79.40/s (n=6249)
Rate    typeglob     lexical  recursion dragonchild2 dr
+agonchild1
typeglob     1879/s          --         -9%       -63%         -76%
+      -79%
lexical      2067/s         10%          --       -59%         -73%
+      -77%
recursion    5047/s        169%        144%         --         -34%
+      -43%
dragonchild2 7694/s        309%        272%        52%           --
+      -13%
dragonchild1 8835/s        370%        327%        75%          15%
+        --

I'm not sure what all those numbers really mean, but there they are. :-)

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: Why is goto &sub slow?
by biosysadmin (Deacon) on Mar 29, 2004 at 04:06 UTC

In the spirit of TIMTOWTDI, here's another option for people who are interested:

I assume you realize that you listed a factorial table and not a fibonacci table? :)

Cheers,
Ovid

New address of my CGI Course.

10+9+8+7+6+5+4+3+2+1 doesn't equal 2721600?

Only if you are using base 10 math and boring Euclidean geometry. Think outside the box!

Re: Why is goto &sub slow?
by zentara (Archbishop) on Mar 29, 2004 at 17:40 UTC
Well it may be getting a little off-topic for your question, but how about how parrot and perl6 will do it?
```#factorial.pasm
print "The first 15 factorials are:\n"
set I1, 0
set I2, 15
set I3, 1
REDO:  inc I1
mul I3, I3, I1
print I3
print "\n"
lt I1, I2, REDO
DONE:  end

To run it,
```parrot -o factorial.pbc factorial.pasm
then
parrot factorial.pbc
The whole parrot assembly register setup is so cool, I think alot of Perl6 will be "Inline::Pasm", :-)

I'm not really a human, but I play one on earth. flash japh
Re: Why is goto &sub slow?
by bunnyman (Hermit) on Mar 29, 2004 at 16:48 UTC

Create A New User
Node Status?
node history
Node Type: perlquestion [id://340465]
Approved by phydeauxarff
Front-paged by Enlil
help
Chatterbox?
 Discipulus SharePoint will insinuate in my \$workspace I think i'll need alot of patience.. [erix]: about perfide albion [erix]: I guess that should be Perfidious Albion Discipulus must prepare some ammunitions: XML::Compile::SOAP but what the heck!

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2017-06-27 09:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
How many monitors do you use while coding?

Results (601 votes). Check out past polls.