http://www.perlmonks.org?node_id=885839


in reply to [OT]: threading recursive subroutines.

Good question! I've been working on it to see what the real issue is. I don't have an answer for you; I'm just documenting what I've found for everyone's benefit.

First, I presume that "inherently recursive" means "cannot be rewritten to be tail-recursive". Fibonacci is an example of an inherently recursive algorithm.

sub fibonacci { my ($n) = @_; return 0 if $n == 0; return 1 if $n == 1; return sum fibonacci($n-2), fibonacci($n-1); }

At face value, it's quite easy to make it threaded:

sub fibonacci { my ($n) = @_; return 0 if $n == 0; return 1 if $n == 1; return sum map $_->join(), async { fibonacci($n-2) }, async { fibonacci($n-1) }; }

But what if you wanted to limit the number of threads (perhaps to limit overhead)? One solution would be to have a pool of threads and to use them if available.

sub fibonacci { my ($n) = @_; return 0 if $n == 0; return 1 if $n == 1; return sum map $_->get(), async_maybe { fibonacci($n-2) }, async_maybe { fibonacci($n-1) }; }

(The above assumes closures can be shared, but it can be rewritten to not use closures.)

When implementing async_maybe (and the get of the object it returns), one must be extra careful to avoid the situation where a thread is waiting to have it's result collected.

But what if you want a worker model (perhaps to distribute the work to other machines)? Now, that's hard. One would need some kind of a callback system.

sub fibonacci { my ($n) = @_; my $result; process_and_wait(sub { fibonacci_task(sub { $result = $_[0]; }, $n ); }); return $result; } sub fibonacci_task { my ($on_complete, $n) = @_; return $on_complete->(0) if $n == 0; return $on_complete->(1) if $n == 1; my ($x,$y); process(sub { fibonacci_task(sub { $x = $_[0] }, $n-2) }); process(sub { fibonacci_task(sub { $y = $_[0] }, $n-1) }); #TODO: The last of the two tasks to complete # must call $on_complete->($x+$y). }

The TODO item is hard to do cleanly. And of course, I'm still using closures even though I think that's not an option for threads threads. Rewriting to avoid closures will likely make the code even longer.

Replies are listed 'Best First'.
Re^2: [OT]: threading recursive subroutines.
by BrowserUk (Patriarch) on Feb 03, 2011 at 03:58 UTC

    Whilst I think that your discussion of the problems is correct, I think your choice of algorithm and your code for demonstrating the problems is problematic. The latter in part because you are using Perl, which I specifically excluded; in part because you then use Coro which IMO clouds the real issues and adds more of its own.

    Firstly, Fibonacci is a poor example because it is simply so easy to achieve better performance through mechanisms other than threading. As the only purpose of threading in this context is improving performance, using them for Fibonacci is never going to work effectively.

    Whilst the recursive Fibonacci algorithm doesn't lend itself to tail call elimination, it is easily written iteratively for an huge performance gain. Equally, the recursive algorithm can be sped up immensely by simply memoising it.

    But more importantly, the iteration/recursion can easily be done away with entirely using Binet's formula:

    All of these modifications make it possible to calculate Fibonacci( 1474 ) (the limit of a doubles ability to store the result), more quickly than it is possible to spawn a thread. At least using the current implementation of iThreads. And possibly Coro also.

    I realise that you chose it only as a well-known example on which to base your discussion. But if you are going to explore the threading of recursion in Perl, there are simpler ways than your implementations. This does the job:

    #! perl -slw use strict; use futures; sub fibonacci { my $n = shift; return $n if $n < 2; my $r1 = futures->new( \&fib_t1, $n -2 ); my $r2 = futures->new( \&fib_t1, $n -1 ); return $r1 + $r2; } sub fib_t1 { my $n = shift; return $n if $n < 2; my $r1 = futures->new( \&fib_t2, $n -2 ); my $r2 = futures->new( \&fib_t2, $n -1 ); return $r1 + $r2; } sub fib_t2 { my $n = shift; return $n if $n < 2; return fib_t2( $n -2 ) + fib_t2( $n -1 ); } printf "$_ : %d\n", fibonacci( $_ ) for 1 .. $ARGV[ 0 ]; __END__ C:\test>885839-t 20 1 : 1 2 : 1 3 : 2 4 : 3 5 : 5 6 : 8 7 : 13 8 : 21 9 : 34 10 : 55 11 : 89 12 : 144 13 : 233 14 : 377 15 : 610 16 : 987 17 : 1597 18 : 2584 19 : 4181 20 : 6765

    Where the futures module is:

    package futures; use threads; sub TIESCALAR { my( $class, $thread ) = @_; return bless [ $thread ], $class; } sub STORE { die "Futures cannot be assigned to"; } sub FETCH { my $self = shift; my $r = $self->[0]->join; return $r; } sub new { my( $class, $code, @args ) = @_; my $thread = &async( $code, @args ); tie my $future, 'futures', $thread; return $future; } 1;

    Using three subs serves to highlight the level dependency, but these could be merged into one sub with conditional statements provided there was some way to determine the number of existing threads.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      One idea that I find interesting in practice (but not from an algorithmical point of view) is Grand Central Dispatch, basically the idea of having a worker pool of threads (likely about 2x cores) that get managed by the OS. This approach seems to take the burden of thread administration away from the program and stuffs it into a centralized infrastructure that makes sure that no more than a set limit of threads is running. It seems that Grand Central Dispatch suffers from different-but-still-ugly locking problems, but that's expected.

      I think that the idea of "futures" has merit - AnyEvent uses something like them in its AnyEvent->condvar mechanism. AnyEvent has at least some way of detecting deadlocks, but as it is basically single-threaded, detecting deadlocks is easier there I presume. Having transparent "futures" in Perl is feasible through trickery with transparent proxy objects like you present or Win32::OLE and MozRepl::RemoteObject implement in a more general fashion. I think that the performance hit you incur by heavily using tied variables will again negate many of the speed gains you get. I'm also not sure whether having transparent futures will actually help much, because you still need to make really sure that you don't immediately fetch the results after starting a calculation.

      If there is a way to detect the dependency chain of futures ahead of time, implementing a thread pool together with the futures could limit the amount of threads spawned with your implementation. But I'm not sure whether that's possible as each future could spawn another future it depends on and thus quickly overrun each fixed limit. But maybe using Coro together with threads could allow switching the context without starting a fresh thread once we run out of CPU threads. But mixing Coro and threads is something that's very unlikely to work...

      One of my "easier" thought experiments is the idea of switching Perl to asynchronous IO. I think this is somewhat easier to analyze, as IO iterators are used more often than threads are used, and their use does not require deep understanding and (dead-)locking operations are unnecessary. For example, a "lazy file" could work with your/a "future" implementation by pretending to have read a line from a file, until somebody looks at the value:

      open_async my $fh, '/etc/passwd' or die "Couldn't read asynchronously from '/etc/passwd': $!"; while (<$fh>) { # $_ is a tied variable that will be filled in a background thread # or through asynchronous IO next unless /root/; };

      Of course, most (existing) Perl code will simply not benefit from asynchronous IO, as most code will read a line from a file and immediately inspect each value. This will simply negate all benefits we gain from freeing up Perl from waiting for the line to be fetched from the file. Maybe we could gain something by requesting more than the one line (or more than the one buffer) to be read from the file, but that will likely get problematic if we try to use tell and other methods for looking at files.

      My conclusion is that implicit and transparent parallelism will likely simply not work because the existing procedural programs will not make use of it. So in any case, specialization from the straightforward approach towards a more parallel approach is necessary (for Perl programs). The likely "easy" solution are callback-oriented or continuation-passing models using closures like AnyEvent or Node.js. There you start the asynchronous call and give it a callback to execute once the operation completes.

        I did say up front that my interest in this is not directly related to Perl. Leastwise not as it stands with Perl 5 and iThreads. The overheads of the Perl5 function calls (including ties), combined with those of iThreads, make this a non-starter as a realistically usable bolt-on to Perl 5.

        However, I do have notions and bits of code for a 64-bit only interpreter that demonstrates highly efficient function & method call performance that would, if I ever get around to implementing it, allow for transparent parallisation.

        The concept is not dissimilar to GCD or the Erlang SMP VM.

        In effect, every function/method call is actually queued (with is arguments as closures), to a central queue, rather than executed immediately, and immediately returns a future.

        One (or more) interpreters per core are instantiated at start-up, and they loop over the central queue executing the coderefs with closures in turn as they come off the queue.

        The futures contain a monotonically increasing 'sequence number'. The queue is effectively prioritised according to these sequence numbers.

        Everything--numbers, strings, code-blocks (functions and methods, but also the bodies of if statements etc.)--are object references. Objects carry being-read and being-written flags.

        Any method requiring write access to an object will be requeued if that object is being read or written. Any method requiring read access will be requeued if the object is being written.

        The problem child is currently recursion.

        There is much that is yet to be explored.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      But if you are going to explore the threading of recursion in Perl, there are simpler ways than your implementations. This does the job:

      No, that does not limit the number of workers.

      Equally, the recursive algorithm can be sped up immensely by simply memoising it.

      It doesn't seem to be a technique you can use for distributed work, which is where I thought you were going with this.

      All of these modifications make it possible to calculate Fibonacci( 1474 )

      The point was to find a technique. Or are you saying the technique will never be as efficient as alternatives?

      Anyway, I saw your post about the Ackermann function, and that's a whole different ballgame. I spent a lot of time tinkering with that too after you mentioned it.

      As far as I know, one can't make y = f(x); z = f(y); parallel without refactoring by a human, yet the Ackermann function is of that form.

      That said, it does have interesting traits that may make parallisation possible (or impossible).

      • If you need calculate A(m,n), you know you will also need to calculate A(m,i) for 0<i<n.
      • For m>0 and n>0, A(m,n) is a series of n+1 evaluations of A(m-1,i) with increasing i.
      • ...

      But I'm in way over my head.

        The point was to find a technique. Or are you saying the technique will never be as efficient as alternatives?

        My point was that using memoisation, iteration or direct calculation, calculating fibonacci to the limits of the machines ability to represent the result is practical, whereas the purely recursive routine would run for an inordinate amount of time.

        I'm also saying that as you can directly calculate all the representable values of Fib(n) in just over a millisecond:

        #! perl -slw use strict; use Data::Dump qw[ pp ]; use 5.010; use Time::HiRes qw[ time ]; use constant { GR => 1.6180339887498948482, ROOT5 => sqrt( 5 ), }; sub fibCalc { int( GR ** $_[0] / ROOT5 + 0.5 ); } my $start = time; fibCalc( $_ ) for 1 .. 1474; printf "calulating fibonacci( n ) for n := 1 to 1474 takes: %f seconds +\n", time() - $start; __END__ C:\test>885839 1000 calulating fibonacci( n ) for n := 1 to 1474 takes: 0.001122 seconds

        But doing just the first 30 recursively takes 49 seconds (from the post above):

        recursive: Got 24157816; took 49.278000 seconds

        you'd need 50,000 threads and perfect efficiency gains to allow a parallelised recursive routine to calculate those first 30 in the same time that the direct calculation does the first 1474.

        Basically, I don't believe you could ever match the efficiency of teh direct calculation no matter if you could throw millions of threads at the problem.

        As far as I know, one can't make ...parallel without refactoring by a human, yet the Ackermann function is of that form.

        That said, it does have traits that may make parallisation possible. For example, if you need calculate A(m,n), you know you will also need to calculate A(m,i) for 0<i<n.

        But in I'm way over my head.

        Actually, I think that you are pretty much exactly the same point as I am.

        You see that there is some scope for parallelising Ackermann, but you don't see how to do it.

        There are obviously great tranches of intermediate results that need to be calculated. For Ackermann( 3, 15 ) there are 655365 intermediate results. It seems inconceivable (to me) that there isn't scope for splitting this work across my 4 cores.

        Of course, you can memoise Ackermann and it does effect a good speed-up, but the memory used gets huge very quickly, which makes it impractical as a generic solution. This is true for many similar algorithms.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        I just noticed this, I don't think it was a part of the post when I originally saw it. We must have overlapped.

        No, that does not limit the number of workers.

        Actually, the futures code does limit the number of workers. In the case of the posted sample, to 6 threads (plus the original):

        #! perl -slw use strict; use futures; sub fibonacci { my $n = shift; return $n if $n < 2; my $r1 = futures->new( \&fib_t1, $n -2 ); #1 my $r2 = futures->new( \&fib_t1, $n -1 ); #2 return $r1 + $r2; } sub fib_t1 { my $n = shift; return $n if $n < 2; my $r1 = futures->new( \&fib_t2, $n -2 ); #3 & #5 my $r2 = futures->new( \&fib_t2, $n -1 ); #4 & #6 return $r1 + $r2; } sub fib_t2 { my $n = shift; return $n if $n < 2; return fib_t2( $n -2 ) + fib_t2( $n -1 ); }

        It does it through the expedient of using 3 levels of sub call. The first two spawn threads, but the third, fib_t2(), uses straight recursion. This was done manually, and 3 levels were chosen so as to have 4 threads running for the majority of the time, to match my 4-cores.

        But it is not hard to see how this could be done automatically at either compile-time or run-time.

        There is the possibility of avoiding the two essentially redundant spawns at the first level, by having the second level return a future that closes over the addition of the two futures it creates.

        Something like:

        #! perl -slw use strict; use futures; sub fibonacci { my $n = shift; return $n if $n < 2; return fib_t1( $n -2 ) + fib_t1( $n -1 ); } sub fib_t1 { my $n = shift; return $n if $n < 2; return futures->new( sub{ futures->new( \&fib_t2, $n-2 ) + futures->new( \&fib_t2, +$n-1 ) } ); } sub fib_t2 { my $n = shift; return $n if $n < 2; return fib_t2( $n -2 ) + fib_t2( $n -1 ); }

        In this way, only 4 threads would be created and the original thread would wait for their completion.

        And by eliminating the temporary variables, it possible to see how the forms of the three levels of subroutine actually closely mirror each other.

        The next step is to move the decision as to which of the forms to use for any given level under the covers of the futures module. At that point, the user constructs a single recursive routine in the familiar way, but using futures for each recursive call, and the module keeps track of how many threads it is spawning.

        Ultimately, rather than having to use the futures constructor, the goal would be to simply mark the subroutine with an attribute and have it all taken care of transparently. Something like:

        sub fibonacci :future { my $n = shift; return $n if $n < 2; return fibonacci( $n -2 ) + fibonacci( $n -1 ); }

        Now, everything that was done manually in the second code block above should be feasible without further application programmer intervention.

        As already discussed, it wouldn't benefit such a trivial function, but it could for others.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: [OT]: threading recursive subroutines.
by ikegami (Patriarch) on Feb 02, 2011 at 21:21 UTC

    Turned out that using closures, while providing a simple design, actually caused the problem. The design below still needs work because it passes closures around, but it works in concept as shown by the Coro implementation below. (It probably defies the point to use Coro, but it's easier to prototype using it.)

    use strict; use warnings; use Engine qw( ); use constant NUM_WORKERS => 4; sub fibonacci_task { my ($engine, $on_complete, $n) = @_; return $on_complete->(0) if $n == 0; return $on_complete->(1) if $n == 1; my ($x,$y); $engine->process_group( sub { $on_complete->($x+$y) }, [ \&fibonacci_task => (sub { $x = $_[0] }, $n-2) ], [ \&fibonacci_task => (sub { $y = $_[0] }, $n-1) ], ); } sub fibonacci { my ($engine, $n) = @_; my $result; $engine->process_and_wait( \&fibonacci_task => ( sub { $result = $_[0] }, $n ) ); return $result; } { my $engine = Engine->new(NUM_WORKERS); printf("%s! = %s\n", $_, fibonacci($engine, $_)) for 1..10; }
    1! = 1 2! = 1 3! = 2 4! = 3 5! = 5 6! = 8 7! = 13 8! = 21 9! = 34 10! = 55

    The following belongs at the top of the above script: