http://www.perlmonks.org?node_id=424933

A while back, I left a node about concurrency in Perl. I ended that meditation with a question about how to improve concurrency in Perl.

This node recommends a pragma which gives certain guarantees to perl (note lowercase) that should allow implicit parallelization with minimal modification to existing code.

Implicit Parallelization in Functional Languages

Here's a quick-and-dirty introduction to functional languages, and how they can be parallelized by the implementation without modifying any code: Imagine a LISP function that takes an argument x and returns the result of 2 * x + 1. We can implement this as:

(defun func1 (x) (+ 1 (* 2 x)))

When you call this function, the implementation evaluates the innermost expression first, each expression being reduced until the final result is computed:

(func1 3)      ; Function call
(+ 1 (* 2 3))  ; All occurrences of 'x' replaced with '3'
(+ 1 6)        ; Reduce
7              ; Reduce

With me so far? This should be familiar to anyone who has worked through the first chapter of SICP.

Now we define another function that takes an argument x and computes (2 * x) + (3 * x) + 1:

(defun func2 (x) (+ 1 (* 2 x) (* 3 x)))

And then we call it, with the implementation replacing the variable as before:

(func2 3)              ; Function call
(+ 1 (* 2 3) (* 3 3))  ; All occurrences of 'x' replaced with '3'

Stop right there. Do we evaluate (* 2 3) or (* 3 3) first? It doesn't matter, as long as the expression contains no side effects. One interesting effect of this is that the implementation can choose to execute these expressions in parallel without any additional help from the programmer.

(Of course, in this simple example, practical issues like communication time between CPUs are likely to make the parallel execution take longer. However, the same concepts can be applied to much more complex expressions.)

Implementing in Perl

Now consider this Perl map statement, which takes an array of strings and returns the length of each of those strings:

my @str_lengths = map { length } @strs;

This statement can be safely parallelized: each execution of the block is independent of the others. The implementation could parallelize this statement on its own, effectively cutting the running time from n steps to roughly n/p, where p is the number of concurrent threads doing the processing.
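To make that concrete, here is a rough sketch of what perl would effectively have to do behind the scenes, written out by hand with the standard threads module (the worker count and the chunking scheme are arbitrary choices for illustration, not part of the proposal):

use strict;
use warnings;
use threads;
use POSIX qw(ceil);

my @strs = qw(alpha beta gamma delta epsilon);

my $p    = 2;                    # number of worker threads (arbitrary)
my $size = ceil(@strs / $p);     # elements per contiguous chunk

# Split the list into chunks and hand each chunk to its own thread.
my @work = @strs;
my @threads;
while (my @chunk = splice @work, 0, $size) {
    push @threads, threads->create(sub { map { length } @chunk });
}

# Joining the threads in creation order keeps results in input order.
my @str_lengths = map { $_->join } @threads;
print "@str_lengths\n";          # 5 4 5 5 7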

The problem is that Perl provides no way to guarantee that a given piece of code has no side effects. If perl tries to parallelize a statement that has side effects, then the result is non-deterministic. Programmers don't like non-determinism.

Thus, we need a way to tell perl that we guarantee that there are no side effects in an expression (and it's on our head if we break that promise). That's where the threadsafe pragma would come in. This pragma would be lexically scoped:

my @str_lengths = do { use threadsafe; map { length } @strs; };

This means the programmer has fine-grained control over which constructs are marked safe to parallelize. This is important so that safe and unsafe expressions can coexist in the same source code, and is also useful when there is no practical benefit to parallelization (as in the LISP example above).
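As a purely hypothetical usage example (threadsafe is the pragma proposed here, not an existing module, and summarize and the record structure are made up for illustration), safe and unsafe code could then sit side by side in one subroutine:

sub summarize {
    my @records = @_;

    # Pure per-record computation: marked safe to run in parallel.
    my @sizes = do {
        use threadsafe;                      # hypothetical pragma
        map { length $_->{body} } @records;
    };

    # Side-effecting work: left unmarked, so it stays serial.
    open my $log, '>>', 'summary.log' or die "summary.log: $!";
    print {$log} scalar(@sizes), " records summarized\n";
    close $log;

    return @sizes;
}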

Multi-core processors are becoming increasingly popular. Within a few years, even $300 computers at Best Buy will likely have such a beast. Languages and programmers must start equipping themselves now to make use of these systems.

"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: RFC: Implicit Parallelization Pragma
by Anonymous Monk on Jan 25, 2005 at 17:53 UTC
    The problem is that Perl provides no way to guarantee that a given piece of code has no side effects.
    Indeed. And in Perl, this is very, very hard. How hard? Well, your own example fails the test. Even the map {length} @array may cause a side-effect.
    use Devel::Peek;

    my @arr = ("foo", 3);
    Dump($arr[-1]);
    my @bogus = map {length} @arr;
    Dump($arr[-1]);

    __END__
    SV = IV(0x8192bcc) at 0x8184234
      REFCNT = 1
      FLAGS = (IOK,pIOK)
      IV = 3
    SV = PVIV(0x8184888) at 0x8184234
      REFCNT = 1
      FLAGS = (IOK,POK,pIOK,pPOK)
      IV = 3
      PV = 0x8189cf8 "3"\0
      CUR = 1
      LEN = 2
    Whoopsie. One of the elements of @arr changed, just by looking at it!

    This is the same reason why threads in Perl take quite a lot of overhead, and that by default variables are copied (and not shared) between threads.

      All true, but in the spirit of DWIMery, this is actually a benign and irrelevant side effect. The creation of a PV from a number is something that will simply happen again automatically later. And thus I think hardburn's code was correct. There are no real side effects, in the sense that nothing happens that cannot be recreated later, identically and automatically. For example, if the "use threadsafe" was in code that called a Memoize'd function, there may still be no real side effect. And only the caller can tell that, not perl.

      my %cache;

      sub get_from_cache {
          my $var = shift;
          unless (exists $cache{$var}) {
              $cache{$var} = __PACKAGE__->new($var);
          }
          $cache{$var};
      }

      my @foos = do {
          use threadsafe;
          map { get_from_cache($_)->foo() } @list;
      };

      Assume the object has no non-local side effects: it may read from a data store (a database or the disk) but does not write to it, the data store is considered locked against writes, or it merely calculates some values. If removing it from the cache and re-creating it would yield precisely the same object (in particular, the same foo()), then there are no side effects that matter, and the code would be threadsafe in practice, even if it isn't threadsafe from a pure computer-science perspective.

      I have exactly this situation in much of my code at work. I would love to see automatic parallelisation of some of what I do - as it is, I'm forced to fork() on unix, and not use threads at all on Windows, just to get consistent, supported speed boosts where I can. Yes, I lose whatever modifications child processes make to my objects - but since they are all cached and completely reproducible, there is no real side effect from the child code, except for what it writes to disk.

      Of course, in order for this to work, we need a way to signify a subset of the code as serial - for example, when writing to a common logfile. Ok, I'll rephrase this: for it to work, I need that serialisation :-)
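      For what it's worth, a minimal sketch of such a serialised section can already be written with today's threads::shared lock, no new syntax required (run.log and the worker bodies here are made-up examples, not code from the post above):

      use strict;
      use warnings;
      use threads;
      use threads::shared;

      # One shared variable serves as the lock for the logfile; any code
      # that wraps its writes in lock($log_lock) runs serially, even if
      # the surrounding work has been parallelised.
      my $log_lock :shared;

      sub log_line {
          my ($msg) = @_;
          lock($log_lock);                   # released at end of scope
          open my $fh, '>>', 'run.log' or die "run.log: $!";
          print {$fh} '[thread ', threads->tid, "] $msg\n";
          close $fh;
      }

      my @workers = map {
          my $n = $_;
          threads->create(sub { log_line("worker $n done") });
      } 1 .. 3;
      $_->join for @workers;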

      *sigh* We can always hope for Perl6 . . .

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        That's the main reason Perl 6 has hyperoperators, and to a lesser extent the junctional operators. A pragma simply isn't specific enough. In the case of hyperoperators and junctions, we've said that it's erroneous to depend on any order of evaluation, which is a slightly weaker constraint than requiring no side effects. There can be side effects, as long as each element's side effects are independent of the other elements. Actually, even that's slightly overstated--the effects just need to be idempotent. Each branch could independently do $counter++ and it doesn't matter what order they happen in, at least for hyperops.

        The main practical difference between hyperops and junctions in this regard is that hyperops are guaranteed to run to completion, whereas junctions can short circuit whenever they jolly well please and in any order. So I guess the effective constraints on junctional operators are a bit tighter than on hyperops. Generally speaking junctional operators should have no side effects, because you won't know how many of them will happen.

        The autothreading of array indices in S9 is also intended to allow parallel execution where possible. The trend toward vector-capable processors has been evident for some time now, and we want Perl 6 to map naturally to those. Even if the Cell architecture doesn't take off, we all have rather powerful GPUs in our machines these days...

      Indeed. And in Perl, this is very, very hard. How hard? Well, your own example fails the test. Even the map {length} @array may cause a side-effect.

      Here's another example:

      use warnings;
      use strict;

      {
          package Tmp;
          use overload '""' => \&str, fallback => 1;

          sub c   { bless [$_[1]], $_[0] }
          sub str { print "Hello, $_[0][0]\n"; "." x rand 10 }
      }

      my @strs        = map Tmp->c($_), qw[foo bar baz];
      my @str_lengths = map { length } @strs;

      print "$_\n" for @str_lengths;

      As you can see, this time each call to length prints something to STDOUT and calls rand, which modifies global state. This is another reason why perl can't optimize such a loop in the general case: it's very hard to prove that @strs isn't tied and doesn't contain objects with overloaded stringification.
Re: RFC: Implicit Parallelization Pragma
by Limbic~Region (Chancellor) on Jan 25, 2005 at 17:43 UTC
    hardburn,
    You didn't mention it, so I will ask. Does your proposed threadsafe pragma make any guarantees concerning the finishing order? In your example
    my @str_lengths = map { length } @strs;
    There is no reason not to get the lengths in parallel, but there certainly might be a reason to want or expect the results to come out in the same order they went in (FIFO).

    Cheers - L~R

      Certainly it would have to guarantee ordered results, or it would be very limited in usefulness.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        hardburn,
        Might I suggest you use a different example, then? I didn't think you were proposing FIFO, but the example seemed that way to me. Additionally, I think having a second, more constrained version that did guarantee order would still be useful. Imagine the following project:
        • Get the length of all words in the list
        • Look each word up in a dictionary
        • Count the number of syllables in a word
        Now, for the sake of argument, suppose looking each word up in the dictionary takes twice as long as getting the length of the word and counting its syllables combined. If these three tasks were done in parallel, the whole thing should take only roughly as long as the dictionary lookups alone. On the other hand, if the order isn't guaranteed, the results that come out might not be very useful.

        In other words, the elapsed time comes down to roughly that of the longest pole in the tent. Some time is also spent ordering the results as they become available. Of course, I may be way out in left field here.
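        For what it's worth, here is a rough sketch of that ordered, per-word parallel processing using the existing threads module. The dict_lookup and count_syllables routines are stand-in stubs for the slower tasks described above, not real modules:

        use strict;
        use warnings;
        use threads;

        # Stub helpers; the sleep stands in for a slow dictionary lookup.
        sub dict_lookup     { my ($w) = @_; sleep 1; "definition of $w" }
        sub count_syllables { my ($w) = @_; scalar(() = $w =~ /[aeiouy]+/gi) }

        my @words = qw(apple banana cherry);

        # One thread per word; joining the threads in spawn order keeps
        # the results in the same (FIFO) order as the input.
        my @threads = map {
            my $word = $_;
            threads->create(sub {
                return {
                    word => $word,
                    len  => length($word),
                    def  => dict_lookup($word),
                    syl  => count_syllables($word),
                };
            });
        } @words;

        for my $thr (@threads) {
            my ($r) = $thr->join;
            printf "%-8s len=%d syl=%d  %s\n",
                $r->{word}, $r->{len}, $r->{syl}, $r->{def};
        }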

        Cheers - L~R

Re: RFC: Implicit Parallelization Pragma
by Jenda (Abbot) on Jan 25, 2005 at 18:35 UTC

    The pragma is supposed to be lexical? What would

    @foo = map {use threadsafe; $_ = foo($_)} @bar;
    mean, then? Is foo() to be treated as threadsafe or not? And can the block be parallelized?

    Let's assume first that {use threadsafe; ...} means that the contents of this block may be parallelized, no matter what subs you call within it. The programmer would then have to be very, very careful, and it would take ages to validate each and every subroutine that might get called. And the bigger the task (and thus the better the candidate for parallel execution), the harder it would be to validate. So I doubt people would use this often, if at all.

    What if the use threadsafe; meant only that the code in the block itself is threadsafe, saying nothing about the subroutines it calls? The interpreter would then be allowed to parallelize the execution only if all the subroutines used in the block are also marked threadsafe. The problem then is ... who's going to mark them? And if I mark a subroutine threadsafe, does that mean I've also claimed that the subroutines it calls are threadsafe? What if the implementation of a subroutine I use from a module changes and it no longer is? Or does it mean that the compiler should check the threadsafety of the subroutines used within the one it just compiled, and either die, or warn and ignore the mark, if it uses a subroutine that's not threadsafe? The second is of course much safer, but requires changes in other people's modules!

    Jenda
    We'd like to help you learn to help yourself
    Look around you, all you see are sympathetic eyes
    Stroll around the grounds until you feel at home
       -- P. Simon in Mrs. Robinson

      Let's assume first that the {use threadsafe; ...} means that the contents of this block may be parallelized.

      This is what I intended.

      The programmer would then have to be very, very careful, and it would take ages to validate each and every subroutine that might get called. And the bigger the task (and thus the better the candidate for parallel execution), the harder it would be to validate. So I doubt people would use this often, if at all.

      If you want any concurrency without this, then you need to use fork() or threads. In that case, you still need to verify that your code doesn't have concurrency issues. A use threadsafe around important portions would take no more work than existing solutions, and is likely to take less since only code explicitly inside a threadsafe block would need to be checked.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: RFC: Implicit Parallelization Pragma
by Anonymous Monk on Jan 25, 2005 at 17:38 UTC

    I would see the benefit of this fine-grained multi-threading only if we had a massively parallelised CPU architecture as well (e.g. one of those funky asynchronous chips with many, many pipelines). Not only that, the OS has to allow the individual process itself to manage its own threads in a very, very lightweight way.

    Right now, however, even on multi-core processors, the OS still has to do a fair bit of work to establish and tear down a thread. Unless perl maintains a pool of threads that can be near-instantly reused for tiny tiny tasks, this is not too useful as a pragma.
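    For the sake of discussion, here is a sketch of the kind of reusable worker pool that would be needed, built on the existing Thread::Queue module (the pool size and the "tiny task" are arbitrary, and note that result order is not preserved here):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $work    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    # A small pool of long-lived workers: the thread-creation cost is
    # paid once, and each worker then loops over many tiny tasks.
    my @pool = map {
        threads->create(sub {
            while (defined(my $item = $work->dequeue)) {
                $results->enqueue(length $item);   # the "tiny task"
            }
        });
    } 1 .. 4;

    $work->enqueue($_) for qw(foo quux antidisestablishmentarianism);
    $work->enqueue(undef) for @pool;               # one shutdown marker per worker
    $_->join for @pool;

    print $results->dequeue, "\n" while $results->pending;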

    Nonetheless, functional programming is a tasty tasty concept that I wish had more real-life applicability in my work.

      ...the OS has to allow the individual process itself to manage its own threads in a very, very lightweight way.

      Win32 already has this ability, they call them fibers.

      As well as being ideal for creating pools of "units of execution", they also lend themselves to various other useful programming techniques.

      For example, you can make any two pieces of code act as coroutines almost trivially.

      What is lacking currently is a suitable Perlish API that would allow application programmers to make use of them. It might be possible to layer a Win32-only API on top of threads, but it would be so much better if, in typical perlish fashion, Perl provided an API for threads that abstracted the underlying implementation of any particular flavour and was flexible enough to allow extension for things like fibers on those platforms that support them.

      That would be better than the current model, which follows one platform's outdated and restrictive model to the letter, and attempts to force-fit that model onto all other platforms and implementations.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        Win32 already has this ability, they call them fibers.

        Fibers will not help with parallelization. They are essentially a fancy incarnation of C's setjmp/longjmp functions. An application can change its execution context with fibers, but it cannot schedule fibers in parallel on multiple CPUs.

      These things are coming. The marketdroids from IBM/Intel/AMD/etc. are throwing around numbers like 512-core CPUs. Whether that will come to pass remains to be seen, but in any case, such a language feature would need to be implemented and tested now for it to be stable by the time multi-core processors are common.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: RFC: Implicit Parallelization Pragma: use parallizable
by diotalevi (Canon) on Jan 26, 2005 at 19:01 UTC

    I'd prefer you named your parallelization pragma parallelizable or some such. I have similar interests in this flag, except mine would be for explicitly altering the execution order of blocks, so that pipelines like grep { ... } map { ... } ... can keep a low memory profile instead of having to fully compute each section before passing the results on to the next operation.

    Using the parallelizable pragma, I could transform the following code. The benefit is that only the minimum amount of intermediate results is present at any given moment, rather than the maximum.

    {
        use parallelizable;
        @ary = map  { ... }   # block 1
               grep { ... }   # block 2
               map  { ... }   # block 3
               LIST;
    }

    {
        for ( LIST ) {
            for ( do { block 3 } ) {
                next if not do { block 2 };
                for ( do { block 1 } ) {
                    push @ary, $_;
                }
            }
        }
    }
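    Here is a concrete, runnable version of that transformation with made-up blocks (the doubling, the divisibility test, and the increment are arbitrary examples). It shows that the streamed form keeps only one element's intermediate value live at a time, instead of two full temporary lists:

    use strict;
    use warnings;

    my @list = 1 .. 10;

    # Ordinary pipeline: each stage builds a complete temporary list.
    my @ary = map  { $_ * 2 }        # block 1
              grep { $_ % 3 == 0 }   # block 2
              map  { $_ + 1 }        # block 3
              @list;

    # Fused, streaming equivalent: only one element's intermediate
    # value exists at any given moment.
    my @ary2;
    for my $x (@list) {
        my $a = $x + 1;              # block 3
        next unless $a % 3 == 0;     # block 2
        push @ary2, $a * 2;          # block 1
    }

    print "@ary\n@ary2\n";           # both print: 6 12 18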