in reply to Parrot, threads & fears for the future.

You lost me at, The future is threaded.

High performance work tends to move parallel into clusters, and it has been known for years now that forked processes are easier to scale for clusters than multi-threaded processes. The reasons have nothing to do with Unix versus Windows, and everything to do with minimizing necessary interactions between parallel jobs. (And moreover, letting the OS know that interactions will be minimized.)

Naive parallelism also has a great future. Think "webserver". Performance is just fine with unthreaded code, it is easier to manage development, and you can get as much parallelism as you need by running lots of concurrent processes.
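A minimal sketch of that kind of naive process parallelism, using only `fork` and `waitpid`. The `run_worker` and `run_in_parallel` names and the toy jobs are hypothetical, made up for illustration; a real server would accept connections or pull from a queue instead.

```perl
use strict;
use warnings;

# Hypothetical worker: handles one job and returns an exit status.
sub run_worker {
    my ($job) = @_;
    # Real code would serve requests here; this only validates the job.
    return $job > 0 ? 0 : 1;
}

# Fork one child per job and return a map of job => exit status.
sub run_in_parallel {
    my (@jobs) = @_;
    my %job_for_pid;
    for my $job (@jobs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            exit run_worker($job);    # child: shares nothing with its siblings
        }
        $job_for_pid{$pid} = $job;    # parent: remember which job this pid runs
    }
    my %status;
    for my $pid (keys %job_for_pid) {
        waitpid($pid, 0);
        $status{ $job_for_pid{$pid} } = $? >> 8;
    }
    return %status;
}

my %status = run_in_parallel(1 .. 4);
printf "job %d exited with %d\n", $_, $status{$_}
    for sort { $a <=> $b } keys %status;
```

Each child gets its own copy of the interpreter state, so nothing needs locking; the only coordination is the parent collecting exit statuses.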

I'm sure that there is also a great future for threaded programs. However my current take is that that future will tend to be either very specialized code, or else for native GUI applications.

Now if Perl 6 wants to be all things to all people, it probably should include support for threaded programming. But even if it has great support for that (and Audrey's implementation may), I'd be willing to bet that the multi-threaded part of Perl 6 will not be used in most Perl.

Re^2: Parrot, threads & fears for the future.
by chromatic (Archbishop) on Oct 23, 2006 at 18:35 UTC
    You lost me at, The future is threaded.

    Indeed; there's a reason the shared-nothing architecture tends to beat shared-everything in high-volume web serving, for example. Now the future may be parallel, but threaded?

Re^2: Parrot, threads & fears for the future.
by BrowserUk (Pope) on Oct 23, 2006 at 19:09 UTC

    Clusters are an expensive and complex, workaday solution to the memory limits imposed by 32-bit processors and 32-bit processes.

    64-bit processors are (theoretically) capable of addressing 16 million Terabytes (2^64 bytes, or 16 Exabytes), and processes already routinely have 4 and 16 Terabyte address spaces. Add to that multiple cores and multiple array processors in a single core and you have the potential to do away with the latency, bandwidth restrictions and topology bottlenecks of networked clusters.

    Not to mention the need to partition datasets into many files and constantly shuffle data on and off disk, and between machines.

    Once you can address entire huge datasets through the simple expedient of opening them as a memory mapped file, great chunks of the processing time simply disappear. All that is needed then to fully utilise the multiple processors is a few threads mostly processing independent sections of data (memory), but with threading's unique ability to share data and state directly without serialising it through high latency channels.

    The future is threaded.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Sorry, but that's just silly.

      It is a basic economic fact that price per performance for commodity hardware is far, far cheaper than for big servers. Clusters are a way for businesses to take advantage of this to get the performance and reliability they want at a much better price point.

      That 64-bit versus 32-bit is irrelevant can be trivially demonstrated. Big 64-bit servers are old news; the big Unix vendors went through that transition a decade ago. (I don't know when IBM's mainframes went through it, but I think it was earlier than that.) Yet in the last decade big iron not only did not replace clusters, it actually lost ground to them. Why? Because clusters are a lot cheaper.

      Now I'm not denying that big machines offer performance advantages over clusters. You have correctly identified some of those advantages. And I grant that there are plenty of problems that can only be done on a big machine. If you have one of those problems, then you absolutely must swallow the price tag and buy big iron. But if you can get away with it, you're strongly advised to get a cluster.

      Most problems do not have to run on a huge machine. Clusters are far cheaper than equivalent performance on a big machine. Neither fact seems likely to change in the foreseeable future. As long as they remain true, clusters are going to remain with us.

        This year's big iron is next year's commodity hardware. This year, commodity hardware is 32-bit, dual processor. Next year it will be 64-bit, 2 core. The year after that, 64-bit, 2 core, hyperthreaded (4 cpus). The year after that...

        I admit, each of those 'years' is really a Moore's cycle. But basically, commodity hardware has fallen in price (10 to 30%) and doubled in performance at each Moore's cycle for the last few. Speed gains through decreasing die size and upping clock speeds are hitting the limits of silicon, ion beam frequency and mask resolution. For the first time in the PC's history, the next cycle's increase in performance will come from multi-core, multi-cpu machines.

        Intel and AMD are both talking about moving to quad core processors in 2007.

        There are already 8-way motherboards available.

        Put those together and you get a 32-way cluster in a box. If each of Intel's quad cores is also hyperthreaded, that's 64 cpus in a box. What about AMD's Hypertransport and XBAR? A small, 1 Gigabit, switched network on a chip? Or IBM's Cell Architecture? If they are cheap enough to put into games machines, how long before 2 or 4 of them turn up in a PC?

        Sure, they will not be commodity priced next year, but what of the year after?

        The future is threaded ;)

      I want to pick up on the serialisation point.

      In my work, lots of tasks are of the form "read, transform, transform, transform, write" on quite large datasets. The read and write are always I/O bound, but the transforms can be cpu bound. I'd like to run the subtasks in parallel, to speed up execution, but it isn't worth forking processes for the transformations - the cost of serialisation/deserialisation is too high.

      What would help is a threading model like fork, except instead of standard i/o channels the threads/processes would be able to expose native datastructures for direct read and/or write by their siblings.

      Is there any facility like this in existence?
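      For concreteness, here is a minimal sketch of the fork-based version whose cost I'm complaining about, assuming Storable for the serialisation and a plain pipe as the channel. The `transform_in_child` helper and the toy records are made up for illustration. The input reaches the child for free via fork's copy of the parent's memory, but the result has to be frozen, piped and thawed on the way back.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Run $transform->($data) in a forked child and return its result.
# The result crosses the process boundary serialised with Storable.
sub transform_in_child {
    my ($transform, $data) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        close $reader;
        binmode $writer;
        print {$writer} freeze($transform->($data));   # serialise in child
        close $writer;
        exit 0;
    }
    close $writer;
    binmode $reader;
    my $frozen = do { local $/; <$reader> };   # slurp until the child closes
    close $reader;
    waitpid($pid, 0);
    return thaw($frozen);                      # deserialise in parent
}

my $records = [ map { +{ id => $_, value => $_ * 10 } } 1 .. 5 ];
my $doubled = transform_in_child(
    sub { [ map { +{ %$_, value => $_->{value} * 2 } } @{ $_[0] } ] },
    $records,
);
print "$_->{id}: $_->{value}\n" for @$doubled;
```

      With three transforms chained this way, every stage pays the freeze/thaw round trip on the whole dataset, which is the overhead that shared native data structures would avoid.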

        Well, that is how (real) threading works. The problem is that Perl was not designed to support threading and trying to bolt threading on to the side of Perl has been attempted for years and nothing particularly good has been produced. The prior attempts at supporting threading in Perl were way too buggy and were eventually abandoned. The latest Perl threading tries to avoid the bugs (and is less buggy but is still buggy enough in my experience that avoiding it is usually wise) by having each thread duplicate the entire Perl interpreter. This gives us a sort of "worst of both worlds" situation (no operating system protections like we have with real fork, yet even more copying, memory use, and slowness than real fork). So I don't really consider it "threading" (more like a rather bizarre specialized use of threads, not a general-purpose threading implementation) nor an appropriate tool for the vast majority of cases.

        If you had real threads, then what you describe would be completely natural. With Perl's current threading model, even shared data structures aren't simply shared (so someone said) so I doubt it would work well to try to share a large data structure using the current Perl threads.

        Another approach for something like this is to use shared memory. Both Unix and Win32 support very nice shared memory facilities. However, Perl nearly refuses to deal with memory other than what its chosen malloc() hands to it, so getting Perl to use shared memory is quite difficult and once you get it to use it, you still end up serializing between the shared memory (that Perl won't work with directly) and Perl scalars. So, again, it likely isn't an appropriate solution for your problem.
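        A minimal sketch of that serialising step, assuming a Unix system with SysV IPC and using Perl's built-in shmget/shmwrite/shmread. The sample values and the pack format are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_RMID S_IRUSR S_IWUSR);

my @values = (3, 1, 4, 1, 5);
my $packed = pack 'N*', @values;    # Perl list -> flat bytes
my $size   = length $packed;

# Create a private segment just big enough for the packed data.
my $id = shmget(IPC_PRIVATE, $size, S_IRUSR | S_IWUSR);
defined $id or die "shmget failed: $!";

shmwrite($id, $packed, 0, $size) or die "shmwrite failed: $!";

# shmread copies the bytes out into an ordinary scalar;
# Perl never operates on the segment in place.
my $buffer;
shmread($id, $buffer, 0, $size) or die "shmread failed: $!";
my @copy = unpack 'N*', $buffer;    # flat bytes -> Perl list again

shmctl($id, IPC_RMID, 0) or die "shmctl failed: $!";
print "@copy\n";
```

        The pack on the way in and the unpack on the way out are the copies in question: the segment is shared, but the Perl data structures on either side of it are not.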

        - tye        

        Unfortunately, the cost of using iThreads shared memory, required for the read and write buffers, is so high that using iThreads to do overlapped IO is impractical:

            cmpthese -1, {
                shared    => q[ my $x : shared; ++$x for 1 .. 1e6 ],
                nonshared => q[ my $x ; ++$x for 1 .. 1e6 ],
            };;
            (warning: too few iterations for a reliable count)
                      s/iter    shared nonshared
            shared      1.31        --      -89%
            nonshared  0.141      834%        --

        There are other problems also. Whilst thread == interpreter, each read and write means giving up that thread's timeslice and a task switch before the transform thread can do work. But, with interpreter == a kernel thread, when the task switch occurs, there is no guarantee (in fact, a very low probability) that the transform thread will get the next timeslice, as the round robin is over all kernel threads: those of this process and all others in the system. The upshot is that it takes at least 3 task switches to read and transform a record, and at least 3 more to write one.

        The idealised situation would be that as soon as the transform thread has got hold of the last record read, the read thread would issue the read for the next one--going straight in to the IO wait--and the transform thread would be able to continue the timeslice. You cannot arrange for that to happen using kernel threads. At least not on a single cpu processor where it would be of most benefit.

        If thread != interpreter, i.e. if more than one thread could be run within a single interpreter, then you could use cooperative (user-space/user-dispatched) threads (fibres in Win32 terms, unbound threads in Solaris terms) to achieve this.

        1. The transform thread copies the previously read record and transfers control to the read thread.
        2. The read thread issues an asyncIO request for the next record and then transfers control back to the transform thread.
        3. When the transform thread finishes with this record it gives it to the write thread; loops back and transfers control back to the read thread.
        4. The read thread then does its wait for IO completion, which normally will have already completed whilst the transform thread was running, so no wait occurs. So, it transfers control back to the transform thread, which copies the new record, and we're back to step 1.

        I've truncated the write thread participation but it is essentially a mirror image of the read thread. So, with 3 cooperatively dispatched user threads running in the same kernel thread, the process is able to fully utilise every timeslice allocated to it by the OS.
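        The hand-offs in the steps above can be sketched in plain Perl, with closures standing in for the cooperatively dispatched threads. This only shows the order in which control passes around; there is no real asynchronous IO here, and the in-memory `@input` and `@output` arrays are assumed stand-ins for the devices.

```perl
use strict;
use warnings;

# In-memory stand-ins for the input and output devices (assumptions).
my @input = qw(alpha beta gamma);
my @output;

my ($pending, $current);    # record being read / record being transformed

# Each "thread" is a closure that does one step and returns, which is
# what handing control back to the dispatcher amounts to here.
my $read_step  = sub { $pending = shift @input };    # issue the next read
my $xform_step = sub { uc $current };                # transform one record
my $write_step = sub { push @output, $_[0] };        # hand record to writer

$read_step->();                       # prime the pipeline with the first read
while (defined $pending) {
    $current = $pending;              # step 1: copy the record just read
    $read_step->();                   # step 2: next read is issued at once
    my $result = $xform_step->();     # step 3: transform the current record
    $write_step->($result);           #         and give it to the writer
}
print "$_\n" for @output;
```

        With real fibres the read step would return immediately after queueing the request, so the transform runs while the IO is in flight; in this sketch the ordering is the same but the "read" is instantaneous.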

        Using 3 kernel threads, 2 out of every 3 timeslices allocated to the process have to be given up almost immediately due to IO waits. The time-line for each read-transform-write cycle (simplistically) looks something like:

            read            | xform           | write
            thread          | thread          | thread
            ----------------|-----------------|------------------
            Issue read      | wait lock(in)   | wait lock(out)
            IO wait         |                 |
                            |                 |
            -----------------------------------------------------
                            |                 |
            ~ ~ some unknown number of kernel task switches ~ ~
                            |                 |
            -----------------------------------------------------
            Read completes  |        "        | wait lock(out)
            signal record   |                 |
            issue next read |                 |
            IO wait         | wait lock(in)   |
            -----------------------------------------------------
                            |                 |
            ~ ~ some unknown number of kernel task switches ~ ~
                            |                 |
            -----------------------------------------------------
            IO wait         | obtain lock(in) | wait lock(out)
                            | do stuff        |
                            | do stuff        |
                            | wait lock(out)  |
                            | signal write    |
                            | loop            |
            -----------------------------------------------------
                            |                 |
            ~ ~ some unknown number of kernel task switches ~ ~
                            |                 |
            -----------------------------------------------------
                            | wait lock(in)   | obtain lock(out)
                            |                 | write out
                            |                 | IO wait
                            |                 |

        Even better than the AIO/fibres mechanism above, is overlapped-IO combined with asynchronous procedure calls (APC), but that is "too Redmond" for serious consideration here.
