PerlMonks
Re^2: Parrot, threads & fears for the future.

by BrowserUk (Pope)
on Oct 23, 2006 at 19:09 UTC ( #580131 )


in reply to Re: Parrot, threads & fears for the future.
in thread Parrot, threads & fears for the future.

Clusters are an expensive, complex, workaday solution to the memory limits imposed by 32-bit processors and 32-bit processes.

64-bit processors are (theoretically) capable of addressing 16 million petabytes, and 4- to 16-terabyte process address spaces are already routine. Add to that multiple cores, and multiple array processors in a single core, and you have the potential to do away with the latency, bandwidth restrictions and topology bottlenecks of networked clusters.

Not to mention the need to partition datasets into many files and constantly shuffle data on and off disk, and between machines.

Once you can address entire huge datasets through the simple expedient of opening them as a memory-mapped file, great chunks of the processing time simply disappear. All that is then needed to fully utilise the multiple processors is a few threads, mostly processing independent sections of data (memory), but with threading's unique ability to share data and state directly, without serialising it through high-latency channels.
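A minimal sketch of that expedient in Perl, assuming a perl built with the core PerlIO ':mmap' layer; the dataset here is a throwaway temp file standing in for a "huge" one:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a stand-in "dataset" file (hypothetical; a real one would
# already exist and be far too big to slurp comfortably).
my ( $out, $path ) = tempfile();
print $out "record $_\n" for 1 .. 1000;
close $out;

# Open it via the ':mmap' PerlIO layer: the OS pages data into the
# process address space on demand instead of copying it via read().
open my $in, '<:mmap', $path or die "mmap open failed: $!";
my $count = 0;
$count++ while <$in>;    # line reads just walk the mapping
close $in;

print "$count records\n";
unlink $path;
```

Threads scanning disjoint offsets of such a mapping would each see the same pages without any copying between them.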

The future is threaded.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Parrot, threads & fears for the future.
by tilly (Archbishop) on Oct 23, 2006 at 19:45 UTC
    Sorry, but that's just silly.

    It is a basic economic fact that the price per unit of performance for commodity hardware is far, far lower than for big servers. Clusters are a way for businesses to take advantage of this to get the performance and reliability they want at a much better price point.

    That 64-bit versus 32-bit is irrelevant can be trivially demonstrated. Big 64-bit servers are old news, the big Unix vendors went through that transition a decade ago. (I don't know when IBM's mainframes went through it, but I think it was earlier than that.) Yet in the last decade big iron not only did not replace clusters, but they actually lost ground to them. Why? Because clusters are a lot cheaper.

    Now I'm not denying that big machines offer performance advantages over clusters. You have correctly identified some of those advantages. And I grant that there are plenty of problems that can only be done on a big machine. If you have one of those problems, then you absolutely must swallow the pricetag and buy big iron. But if you can get away with it, you're strongly advised to get a cluster.

    Most problems do not have to run on a huge machine. Clusters are far cheaper than equivalent performance on a big machine. Neither fact seems likely to change in the foreseeable future. As long as they remain true, clusters are going to remain with us.

      This year's big iron is next year's commodity hardware. This year's commodity hardware is 32-bit, dual processor. Next year's will be 64-bit, 2-core. The year after that, 64-bit, 2-core, hyperthreaded (4 CPUs). The year after that...

      I admit, each of those 'years' is really a Moore's cycle. But basically, commodity hardware has fallen in price (10 to 30%) and doubled in performance at each Moore's cycle for the last few. Speed gains through decreasing die size and upping clock speeds are hitting the limits of silicon, ion-beam frequency and mask resolution. For the first time in the PC's history, the next cycle's increase in performance will come from multi-core, multi-CPU machines.

      Intel and AMD are both talking about moving to quad core processors in 2007.

      There are already 8-way motherboards available.

      Put those together and you get a 32-way cluster in a box. If each of Intel's quad cores is also hyperthreaded, 64 CPUs in a box. What about AMD's HyperTransport and XBAR? A small, 1-gigabit, switched network on a chip? Or IBM's Cell architecture: if they are cheap enough to put into games machines, how long before 2 or 4 of them turn up in a PC?

      Sure, they will not be commodity priced next year, but what of the year after?

      The future is threaded ;)


        How about we have a bet on whether clusters are going away?

        I'll bet you that in 2010, people will still be building websites on clusters of commodity hardware, and there will still be a healthy market in load balancers. Furthermore I'll bet you that over 10% of the top 500 supercomputers are clusters. And finally I'll bet you that most Perl programmers won't be writing multi-threaded code. If any of those statements are wrong, you win the bet.

        Now it is true that several trends point to commodity PCs having many CPUs. However it is also true that commodity PCs tend to have many programs running on them at any given time. It is further true that for most programming problems there is an embarrassment of excess when it comes to CPU power.

        Furthermore there are lots of business problems where a single commodity box won't cut it. And nobody is about to change the fact that, in that case, the cheapest way to scale is a cluster of commodity machines.

        And another big argument against multi-threading is that it is hard to do. We have enough trouble finding people who can program semi-competently. Competently programming a multi-threaded program is harder than competently programming a single-threaded one. So even if there is a desire for more multi-threaded programming, we're not going to succeed at it until we find far better approaches.

        A final note. Computing did not begin or end with the PC. We have many kinds of computers around us, and we're going to have more. While PCs evolve into something more like a supercomputer, people are programming their cell phones, PDAs, and a host of other mobile devices. These devices have far more modest performance requirements than PCs do.

        In summary, the future holds every kind of computing we know about, and a lot of kinds that we don't.

Re^3: Parrot, threads & fears for the future.
by sandfly (Beadle) on Oct 30, 2006 at 09:07 UTC
    I want to pick up on the serialisation point.

    In my work, lots of tasks are of the form "read, transform, transform, transform, write" on quite large datasets. The read and write are always I/O bound, but the transforms can be cpu bound. I'd like to run the subtasks in parallel, to speed up execution, but it isn't worth forking processes for the transformations - the cost of serialisation/deserialisation is too high.

    What would help is a threading model like fork, except instead of standard i/o channels the threads/processes would be able to expose native datastructures for direct read and/or write by their siblings.

    Is there any facility like this in existence?

      Well, that is how (real) threading works. The problem is that Perl was not designed to support threading, and trying to bolt threading onto the side of Perl has been attempted for years without anything particularly good being produced. The prior attempts at supporting threading in Perl were way too buggy and were eventually abandoned. The latest Perl threading tries to avoid the bugs (and is less buggy, but still buggy enough in my experience that avoiding it is usually wise) by having each thread duplicate the entire Perl interpreter. This gives us a sort of "worst of both worlds" situation (no operating system protections like we have with real fork, yet even more copying, memory use, and slowness than real fork). So I don't really consider it "threading" (more like a rather bizarre specialized use of threads, not a general-purpose threading implementation) nor an appropriate tool for the vast majority of cases.

      If you had real threads, then what you describe would be completely natural. With Perl's current threading model, even shared data structures aren't simply shared (so someone said) so I doubt it would work well to try to share a large data structure using the current Perl threads.
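      For reference, a minimal sketch of what the current ithreads sharing model does offer, using the core threads and threads::shared modules (assumes a threads-enabled perl); only variables explicitly marked ':shared' are visible across threads, which is part of why sharing a large existing structure is so awkward:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Only this array is shared; everything else is duplicated per thread.
my @results : shared;

# Each worker appends its square to the shared array.
my @workers = map {
    my $n = $_;
    threads->create( sub { push @results, $n * $n } );
} 1 .. 4;
$_->join for @workers;

print join( ',', sort { $a <=> $b } @results ), "\n";
```

Note that nested references inside a shared container must themselves be shared, so deep structures cannot simply be dropped in.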

      Another approach for something like this is to use shared memory. Both Unix and Win32 support very nice shared memory facilities. However, Perl nearly refuses to deal with memory other than what its chosen malloc() hands to it, so getting Perl to use shared memory is quite difficult and once you get it to use it, you still end up serializing between the shared memory (that Perl won't work with directly) and Perl scalars. So, again, it likely isn't an appropriate solution for your problem.
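      A bare-bones sketch of that route using Perl's built-in shmget/shmread/shmwrite (Unix only); note the copy in and out of the segment, which is exactly the serialising step described above:

```perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_CREAT IPC_RMID S_IRUSR S_IWUSR);

# Create a private 1 KB SysV shared memory segment.
my $id = shmget( IPC_PRIVATE, 1024, IPC_CREAT | S_IRUSR | S_IWUSR )
    // die "shmget: $!";

# Writing copies the Perl scalar's bytes INTO the segment...
shmwrite( $id, "hello from shm", 0, 14 ) or die "shmwrite: $!";

# ...and reading copies them back OUT into a fresh Perl scalar.
my $buf = '';
shmread( $id, $buf, 0, 14 ) or die "shmread: $!";
print "$buf\n";

shmctl( $id, IPC_RMID, 0 );    # mark the segment for removal
```

A second process given the same $id could read the same bytes, but each side still pays the copy between the segment and its own scalars.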

      - tye        

      Unfortunately, the cost of using iThreads shared memory, required for the read and write buffers, is so high that using iThreads to do overlapped IO is impractical:

      cmpthese -1, {
          shared    => q[ my $x : shared; ++$x for 1 .. 1e6 ],
          nonshared => q[ my $x;          ++$x for 1 .. 1e6 ],
      };;
      (warning: too few iterations for a reliable count)
                  s/iter    shared nonshared
      shared        1.31        --      -89%
      nonshared    0.141      834%        --

      There are other problems also. Whilst thread == interpreter, each read and write means giving up that thread's timeslice and a task switch before the transform thread can do work. But with interpreter == a kernel thread, when the task switch occurs there is no guarantee (in fact, a very low probability) that the transform thread will get the next timeslice, as the round robin runs over all kernel threads: those of this process and all others in the system. The upshot is that it takes at least 3 (or more) task switches to read and transform a record, and at least 3 more to write one.

      The idealised situation would be that as soon as the transform thread has got hold of the last record read, the read thread would issue the read for the next one--going straight into the IO wait--and the transform thread would be able to continue the timeslice. You cannot arrange for that to happen using kernel threads. At least not on a single-CPU machine, where it would be of most benefit.

      If thread != interpreter, i.e. if more than one thread could run within a single interpreter, then you could use cooperative (user-space, user-dispatched) threads ('fibres' in Win32 terms; 'unbound threads' in Solaris terms) to achieve this:

      1. The transform thread copies the previously read record and transfers control to the read thread.
      2. The read thread issues an asyncIO request for the next record and then transfers control back to the transform thread.
      3. When the transform thread finishes with this record it gives it to the write thread; loops back and transfers control back to the read thread.
      4. The read thread then does its wait for IO completion, which normally will have already completed whilst the transform thread was running, so no wait occurs. It then transfers control back to the transform thread, which copies the new record, and we're back to step 1.

      I've truncated the write thread participation but it is essentially a mirror image of the read thread. So, with 3 cooperatively dispatched user threads running in the same kernel thread, the process is able to fully utilise every timeslice allocated to it by the OS.

      Using 3 kernel threads, 2 out of every 3 timeslices allocated to the process have to be given up almost immediately due to IO waits. The time-line for each read-transform-write cycle (simplistically) looks something like:

      read            | xform          | write
      thread          | thread         | thread
      ----------------|----------------|----------------
      Issue read      | wait lock(in)  | wait lock(out)
      IO wait         |                |
                      |                |
      --------------------------------------------------
       ~ ~ some unknown number of kernel task switches ~ ~
      --------------------------------------------------
      Read completes  | "              | wait lock(out)
      signal record   |                |
      issue next read |                |
      IO wait         | wait lock(in)  |
      --------------------------------------------------
       ~ ~ some unknown number of kernel task switches ~ ~
      --------------------------------------------------
      IO wait         | obtain lock(in)| wait lock(out)
                      | do stuff       |
                      | do stuff       |
                      | wait lock(out) |
                      | signal write   |
                      | loop           |
      --------------------------------------------------
       ~ ~ some unknown number of kernel task switches ~ ~
      --------------------------------------------------
                      | wait lock(in)  | obtain lock(out)
                      |                | write out
                      |                | IO wait
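      For comparison, the kernel-threads version of the read-transform-write pipeline can be sketched with the core threads and Thread::Queue modules (assumes a threads-enabled perl); every enqueue copies the record between interpreters, which is the serialisation cost under discussion:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $in  = Thread::Queue->new;    # read  -> transform
my $out = Thread::Queue->new;    # transform -> write

# Transform thread: uppercase each record (a stand-in transform).
my $xform = threads->create( sub {
    while ( defined( my $rec = $in->dequeue ) ) {
        $out->enqueue( uc $rec );
    }
    $out->enqueue(undef);        # propagate end-of-stream
} );

# Write thread: collect the transformed records.
my $writer = threads->create( sub {
    my @written;
    while ( defined( my $rec = $out->dequeue ) ) {
        push @written, $rec;
    }
    return join '|', @written;
} );

# The "read" side: the main thread feeds records, then signals EOF.
$in->enqueue($_) for qw(alpha beta gamma);
$in->enqueue(undef);

$xform->join;
my $result = $writer->join;
print "$result\n";
```

Each dequeue blocks until a record arrives, so the scheduling gaps shown in the timeline above are hidden inside the queue waits rather than eliminated.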

      Even better than the AIO/fibres mechanism above, is overlapped-IO combined with asynchronous procedure calls (APC), but that is "too Redmond" for serious consideration here.


        Thanks to you and tye.

        It's not necessarily as bad as two out of three timeslices immediately given up for I/O - the transform could potentially consume many timeslices. However, I take your point that the reader and writer will probably make poor use of their timeslices.
