Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Re^7: Parrot, threads & fears for the future.

by tilly (Archbishop)
on Oct 25, 2006 at 00:46 UTC ( #580443=note: print w/replies, xml ) Need Help??

in reply to Re^6: Parrot, threads & fears for the future.
in thread Parrot, threads & fears for the future.

I still strongly disagree.

For instance you say that, That those currently having to use the smaller (4- to 32-way) clusters to get their work done will soon no longer need to deal with the latency, restricted bandwidth and topological problems involved with clusters, nor the complexity and expense of cluster-based software solutions, because they'll be able to use simpler, cheaper, cluster in a box solutions. And thereby miss the huge point that people use clusters to achieve a combination of performance and reliability.

Take, for instance, the lowly webserver cluster. Even if you're serving a million dynamic pages per day, your webserver requirements are pretty modest. 2, maybe 3 decent pizza boxes covers it. However you want more machines than that. Maybe you have 5 or so. Why the extra machines? Because if something goes wrong with one of them, you can just pull it out of service and work on it. You want both performance and reliability. And reliability is achieved through having redundant hardware.

Replacing a small cluster like this with one machine would be a stupid idea unless that one machine was engineered for an insane degree of reliability. At which point it is going to cost so much more that you'd be dumb to choose that route. (Unless the machine was already available for some other reason...)

Because of this fact I guarantee that small clusters of machines are not going away any time soon. Moore's Law may deliver machines whose performance exceeds the need, but the needs of commodity users guarantees that there is no pressure to make those machines reliable enough for many business uses. And businesses will achieve that reliability through clustering.

About what Google will and will not do, this will make you happy. Their request of chip designers (which looks like it is being paid attention to) is more multi-threading on chip, and less speculation in execution plans. The reason for this is that any work that computers do takes electricity. Therefore if chips do a lot of code analysis and speculative out-of-order operations, they use more electricity per completed instruction than they do if they have multiple parallel executing threads. Given that declining hardware costs make electricity an ever bigger component of total cost, multi-threaded cores are more power efficient than single-threaded ones.

That said, Google does not have people do a lot of multi-threaded programming. Their approach consists of having people doing a lot of single-threaded programming and then putting them together in easily parallelizable chunks. Yes, I'm aware that this is similar to your proposal for how to actually write multi-threaded code. Yes, I'm aware that pugs is implementing something like this. However my impression remains that Perl is a bad language to try to do this with simply because it is so darned hard to prove that arbitrary code won't have side-effects and therefore can be parallelized.

I could be wrong about that, and I'll applaud loudly if people prove me wrong by succeeding very well at it.

(Random note. Another kind of "easily parallizable chunk" is a piece of SQL. Databases have moved towards doing parallel query execution, and will move farther that way as time goes on. Again the programmer just writes single-threaded code. But behind the scenes it is parallelized. However in this case the programming language does none of the work.)

A final note about personal experience. I only bring up mine when I think it really is relevant to the point at hand. Which has nothing to do with breadth, and a lot to do with specifics. You definitely have been in computers longer than I have, and it sounds like you've done more kinds of things with them than I have. But that won't change whether or not my experience has any bearing on a specific example at hand. And if it does, I'll mention it. And if it doesn't, I won't.

You'll note that I wasn't the one who brought my experience up in this thread. You were. And you brought up a whole ton of experience and then said nothing relevant to the point that you were talking to. Which was my claim that even if the PC heads towards multi-threaded implementations, that doesn't mean that programming is all headed towards being multi-threaded.If you think I was trying to "put you in your place" based on my depth of experience, then I'd suggest re-reading the thread. You made a claim that I thought was silly, and I responded honestly to it. (For the record, I think that any claim of the form, The future of programming is X is silly. Programming is going in too many directions at once to be so simply characterized.)

You'll also note that you didn't say that your wide experience contradicts the point that I made about different kinds of computing devices being out there. Which is that even if the PC is headed towards having many parallel threads of execution, that doesn't indicate that one can simply say that threading is the future of programming.

  • Comment on Re^7: Parrot, threads & fears for the future.

Replies are listed 'Best First'.
Re^8: Parrot, threads & fears for the future.
by BrowserUk (Pope) on Oct 25, 2006 at 00:53 UTC
    that doesn't mean that programming is all headed towards being multi-threaded.

    Straw man.


    I apologise for this post.

    It was stupid and crass and exactly what I've taken others to task for doing in the past. (Ie. Picking out one element and using it dismiss the entire post).

    I am sorry and will follow up in detail if anyone is interested?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You could follow up in detail just there instead of apologizing.

      I'm always interested in good posts, specially in those debating controversial matters with a low flame/info ratio. Since you shurely don't expect exhortation from each of us monks I boldly stand up and shout "Yes, We Are!". From the votes cast upon this node I'll deduce whether that has been a good idea, and they might also answer your question... ;-)


      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

        Okay. Tilly makes three "threads and the future" points and a random thought in his post, which I hope I have correctly paraphrased here:

        1. Small clusters will not go away with the commoditisation of 'clusters in a box', because clusters provide not just raw throughput, but also reliability through redundancy.

          This is just too difficult (for me) to predict.

          • Historically, integration has increased reliability.
            • In a cluster of 4 single cpu commodity boxes, you have the redundancy of 4 power supplies; 4 disks; 4 cpus; 4 sets of ram; 4 network cards.

              You also have the weaknesses of 8 power cable connectors; 8 disk connectors; 4 cpu sockets; 4 - 16 ram sockets; 4 to 8 card connectors; 8 network connectors. And 1 or more network switches or hubs.

            • In a 4-way multi-processor box, you only have 1 power supply, 1 disk, 1-cpu, 1 set of ram, 1 network card. That's at least 5 single-points-of-failure.

              However, you only have 2 power connectors; 2 disk connectors; 1 cpu socket; 1 to 4 ram connectors; 1 or 2 card sockets; 2 network connectors; a single (socket on a) hub or switch.

            Generally, integrated systems are more reliable than connected systems. Cable connector problems account for a much higher proportion of total system losses than do cpus. Simplistically, that's 37/53 points of possible failure with a cluster, compared to 10/14 the cluster in a box.

          • Perhaps more importantly, a lot of software running upon clusters also has a single point of failure (SPOF). The master controller/load balancer process.

            In the Google arrangement, one copy of the MapReduce task is designated the Master and it distributes and coordinates all the other Map and Reduce tasks for the cluster. The Google papers make light of the possibility of Master failure:

            Master Failure

            It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

            which given the shear volume of their experience in the field is hard to challenge.

            However, the implication is that if you have a single computation running on a cluster of N boxes, the chances of that computation suffering a failure are N times higher than if that computation is running on a single box. The other side of the coin is that with the cluster, the computation is more likely to complete, provided that the failure occurs elsewhere than the Master, whereas on the lower odds single box it would be catastrophic.

            Here's a bit of probability theory that I'm not mentally equipped to solve:

            • If you have an N-box cluster, with each box having a certain Mean Time Between Failures (MTBF) rating, and you randomly chose 1 of them to be your Master; versus a single box with the same MTBF rating; which is more likely to result in an incomplete computation?
            • With N boxes, the odds of a failure in any given time period are N times higher than with a single box, but the odds of the failure being the Master are 1/N. N/N = 1. Without failure detection and automated hand-off, doesn't that mean that the probabilistic reliability of the two is equivalent?
          • In the clustered machines scenario, each machine can only run tasks scheduled to it.

            Between jobs, it still draws power. In Googles terms, an idle machine is a direct cost to the balance sheet.

            Equally, the longest path through a cluster for a computation is the greatest sum of time of the tasks assigned to each processor. Ie. When a processor/box has finished its assigned tasks, it becomes idle.

            Conversely, in a SMP architected multi-core system, the longest path is the total sum of the tasks divided by the number of cpus/cores. That is, each task within a computation can be time-sliced across all the processors. As smaller tasks in the computation are completed, longer running tasks get more time-slices on more cpus. The overall throughput is higher.

            The energy savings from avoiding idling processors in the cluster-in-a-box (CIAB) scenario is only achieved if there is a steady flow of computations to utilise it. But even when idling, a the CIAB has 1/N the standing power draw due to losses in the power supply, standby power consumption etc., than an N-way cluster.

            Of course, reliability maybe more important that power consumption for the vast majority of small cluster users. To that end, a two-way cluster of 4-core machines may be preferable to a single 8-way machine. That still doesn't address the issue of the Master/load balancer software, SPOF, but that can be addressed.

        2. Google wants (and looks like getting), multiple, simple cores per CPU, rather than single, complex cores--but that this is for energy cost saving rather than any inherent need or desire to use multi-threading in software.

          In Googles MapReduce model based around the strengths and limitations of current hardware, with its read-map-reduce-write chains, there is little reason for them to look at threading. There is little apparent scope for improving throughput by using shared memory.

          Upon closer inspection, the bit between the Map and the Reduce steps, which is barely described in much of the MapReduce literature:

          3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-deneed Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

          4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

          proves to be quite interesting. There is a set of slides that shows that there is another step between the Map and Reduce tasks that barely raises a mention elsewhere, presumably because it is a part of the infrastructure of the systems rather than a part of the users view of it, called 'shuffle'.

          In this, the key/value pairs emitted by the discrete Map tasks are buffered in memory and the written to local disk storage (The locality optimisation). What isn't clearly described is that this step does a partitioning of the outputs from each Map task and that prior to the Reduce tasks getting their hands on those partitions, the output from the many Map tasks has to be collated, aggregated and sorted. The single key/value pairs emitted by the many Map steps have to be aggregated into key/[values] (lists of values associated with each key). As the keys/value pairs can originated from any of the Map tasks (and there is generally a many to one relationship between Map tasks and Reduce tasks), this is done by distributing the output from each Map task across a set of local files. As generally, more than one key is appended to each file, these need to be sorted prior to there being processed by the Reduce tasks.

          A close look at the stats & graphs on slides 11 through 15 shows that there is no overlap between Map tasks and Reduce tasks--they run serially. Although, there is overlap between Map & shuffle tasks, and between shuffle and Reduce tasks.

          It seems clear (to me) that the partitioning/writing portion of the shuffle is performed by the post-amble of the Map tasks; and the reading/sorting is performed by the preamble to the Reduce tasks. Even with an insertion sort, the data is not sorted until the last value is available. <

          The "current hardware" limitation comes in because the dataset size for the intermediate files is normally bigger than 32-bit hardware memory limits. This therefore requires the use of external sorts & separate sort tasks. Each adding (at least one, often more) additional read-write disk cycles to the chain.

          Now imagine the removal of this process memory limitation, by the expedient of moving to 64-processors. Instead of aggregating key/values pairs by writing each pair to disk files, you simply stored these directly into a shared memory HoAs. As each Key/Value pair is emitted by a Map thread, it is partitioned by pushing its value directly into the appropriate array associated with its key. In Perl terms that means that the ImitIntermediate( $key, $value ) simply becomes:

          my %intermediate; sub EmitIntermediate { my( $key, $value ) = @_; push @{ $intermediate{ $key } }, $value; return; }

          The sort step becomes an in-place, in-memory affair, and the iterator supplied to the Reduce thread, is simply and integer index through the sorted array.

          Along the way, you've increased the benefit from the 'locality optimisation', by using local memory instead of local disk. You've avoided the need for another local read/write cycle for external sorting. And you've avoided the need for a remote file read by the Reduce task. You've also removed the need for the Master to coordinate the intermediate value locations between the Map and Reduce tasks and machines. Not to mention the gazillions of IO waits states and associated task swaps.

          Most importantly, the User view of the MapReduce programming model, the Map & Reduce functions, have not changed at all. They are still small (lines of code), linear (no additional complexity), single threaded.

          • Will Google go multi-threaded?

            Your guess is as good as mine, probably better, but I think the strength of their MapReduce model is the transparency of the underlying implementation to the User view of it. That it lends itself to user transparent threading is not in question. Whether future hardware and OSs will make the increase in complexity of the infrastructure code worth the improvements in performance that could be derived is probably only answerable by testing it on future hardware when it's available.

          • How will that affect the development costs -v- running costs -v- performance -v- reliability four-way balance?

            Again, it's simply too early to call.

          • How applicable are Google's needs to the rest of the world?

            For traditional (seems strange to use that word in relationship to internet technologies. Time rolls on:), stateless, connectionless, HTTP serving doesn't (much) benefit from threaded servers. After the initial connection from a client, subsequent requests from that client will usually be serviced by other threads. Web applications still have to be written as a series of states (phases), which coordinate statefullness through a (3rd party process) DB keyed upon cookies/hidden fields with all the overhead and hysteresis that implies. The threads have to persist for long periods and that introduces the possibility of resource leaks over time.

            Another recent innovation being used, promoted and exploited by Google is the so called Web2 interface. Otherwise known as AJAX. In this, connections persist and are re-used over the duration of one clients connection to the server. In this model, threading (used correctly) could be a definite boon. If threads are light enough (discounts the iThreads model), then you spawn a thread when a client connects and each application can become a single, linear flow with persistent connections and intermediate state held by the thread. With a clustered, load balancing server approach, this doesn't work well because each thread can only run on the box to which the initial connection was routed.

            If all the processors live in the same box, then the thread can simply lay dormant (in an IO-wait state) in the shared (SMP) scheduler queue until a subsequent request arrives for it, when it is woken up, services the request and goes back to sleep. With a suitable timeout on the connection to detect clients that just "go away", the thread can persist any unsaved state to a DB and die, cleaning up its non-shared resources as it goes.

            It also introduces the possibility of having supervisor threads that monitor and coordinate between clients allowing client to client communications, like the CB, to avoid the need for disk-bound persistence of transient state.

            It also allows for the sharing of post-spawn data between client threads. For example, commonly used images can be cached by the process and used by all client threads. For this to work in the fork model, all common images must be pre-loaded by the HTTP daemon and then shared (via COW). A shared memory cache can be utilised from all threads, and can dynamically both grow and change to reflect dynamic demand. The same applies to other data, including code. An LRU cache of classes in shared memory can be utilised by all threads and adjust dynamically to demand.

        3. Due to its inherently side-effectful nature, Perl is not the right language for multi-threaded programming.

          This is the Functional Programming view of the world, where variables are not variable, and doing anything that involves the world, including accepting read-only information from it--eg. looking at it--requires that you employ a proxy to interact with it on your behalf, and treat the world after your interaction as a completely new and different world from that you started with.

          Hmm. Is that a little jaundiced? The point here is that there are plenty of other languages that do not enshew the idea of mutable variables, that have successfully integrate parallelism using threads, be they kernel or user space threads. Their secret appears to be that they recognise that whilst some uses of threading involve concurrent operations on shared data, many others do not.

          Many uses of threads only involve minimal interaction with shared data, usually at the beginning and end, or once per cycle around a loop. For some other uses, threads are a convenient way of avoiding having to explicitly remember where you've got to in a piece of code whilst you do something else. Much as the stack acts as a repository for a loop counter and intermediates results store in a recursive algorithm, the local (lexical) variables already used in most Perl code to retain the current set of working parameters (state) serve the dual purpose of retaining that state across wait states and context switches whilst other threads get on with other stuff.

          Threads are no different from processes in this respect. You allow the OS to suspend operations of one context of execution and get on with another whilst the first waits for something that takes a long time. The advantage that threads brings to the table is that once the wait is done, the thread can communicate its results back to the spawning thread, or other peer threads in a simple, timely manner.

          This communication, in common with all asynchronous communications, has some overheads and disciplines. In the case of shared memory that means using mutex or semaphores to prevent race conditions; with pipes and sockets you have handshaking; if you use the file system, it's file and/or record locking. All else being equal, communications via shared memory involves the lowest overhead, highest performance, least primitives and simplest code.

        4. The declarative nature of SQL, and the tabular nature of relational data, means that RDBMSs are moving into the world of parallel query execution.

          SQL certainly appears to lends itself to being threaded. However, whilst the declarative nature of the syntax suggests that parallelisation is simple, the hidden semantics of foreign keys and dependency checking; dynamically maintained indexes; and multi-user accesses mean that it is far from trivial.

          In addition, for the application programmer looking to increase the performance/throughput of their application, the overheads of communicating with a DBMS, combined with the uncertainty and variability of response times due to the unknown of other application demands upon the DB server, make the benefits from this parallelisation less readily realisable than might first appear.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://580443]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (2)
As of 2021-09-17 14:24 GMT
Find Nodes?
    Voting Booth?

    No recent polls found