in reply to Re^9: Parrot, threads & fears for the future.
in thread Parrot, threads & fears for the future.
Okay. Tilly makes three "threads and the future" points and a random thought in his post, which I hope I have correctly paraphrased here:
- Small clusters will not go away with the commoditisation of 'clusters in a box', because clusters provide not just raw throughput, but also reliability through redundancy.
This is just too difficult (for me) to predict.
- Historically, integration has increased reliability.
- In a cluster of 4 single cpu commodity boxes, you have the redundancy of 4 power supplies; 4 disks; 4 cpus; 4 sets of ram; 4 network cards.
You also have the weaknesses of 8 power cable connectors; 8 disk connectors; 4 cpu sockets; 4 - 16 ram sockets; 4 to 8 card connectors; 8 network connectors. And 1 or more network switches or hubs.
- In a 4-way multi-processor box, you only have 1 power supply, 1 disk, 1-cpu, 1 set of ram, 1 network card. That's at least 5 single-points-of-failure.
However, you only have 2 power connectors; 2 disk connectors; 1 cpu socket; 1 to 4 ram connectors; 1 or 2 card sockets; 2 network connectors; a single (socket on a) hub or switch.
Generally, integrated systems are more reliable than connected systems. Cable connector problems account for a much higher proportion of total system losses than do cpus. Simplistically, that's 37/53 points of possible failure with a cluster, compared to 10/14 the cluster in a box.
- In a cluster of 4 single cpu commodity boxes, you have the redundancy of 4 power supplies; 4 disks; 4 cpus; 4 sets of ram; 4 network cards.
- Perhaps more importantly, a lot of software running upon clusters also has a single point of failure (SPOF). The master controller/load balancer process.
In the Google arrangement, one copy of the MapReduce task is designated the Master and it distributes and coordinates all the other Map and Reduce tasks for the cluster. The Google papers make light of the possibility of Master failure:
Master Failure
It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.
which given the shear volume of their experience in the field is hard to challenge.
However, the implication is that if you have a single computation running on a cluster of N boxes, the chances of that computation suffering a failure are N times higher than if that computation is running on a single box. The other side of the coin is that with the cluster, the computation is more likely to complete, provided that the failure occurs elsewhere than the Master, whereas on the lower odds single box it would be catastrophic.
Here's a bit of probability theory that I'm not mentally equipped to solve:
- If you have an N-box cluster, with each box having a certain Mean Time Between Failures (MTBF) rating, and you randomly chose 1 of them to be your Master; versus a single box with the same MTBF rating; which is more likely to result in an incomplete computation?
- With N boxes, the odds of a failure in any given time period are N times higher than with a single box, but the odds of the failure being the Master are 1/N. N/N = 1. Without failure detection and automated hand-off, doesn't that mean that the probabilistic reliability of the two is equivalent?
- In the clustered machines scenario, each machine can only run tasks scheduled to it.
Between jobs, it still draws power. In Googles terms, an idle machine is a direct cost to the balance sheet.
Equally, the longest path through a cluster for a computation is the greatest sum of time of the tasks assigned to each processor. Ie. When a processor/box has finished its assigned tasks, it becomes idle.
Conversely, in a SMP architected multi-core system, the longest path is the total sum of the tasks divided by the number of cpus/cores. That is, each task within a computation can be time-sliced across all the processors. As smaller tasks in the computation are completed, longer running tasks get more time-slices on more cpus. The overall throughput is higher.
The energy savings from avoiding idling processors in the cluster-in-a-box (CIAB) scenario is only achieved if there is a steady flow of computations to utilise it. But even when idling, a the CIAB has 1/N the standing power draw due to losses in the power supply, standby power consumption etc., than an N-way cluster.
Of course, reliability maybe more important that power consumption for the vast majority of small cluster users. To that end, a two-way cluster of 4-core machines may be preferable to a single 8-way machine. That still doesn't address the issue of the Master/load balancer software, SPOF, but that can be addressed.
- Historically, integration has increased reliability.
- Google wants (and looks like getting), multiple, simple cores per CPU, rather than single, complex cores--but that this is for energy cost saving rather than any inherent need or desire to use multi-threading in software.
In Googles MapReduce model based around the strengths and limitations of current hardware, with its read-map-reduce-write chains, there is little reason for them to look at threading. There is little apparent scope for improving throughput by using shared memory.
Upon closer inspection, the bit between the Map and the Reduce steps, which is barely described in much of the MapReduce literature:
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-deneed Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
proves to be quite interesting. There is a set of slides that shows that there is another step between the Map and Reduce tasks that barely raises a mention elsewhere, presumably because it is a part of the infrastructure of the systems rather than a part of the users view of it, called 'shuffle'.
In this, the key/value pairs emitted by the discrete Map tasks are buffered in memory and the written to local disk storage (The locality optimisation). What isn't clearly described is that this step does a partitioning of the outputs from each Map task and that prior to the Reduce tasks getting their hands on those partitions, the output from the many Map tasks has to be collated, aggregated and sorted. The single key/value pairs emitted by the many Map steps have to be aggregated into key/[values] (lists of values associated with each key). As the keys/value pairs can originated from any of the Map tasks (and there is generally a many to one relationship between Map tasks and Reduce tasks), this is done by distributing the output from each Map task across a set of local files. As generally, more than one key is appended to each file, these need to be sorted prior to there being processed by the Reduce tasks.
A close look at the stats & graphs on slides 11 through 15 shows that there is no overlap between Map tasks and Reduce tasks--they run serially. Although, there is overlap between Map & shuffle tasks, and between shuffle and Reduce tasks.
It seems clear (to me) that the partitioning/writing portion of the shuffle is performed by the post-amble of the Map tasks; and the reading/sorting is performed by the preamble to the Reduce tasks. Even with an insertion sort, the data is not sorted until the last value is available. <
The "current hardware" limitation comes in because the dataset size for the intermediate files is normally bigger than 32-bit hardware memory limits. This therefore requires the use of external sorts & separate sort tasks. Each adding (at least one, often more) additional read-write disk cycles to the chain.
Now imagine the removal of this process memory limitation, by the expedient of moving to 64-processors. Instead of aggregating key/values pairs by writing each pair to disk files, you simply stored these directly into a shared memory HoAs. As each Key/Value pair is emitted by a Map thread, it is partitioned by pushing its value directly into the appropriate array associated with its key. In Perl terms that means that the ImitIntermediate( $key, $value ) simply becomes:
my %intermediate; sub EmitIntermediate { my( $key, $value ) = @_; push @{ $intermediate{ $key } }, $value; return; }The sort step becomes an in-place, in-memory affair, and the iterator supplied to the Reduce thread, is simply and integer index through the sorted array.
Along the way, you've increased the benefit from the 'locality optimisation', by using local memory instead of local disk. You've avoided the need for another local read/write cycle for external sorting. And you've avoided the need for a remote file read by the Reduce task. You've also removed the need for the Master to coordinate the intermediate value locations between the Map and Reduce tasks and machines. Not to mention the gazillions of IO waits states and associated task swaps.
Most importantly, the User view of the MapReduce programming model, the Map & Reduce functions, have not changed at all. They are still small (lines of code), linear (no additional complexity), single threaded.
- Will Google go multi-threaded?
Your guess is as good as mine, probably better, but I think the strength of their MapReduce model is the transparency of the underlying implementation to the User view of it. That it lends itself to user transparent threading is not in question. Whether future hardware and OSs will make the increase in complexity of the infrastructure code worth the improvements in performance that could be derived is probably only answerable by testing it on future hardware when it's available.
- How will that affect the development costs -v- running costs -v- performance -v- reliability four-way balance?
Again, it's simply too early to call.
- How applicable are Google's needs to the rest of the world?
For traditional (seems strange to use that word in relationship to internet technologies. Time rolls on:), stateless, connectionless, HTTP serving doesn't (much) benefit from threaded servers. After the initial connection from a client, subsequent requests from that client will usually be serviced by other threads. Web applications still have to be written as a series of states (phases), which coordinate statefullness through a (3rd party process) DB keyed upon cookies/hidden fields with all the overhead and hysteresis that implies. The threads have to persist for long periods and that introduces the possibility of resource leaks over time.
Another recent innovation being used, promoted and exploited by Google is the so called Web2 interface. Otherwise known as AJAX. In this, connections persist and are re-used over the duration of one clients connection to the server. In this model, threading (used correctly) could be a definite boon. If threads are light enough (discounts the iThreads model), then you spawn a thread when a client connects and each application can become a single, linear flow with persistent connections and intermediate state held by the thread. With a clustered, load balancing server approach, this doesn't work well because each thread can only run on the box to which the initial connection was routed.
If all the processors live in the same box, then the thread can simply lay dormant (in an IO-wait state) in the shared (SMP) scheduler queue until a subsequent request arrives for it, when it is woken up, services the request and goes back to sleep. With a suitable timeout on the connection to detect clients that just "go away", the thread can persist any unsaved state to a DB and die, cleaning up its non-shared resources as it goes.
It also introduces the possibility of having supervisor threads that monitor and coordinate between clients allowing client to client communications, like the CB, to avoid the need for disk-bound persistence of transient state.
It also allows for the sharing of post-spawn data between client threads. For example, commonly used images can be cached by the process and used by all client threads. For this to work in the fork model, all common images must be pre-loaded by the HTTP daemon and then shared (via COW). A shared memory cache can be utilised from all threads, and can dynamically both grow and change to reflect dynamic demand. The same applies to other data, including code. An LRU cache of classes in shared memory can be utilised by all threads and adjust dynamically to demand.
- Will Google go multi-threaded?
- Due to its inherently side-effectful nature, Perl is not the right language for multi-threaded programming.
This is the Functional Programming view of the world, where variables are not variable, and doing anything that involves the world, including accepting read-only information from it--eg. looking at it--requires that you employ a proxy to interact with it on your behalf, and treat the world after your interaction as a completely new and different world from that you started with.
Hmm. Is that a little jaundiced? The point here is that there are plenty of other languages that do not enshew the idea of mutable variables, that have successfully integrate parallelism using threads, be they kernel or user space threads. Their secret appears to be that they recognise that whilst some uses of threading involve concurrent operations on shared data, many others do not.
Many uses of threads only involve minimal interaction with shared data, usually at the beginning and end, or once per cycle around a loop. For some other uses, threads are a convenient way of avoiding having to explicitly remember where you've got to in a piece of code whilst you do something else. Much as the stack acts as a repository for a loop counter and intermediates results store in a recursive algorithm, the local (lexical) variables already used in most Perl code to retain the current set of working parameters (state) serve the dual purpose of retaining that state across wait states and context switches whilst other threads get on with other stuff.
Threads are no different from processes in this respect. You allow the OS to suspend operations of one context of execution and get on with another whilst the first waits for something that takes a long time. The advantage that threads brings to the table is that once the wait is done, the thread can communicate its results back to the spawning thread, or other peer threads in a simple, timely manner.
This communication, in common with all asynchronous communications, has some overheads and disciplines. In the case of shared memory that means using mutex or semaphores to prevent race conditions; with pipes and sockets you have handshaking; if you use the file system, it's file and/or record locking. All else being equal, communications via shared memory involves the lowest overhead, highest performance, least primitives and simplest code.
- The declarative nature of SQL, and the tabular nature of relational data, means that RDBMSs are moving into the world of parallel query execution.
SQL certainly appears to lends itself to being threaded. However, whilst the declarative nature of the syntax suggests that parallelisation is simple, the hidden semantics of foreign keys and dependency checking; dynamically maintained indexes; and multi-user accesses mean that it is far from trivial.
In addition, for the application programmer looking to increase the performance/throughput of their application, the overheads of communicating with a DBMS, combined with the uncertainty and variability of response times due to the unknown of other application demands upon the DB server, make the benefits from this parallelisation less readily realisable than might first appear.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re^11: Parrot, threads & fears for the future.
by tilly (Archbishop) on Nov 08, 2006 at 04:07 UTC | |
by BrowserUk (Patriarch) on Nov 08, 2006 at 07:20 UTC |