http://www.perlmonks.org?node_id=581366


in reply to Re^9: Parrot, threads & fears for the future.
in thread Parrot, threads & fears for the future.

Okay. Tilly makes three "threads and the future" points and a random thought in his post, which I hope I have correctly paraphrased here:

  1. Small clusters will not go away with the commoditisation of 'clusters in a box', because clusters provide not just raw throughput, but also reliability through redundancy.

    This is just too difficult (for me) to predict.

    • Historically, integration has increased reliability.
      • In a cluster of 4 single cpu commodity boxes, you have the redundancy of 4 power supplies; 4 disks; 4 cpus; 4 sets of ram; 4 network cards.

        You also have the weaknesses of 8 power cable connectors; 8 disk connectors; 4 cpu sockets; 4 to 16 ram sockets; 4 to 8 card connectors; 8 network connectors. And 1 or more network switches or hubs.

      • In a 4-way multi-processor box, you only have 1 power supply, 1 disk, 1 cpu, 1 set of ram, 1 network card. That's at least 5 single-points-of-failure.

        However, you only have 2 power connectors; 2 disk connectors; 1 cpu socket; 1 to 4 ram connectors; 1 or 2 card sockets; 2 network connectors; a single (socket on a) hub or switch.

      Generally, integrated systems are more reliable than connected systems. Cable connector problems account for a much higher proportion of total system losses than do cpus. Simplistically, that's 37/53 points of possible failure with a cluster, compared to 10/14 for the cluster-in-a-box.

    • Perhaps more importantly, a lot of software running upon clusters also has a single point of failure (SPOF). The master controller/load balancer process.

      In the Google arrangement, one copy of the MapReduce task is designated the Master and it distributes and coordinates all the other Map and Reduce tasks for the cluster. The Google papers make light of the possibility of Master failure:

      Master Failure

      It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

      which, given the sheer volume of their experience in the field, is hard to challenge.

      However, the implication is that if you have a single computation running on a cluster of N boxes, the chances of that computation suffering a failure are N times higher than if it were running on a single box. The other side of the coin is that with the cluster, the computation is more likely to complete, provided the failure occurs somewhere other than the Master; whereas on the single box, though the odds of a failure are lower, any failure is catastrophic.

      Here's a bit of probability theory that I'm not mentally equipped to solve (a simulation sketch follows the two questions):

      • If you have an N-box cluster, with each box having a certain Mean Time Between Failures (MTBF) rating, and you randomly choose 1 of them to be your Master; versus a single box with the same MTBF rating; which is more likely to result in an incomplete computation?
      • With N boxes, the odds of a failure in any given time period are N times higher than with a single box, but the odds of the failure being the Master are 1/N. N/N = 1. Without failure detection and automated hand-off, doesn't that mean that the probabilistic reliability of the two is equivalent?
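
      A quick Monte Carlo sketch of that question (all the numbers are made up; the model assumes worker failures are survivable, but a Master failure, like any single-box failure, aborts the computation):

        use strict;
        use warnings;

        my $p      = 0.01;        # illustrative per-box failure odds per period
        my $n      = 16;          # boxes in the cluster
        my $trials = 1_000_000;

        my ( $any_fail, $master_fail, $single_fail ) = ( 0, 0, 0 );

        for ( 1 .. $trials ) {
            my @failed = map { rand() < $p } 1 .. $n;
            my $master = int rand $n;                 # randomly chosen Master box

            $any_fail++    if grep { $_ } @failed;    # some box died: ~ 1-(1-p)**$n
            $master_fail++ if $failed[ $master ];     # the Master died: job aborts
            $single_fail++ if rand() < $p;            # the lone single box died
        }

        printf "any box: %.4f  master: %.4f  single: %.4f\n",
            map { $_ / $trials } $any_fail, $master_fail, $single_fail;

      The master and single-box figures both converge on p: without failure detection and automated hand-off, the two configurations abort with the same probability, even though the cluster sees roughly N times as many failures overall.
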
    • In the clustered machines scenario, each machine can only run tasks scheduled to it.

      Between jobs, it still draws power. In Google's terms, an idle machine is a direct cost to the balance sheet.

      Equally, the longest path through a cluster for a computation is the greatest sum of the task times assigned to any one processor. I.e. when a processor/box has finished its assigned tasks, it becomes idle.

      Conversely, in an SMP-architected multi-core system, the longest path is the total sum of the task times divided by the number of cpus/cores. That is, each task within a computation can be time-sliced across all the processors. As the smaller tasks in the computation complete, the longer-running tasks get more time-slices on more cpus. The overall throughput is higher (a toy calculation follows below).

      The energy savings from avoiding idling processors in the cluster-in-a-box (CIAB) scenario are only achieved if there is a steady flow of computations to utilise it. But even when idling, the CIAB has 1/N the standing power draw of an N-way cluster, due to losses in the power supply, standby power consumption, etc.

      Of course, reliability may be more important than power consumption for the vast majority of small-cluster users. To that end, a two-way cluster of 4-core machines may be preferable to a single 8-way machine. That still doesn't address the Master/load-balancer software SPOF, but that can be addressed.
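
      A toy illustration of the two longest-path calculations above (the task times and their assignment to boxes are invented):

        use strict;
        use warnings;
        use List::Util qw( max sum );

        # Task durations (hours), pre-assigned one list per box in the cluster.
        my @assigned = ( [ 5, 3 ], [ 8 ], [ 2, 2 ], [ 1 ] );

        # Cluster: each box can only run its own list; the slowest box gates the job.
        my $cluster = max map { sum @$_ } @assigned;              # 8 hours

        # SMP/CIAB: any core can time-slice any task, so the ideal path is total/N.
        my $smp = sum( map { sum @$_ } @assigned ) / @assigned;   # 21/4 = 5.25 hours

        print "cluster: $cluster  smp: $smp\n";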

  2. Google wants (and looks likely to get) multiple, simple cores per CPU, rather than single, complex cores--but this is for energy cost saving rather than any inherent need or desire to use multi-threading in software.

    In Google's MapReduce model, built around the strengths and limitations of current hardware with its read-map-reduce-write chains, there is little reason for them to look at threading. There is little apparent scope for improving throughput by using shared memory.

    Upon closer inspection, the bit between the Map and the Reduce steps, which is barely described in much of the MapReduce literature:

    3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

    4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

    proves to be quite interesting. There is a set of slides showing that there is another step between the Map and Reduce tasks, called 'shuffle', which barely rates a mention elsewhere--presumably because it is part of the infrastructure of the system rather than part of the user's view of it.

    In this, the key/value pairs emitted by the discrete Map tasks are buffered in memory and then written to local disk storage (the 'locality optimisation'). What isn't clearly described is that this step partitions the outputs from each Map task, and that before the Reduce tasks get their hands on those partitions, the output from the many Map tasks has to be collated, aggregated and sorted. The single key/value pairs emitted by the many Map steps have to be aggregated into key/[values] (lists of values associated with each key). As the key/value pairs can originate from any of the Map tasks (and there is generally a many-to-one relationship between Map tasks and Reduce tasks), this is done by distributing the output from each Map task across a set of local files. As, generally, more than one key is appended to each file, these need to be sorted prior to being processed by the Reduce tasks.

    A close look at the stats & graphs on slides 11 through 15 shows that there is no overlap between Map tasks and Reduce tasks--they run serially--although there is overlap between Map & shuffle tasks, and between shuffle and Reduce tasks.

    It seems clear (to me) that the partitioning/writing portion of the shuffle is performed by the postamble of the Map tasks; and the reading/sorting is performed by the preamble of the Reduce tasks. Even with an insertion sort, the data is not sorted until the last value is available.

    The "current hardware" limitation comes in because the dataset size for the intermediate files is normally bigger than 32-bit hardware memory limits. This therefore requires the use of external sorts & separate sort tasks. Each adding (at least one, often more) additional read-write disk cycles to the chain.

    Now imagine the removal of this process memory limitation, by the expedient of moving to 64-bit processors. Instead of aggregating key/value pairs by writing each pair to disk files, you simply store them directly into a shared-memory HoA (hash of arrays). As each key/value pair is emitted by a Map thread, it is partitioned by pushing its value directly into the array associated with its key. In Perl terms, that means that EmitIntermediate( $key, $value ) simply becomes:

    my %intermediate;    # in-memory HoA of intermediate results

    sub EmitIntermediate {
        my( $key, $value ) = @_;
        push @{ $intermediate{ $key } }, $value;    # partition by key
        return;
    }

    The sort step becomes an in-place, in-memory affair, and the iterator supplied to the Reduce thread is simply an integer index through the sorted array.
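
    Continuing the hypothetical sketch above (Reduce() stands for the user-supplied function), the whole shuffle/sort step collapses to something like:

      # In-memory 'shuffle': sort the keys in place, then hand each key
      # and its complete list of values to the user's unchanged Reduce
      # function. The 'iterator' is just an integer index over that array.
      for my $key ( sort keys %intermediate ) {
          Reduce( $key, $intermediate{ $key } );
      }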

    Along the way, you've increased the benefit of the 'locality optimisation' by using local memory instead of local disk. You've avoided the need for another local read/write cycle for external sorting. And you've avoided the need for a remote file read by the Reduce task. You've also removed the need for the Master to coordinate the intermediate value locations between the Map and Reduce tasks and machines. Not to mention the gazillions of IO wait states and associated task swaps.

    Most importantly, the user view of the MapReduce programming model--the Map & Reduce functions--has not changed at all. They are still small (in lines of code), linear (no additional complexity), and single-threaded.

    • Will Google go multi-threaded?

      Your guess is as good as mine, probably better, but I think the strength of their MapReduce model is the transparency of the underlying implementation to the user view of it. That it lends itself to user-transparent threading is not in question. Whether future hardware and OSs will make the increase in complexity of the infrastructure code worth the performance improvements that could be derived is probably only answerable by testing on future hardware when it's available.

    • How will that affect the development costs -v- running costs -v- performance -v- reliability four-way balance?

      Again, it's simply too early to call.

    • How applicable are Google's needs to the rest of the world?

      Traditional (it seems strange to use that word in relation to internet technologies; time rolls on :), stateless, connectionless HTTP serving doesn't (much) benefit from threaded servers. After the initial connection from a client, subsequent requests from that client will usually be serviced by other threads. Web applications still have to be written as a series of states (phases), which coordinate statefulness through a (3rd-party process) DB keyed upon cookies/hidden fields, with all the overhead and hysteresis that implies. The threads have to persist for long periods, and that introduces the possibility of resource leaks over time.

      Another recent innovation being used, promoted and exploited by Google is the so-called Web 2.0 interface, otherwise known as AJAX. In this, connections persist and are re-used over the duration of one client's connection to the server. In this model, threading (used correctly) could be a definite boon. If threads are light enough (which discounts the iThreads model), you spawn a thread when a client connects, and each application can become a single, linear flow with persistent connections and intermediate state held by the thread. With a clustered, load-balancing server approach this doesn't work well, because each thread can only run on the box to which the initial connection was routed.

      If all the processors live in the same box, then the thread can simply lie dormant (in an IO-wait state) in the shared (SMP) scheduler queue until a subsequent request arrives for it, when it is woken up, services the request and goes back to sleep. With a suitable timeout on the connection to detect clients that just "go away", the thread can persist any unsaved state to a DB and die, cleaning up its non-shared resources as it goes.
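
      A minimal sketch of that thread-per-client model, using Perl's core threads and IO::Socket::INET (handle_request() and save_state() are hypothetical application hooks, and a real server would add the read timeout described above):

        use threads;
        use IO::Socket::INET;

        my $server = IO::Socket::INET->new(
            LocalPort => 8080,
            Listen    => 10,
            Reuse     => 1,
        ) or die "listen: $!";

        while ( my $client = $server->accept ) {
            threads->create( sub {
                my %state;    # per-client session state, held in lexicals
                # Blocks (IO-wait) between requests from this client.
                while ( my $request = <$client> ) {
                    print {$client} handle_request( $request, \%state );
                }
                save_state( \%state );    # persist unsaved state, then die
                close $client;            # non-shared resources clean up here
            } )->detach;
        }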

      It also introduces the possibility of having supervisor threads that monitor and coordinate between clients, allowing client-to-client communications (like the CB) to avoid the need for disk-bound persistence of transient state.

      It also allows for the sharing of post-spawn data between client threads. For example, commonly used images can be cached by the process and used by all client threads. For this to work in the fork model, all common images must be pre-loaded by the HTTP daemon and then shared (via COW). A shared-memory cache can be utilised from all threads, and can dynamically both grow and change to reflect demand. The same applies to other data, including code: an LRU cache of classes in shared memory can be utilised by all threads and adjust dynamically to demand.
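
      A minimal sketch of such a dynamic shared cache, using the core threads::shared module (the eviction policy is omitted for brevity):

        use threads;
        use threads::shared;

        my %image_cache :shared;    # one copy, visible to every client thread

        sub get_image {
            my( $path ) = @_;
            {
                lock %image_cache;    # guard the test against races
                return $image_cache{ $path } if exists $image_cache{ $path };
            }
            my $data = do {           # cache miss: slurp the file, outside the lock
                open my $fh, '<:raw', $path or die "open $path: $!";
                local $/;
                <$fh>;
            };
            lock %image_cache;
            $image_cache{ $path } = $data;    # the cache grows with demand
            return $data;
        }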

  3. Due to its inherently side-effectful nature, Perl is not the right language for multi-threaded programming.

    This is the Functional Programming view of the world, where variables are not variable, and doing anything that involves the world, including accepting read-only information from it--e.g. looking at it--requires that you employ a proxy to interact with it on your behalf, and that you treat the world after your interaction as a completely new and different world from the one you started with.

    Hmm. Is that a little jaundiced? The point here is that there are plenty of other languages that do not eschew the idea of mutable variables and that have successfully integrated parallelism using threads, be they kernel or user-space threads. Their secret appears to be recognising that whilst some uses of threading involve concurrent operations on shared data, many others do not.

    Many uses of threads involve only minimal interaction with shared data, usually at the beginning and end, or once per cycle around a loop. For some other uses, threads are a convenient way of avoiding having to explicitly remember where you've got to in a piece of code whilst you do something else. Much as the stack acts as a repository for a loop counter and intermediate results in a recursive algorithm, the local (lexical) variables already used in most Perl code to hold the current set of working parameters (state) serve the dual purpose of retaining that state across wait states and context switches whilst other threads get on with other stuff.

    Threads are no different from processes in this respect. You allow the OS to suspend operations of one context of execution and get on with another whilst the first waits for something that takes a long time. The advantage that threads bring to the table is that once the wait is done, the thread can communicate its results back to the spawning thread, or to other peer threads, in a simple, timely manner.

    This communication, in common with all asynchronous communications, has its overheads and disciplines. In the case of shared memory, that means using mutexes or semaphores to prevent race conditions; with pipes and sockets you have handshaking; if you use the file system, it's file and/or record locking. All else being equal, communication via shared memory involves the lowest overhead, the highest performance, the fewest primitives and the simplest code.
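
    For the common case of touching shared data only at the beginning and end, the core Thread::Queue module already packages up the necessary locking. A minimal sketch (process() stands in for the real work):

      use threads;
      use Thread::Queue;

      sub process { $_[0] * 2 }    # stands in for the real, application work

      my $work    = Thread::Queue->new;
      my $results = Thread::Queue->new;

      # Each worker runs plain, linear, single-threaded code; shared data
      # is touched only via the two queues, which do their own locking.
      my @workers = map {
          threads->create( sub {
              while ( defined( my $item = $work->dequeue ) ) {
                  $results->enqueue( process( $item ) );
              }
          } );
      } 1 .. 4;

      $work->enqueue( $_ ) for 1 .. 100;
      $work->enqueue( undef ) for @workers;    # one terminator per worker
      $_->join for @workers;

      # results arrive back in the spawning thread, no handshaking needed
      while ( defined( my $result = $results->dequeue_nb ) ) {
          print "$result\n";
      }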

  4. The declarative nature of SQL, and the tabular nature of relational data, means that RDBMSs are moving into the world of parallel query execution.

    SQL certainly appears to lend itself to being threaded. However, whilst the declarative nature of the syntax suggests that parallelisation is simple, the hidden semantics of foreign keys and dependency checking, dynamically maintained indexes, and multi-user access mean that it is far from trivial.

    In addition, for the application programmer looking to increase the performance/throughput of their application, the overheads of communicating with a DBMS, combined with the uncertainty and variability of response times due to the unknown demands of other applications upon the DB server, make the benefits of this parallelisation less readily realisable than might first appear.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^11: Parrot, threads & fears for the future.
by tilly (Archbishop) on Nov 08, 2006 at 04:07 UTC
    I had missed this continuation of the thread. (No pun intended.)

    About #1, there is no contest. There is a lot of literature about how to set up websites with no single point of failure. For instance, you have a pair of load balancers configured for failover: if the primary goes down, the secondary takes over. Having no single point of failure is more reliable.

    Which brings us to Google's map reduce. Suppose you have a job that will run on 2000 machines and takes 5 hours. That's over a year of machine time - the odds that some machine will fail during that job are pretty high. But the odds that the master will fail are very low. And if it does, so what? Google can just re-run the job. It is available after 10 hours, not 5. No big deal. Google is right not to worry.

    This is very different from an ecommerce site. Suppose that you've got a business that does $10 million of business a year. If your website is down for an hour, you've just lost about $1000 of business. However, traffic varies. Depending on which hour you lost, you're really likely to be out anywhere from $50 to $20,000. Murphy being Murphy (and in this case Murphy gets a lot of assistance from the fact that flaky hardware tends to fold under load), you're more likely to be out $20,000 than $50. And if you have a single server you're depending on, odds are that you can't produce, configure, and install a replacement in just one hour. So your outage is going to cost a lot more than that.

    The result is that your reliability needs depend on what you're doing. Google can't afford to depend on every machine during a big computation, but it can afford to depend on one. An ecommerce site doesn't want to depend on only one machine ever. (Unless that machine is bulletproof.)

    And a final huge win with the cluster. If you have a website on a cluster, it is easy to upgrade. Pick a quiet time, take half your webservers out of the rotation, upgrade their code, restart them, swap your upgraded servers with your non-upgraded servers, upgrade the other half, restart them, bring them back online. Voila, an upgrade done without taking your website offline! If you have a single machine you can't do this. Restarting a webserver is fairly slow, particularly if you cache stuff in RAM at startup. (Highly recommended.) Having your weekly upgrade not involve an outage is always a win.

    OK, let's move on to #2. A big factor that I think you're missing is that keeping RAM on takes electricity. It probably isn't cost effective for Google to make their reports run faster at the cost of installing that much RAM. You're right that they could do that, but it doesn't make sense for them. However I'm sure it will for others - for instance biotech comes to mind.

    And when you talk about AJAX, you've made some big assumptions that are mostly wrong (at least in today's world). Any thread of execution that is doing dynamic stuff takes up lots of resources, be it memory, database handles, or whatever. As a result, standard high-performance architectures go to lengths to make the heavyweight dynamic stuff move as fast as possible from client to client. (E.g. they use reverse proxies so that people on slow modems don't tie up a valuable process.)

    Onto #3. I disagree about Perl's main failing. Perl's main failing here is not that Perl doesn't recognize that sometimes you want to be concurrent and sometimes not; it is that there are a lot of operations in Perl that have internal side effects that you wouldn't expect. For instance my $foo = 4; print $foo; will update $foo when you do the print. Why? Because Perl stringifies the variable, upgrades the scalar to say it can be either a number or a string, then stores the string. There is so much of this kind of stuff going on behind your back in Perl that it is unreasonable to expect programmers to realize how much they need locks. And attempts to provide the necessary locks behind the programmer's back turned out to be a disaster. (That's why the ithread model was created.)
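
    You can watch that hidden write happen with the core Devel::Peek module:

      use Devel::Peek;

      my $foo = 4;
      Dump $foo;     # FLAGS = (IOK,pIOK): a plain integer, no string slot
      print $foo;
      Dump $foo;     # FLAGS now include POK, and a PV slot holds "4": the
                     # stringified value was written back into $foo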

    Perl's duck typing is the problem here. A language like C++ is better for threading not because it is easier to write code whose semantics involve no side effects, but because it is easier in C++ to inspect code and figure out when there will be potential race conditions to worry about. (I'm not saying that C++ is great for writing threaded code, just that it is better than Perl.)

    About #4, I wouldn't worry about the practical difficulties. I'm not saying by that that there aren't difficulties - there are. But the database vendors know what they are and are doing their best to produce solutions. (Incidentally I've heard, and believe, that the database that does the best job of running on clusters is actually MySQL. Yes, there is something at which MySQL is technically better than the big boys!)

    For the application programmer, it really depends on what your application is. I agree that Google can't just apply the relational database secret sauce and wave their problems goodbye. However for ecommerce, using a database makes a lot of sense.

    For ecommerce your priorities are remaining up, response time, and throughput. The economics of the situation say that as long as you have sufficiently good response time and throughput, the goal you really need to maximize is uptime. So that is the goal.

    Here is a standard architecture. You have dual load balancers (set up for failover), talking to a cluster of machines (with networking set up so that everything fails over smoothly - there are no single points of failure here) and then those machines talk to a relational database. If you're big then you replicate this setup in multiple colocations so that you'll remain up even if a bomb goes off. Congratulations! Using off the shelf solutions, you've now reduced your single points of failure to one (the database) without your developers needing to do anything! Now you have to bulletproof your database, and that's it.

    But it gets better. Database vendors are painfully aware that they tend to be a single point of failure, and if you're willing to pay there are plenty of high availability solutions for databases. (Again using mirroring, instant failover etc. Bonus, in some configurations the standby databases can be queried. There is an interruption in service, but it is under a second and only affects the pages that are currently being served.)

    The result is that you can pretty much eliminate hardware as a cause of uptime failures by using a standard architecture which involves clusters and relational databases. There, unfortunately, are plenty of other potential causes of uptime failures. But you've gotten rid of a big one.

      1. There is a lot of literature about how to set up websites with no single point of failure. For instance, you have a pair of load balancers configured for failover: if the primary goes down, the secondary takes over. Having no single point of failure is more reliable.

        Remember that my proposition was that these are tomorrow's commodity machines we're talking about. So, rather than today's 16-way cluster to ensure peak-time bandwidth, we only need two, but the cost of these machines is the same, so throw in a third for posterity.

        For your ecommerce site requirements: instead of running a 16-way cluster of 2-core machines, you run a 2-way cluster of 8-core machines. Your site will have the same headroom in processor power to deal with your peaks. You also have fail-over redundancy.

        Which brings us to Google's map reduce. Suppose you have a job that will run on 2000 machines and takes 5 hours.

        See below.

      2. A big factor that I think you're missing is that keeping RAM on takes electricity.

        I don't believe I have missed that. Memory-mapped files do not (have to, or usually) exist entirely in RAM; they can be, and are, paged on and off disk, often through relatively small windows. The benefit of the 64-bit address space is the simplicity of the mapping, which gets messy when mapping a 2 or 4GB address space over one (or more) >4GB files.

        • RAM: 2000 * 4GB -v- 250 * 40 GB.
        • Disks: 4000 * 160 GB -v- 250 * 320 GB.
        • CPUs: 2000 * 2-core -v- 250 * 8-core.
        • Transformers: 2000 -v- 250.
        • Etc.

        RAM uses less power than disks; a 1/16 (or greater) reduction in the number of disks and a 1/8 reduction in most other components means that even if you keep the same volume of RAM but distribute it into 1/8th as many machines, you're already saving energy. And the power draw of each generation of RAM chips/modules has either remained static or fallen, whilst the capacity has quadrupled or more with each generation.

        Keeping intermediate results in RAM, and performing previously serial processes on them, also means that you are able to more fully utilise the time and the processors.

        In your example above: 1 job at 2000 machines * 5 hours = 10,000 machine-hours of processing. The same job takes 250 machines * 2 hours = 500 machine-hours. 10,000 : 500 == a 20:1 saving in machine-time to process, but that's not (just) an efficiency saving. It's also a 20:1 energy saving, as you don't have to (continually) run 2000 machines to process the job in a timely manner. You also have quiet-time upgrade ability.

        And when you talk about AJAX, you've made some big assumptions that are mostly wrong (at least in today's world). Any thread of execution that is doing dynamic stuff takes up lots of resources. Be it memory, database handles, or whatever. As a result standard high performance architectures go to lengths to make the heavyweight dynamic stuff move as fast as possible from client to client.

        I think you're wrong on this, but I do not know enough of current practice on AJAX sites to counter you.

      3. I disagree about Perl's main failing. Perl's main failing here is not that Perl doesn't recognize that sometimes you want to be concurrent and sometimes not, it is that there are a lot of operations in Perl that have internal side effects that you wouldn't expect to.

        I opened with "Due to its inherently side-effectful nature, Perl is not the right language for multi-threaded programming.", and I don't think I said anything later that contradicted that?

      4. I wouldn't worry about the practical difficulties.

        I really wasn't.

        My point was that without the "hidden semantics", the parallelisation of DB operations is a natural big win; but with those semantics, it's much less so. Whilst DB operations remain an out-of-box, cross-network, many-to-one communications affair, the communications overheads remain constant and contention dominates.

        Once you have the processor(s), address space and memory to move the DB into the same box as the application programs, the communications overheads disappear. Instead of the DBMS process sharing its limited address space between the demands of multiple callers, the common code runs in the address space of the caller.

        With a 64-bit address space, instead of storing your tables and indexes in myriad separate files, each database becomes a single huge file, mapped into memory on demand and cached there. Disk/file locks become memory locks.

        Think of it like this: just as virtual addressing is used to provide swapping for processes now, so it gets used to provide shared-memory access to your databases. All the logic remains the same--locking, caching, the works. The difference is that it all happens at RAM speed rather than disk speed; you lose all the communications overhead and the serialisation of results through (relatively) low-speed, high-latency sockets, and the need for DB handles simply disappears.
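
        A minimal sketch of the idea, using the CPAN File::Map module (the file name and fixed-record layout are invented; a real DBMS would manage pages and locks itself):

          use File::Map qw( map_file );

          # Map the whole database file into the address space. Pages fault
          # in from disk on demand and are cached by the OS; nothing is read
          # until it is touched, and dirty pages are written back lazily.
          map_file my $db, 'customers.db', '+<';

          my $reclen = 128;              # invented fixed-size records
          my $offset = 42 * $reclen;     # record number 42

          # Reads and writes are now plain substr() operations at RAM speed.
          my $record = substr $db, $offset, $reclen;
          substr( $db, $offset, $reclen ) = 'X' x $reclen;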


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.