Re^11: Parrot, threads & fears for the future. by tilly (Archbishop)
on Nov 08, 2006 at 04:07 UTC
I had missed this continuation of the thread. (No pun intended.)
About #1, there is no contest. There is a lot of literature about how to set up websites with no single points of failure. For instance you have a pair of load balancers configured for failover. If the primary goes down, the secondary takes over. No single points of failure is more reliable.
Which brings us to Google's MapReduce. Suppose you have a job that will run on 2000 machines and takes 5 hours. That's over a year of machine time - the odds that some machine will fail during that job are pretty high. But the odds that the master will fail are very low. And if it does, so what? Google can just re-run the job. The result is available after at most 10 hours instead of 5. No big deal. Google is right to not worry.
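To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 2000 machines and the 5 hours come from the example above. The 3-year mean time between failures is purely an assumed figure for illustration, and so is the model of independent, exponentially distributed failures.

```python
import math

MACHINES = 2000                    # from the example above
JOB_HOURS = 5                      # from the example above
HOURS_PER_YEAR = 24 * 365
MTBF_HOURS = 3 * HOURS_PER_YEAR    # ASSUMPTION: each machine fails about once every 3 years

# Total machine time consumed by one run of the job.
machine_hours = MACHINES * JOB_HOURS
machine_years = machine_hours / HOURS_PER_YEAR


def p_any_failure(machine_count, hours, mtbf_hours=MTBF_HOURS):
    """Probability that at least one of machine_count machines fails
    within the given number of hours (independent exponential failures)."""
    return 1.0 - math.exp(-machine_count * hours / mtbf_hours)


p_some_worker = p_any_failure(MACHINES, JOB_HOURS)  # any one of the 2000
p_master = p_any_failure(1, JOB_HOURS)              # the single master

print(f"machine time: {machine_hours} hours ({machine_years:.2f} years)")
print(f"P(some worker fails during the job): {p_some_worker:.2%}")
print(f"P(master fails during the job):      {p_master:.4%}")
```

Under these assumed numbers a worker failure somewhere in the job is an everyday event, while a master failure is rare enough that re-running the job is a reasonable answer.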
This is very different from an ecommerce site. Suppose that you've got a business that does $10 million of business a year. If your website is down for an hour, you've just lost about $1,000 of business on average. However traffic varies. Depending on which hour you lost, you're really likely to be out anywhere from $50 to $20,000. Murphy being Murphy (and in this case Murphy gets a lot of assistance from the fact that flaky hardware tends to fold under load), you're more likely to be out $20,000 than $50. And if you have a single server you're depending on, odds are that you can't produce, configure, and install a replacement in just one hour. So your outage is going to cost a lot more than that.
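The arithmetic behind those figures is simple enough to sketch. The $10 million a year comes from the example above. The `traffic_multiplier` parameter is a hypothetical knob standing in for "which hour you lost", not a measured value.

```python
ANNUAL_REVENUE = 10_000_000   # dollars per year, from the example above
HOURS_PER_YEAR = 24 * 365

# Revenue per hour if traffic were perfectly flat.
average_per_hour = ANNUAL_REVENUE / HOURS_PER_YEAR


def outage_cost(hours_down, traffic_multiplier=1.0):
    """Estimated lost revenue for an outage.

    traffic_multiplier is a hypothetical factor: below 1.0 for a quiet
    hour, well above 1.0 for a peak hour.
    """
    return average_per_hour * hours_down * traffic_multiplier


# Multipliers implied by the $50 and $20,000 extremes quoted above.
quiet_multiplier = 50 / average_per_hour
peak_multiplier = 20_000 / average_per_hour

print(f"average hour: ${outage_cost(1):,.0f}")
print(f"quiet hour is about {quiet_multiplier:.2f}x average, "
      f"peak hour about {peak_multiplier:.1f}x average")
```

The spread between the quiet and peak multipliers is the point: an average-hour estimate badly understates what a failure under load costs.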
The result is that your reliability needs depend on what you're doing. Google can't afford to depend on every machine during a big computation, but it can afford to depend on one. An ecommerce site doesn't want to depend on only one machine ever. (Unless that machine is bulletproof.)
And a final huge win with the cluster. If you have a website on a cluster, it is easy to upgrade. Pick a quiet time, take half your webservers out of the rotation, upgrade their code, restart them, swap your upgraded servers with your non-upgraded servers, upgrade the other half, restart them, bring them back online. Voila, an upgrade done without taking your website offline! If you have a single machine you can't do this. Restarting a webserver is fairly slow, particularly if you cache stuff in RAM at startup. (Highly recommended.) Having your weekly upgrade not involve an outage is always a win.
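The two-halves upgrade described above can be sketched as a small simulation. Everything here (the `Server` class, the `upgrade` stand-in, the server names) is hypothetical and only models the bookkeeping. A real deployment would drive an actual load balancer and restart real processes.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str
    in_rotation: bool = True


def upgrade(server, new_version):
    """Stand-in for deploying new code and restarting the webserver."""
    assert not server.in_rotation, "never upgrade a server that is taking traffic"
    server.version = new_version


def rolling_upgrade(servers, new_version):
    """Upgrade the pool in two halves.

    Returns the smallest number of servers that were in rotation at any
    point, so the caller can see whether the site ever went dark.
    """
    half = len(servers) // 2
    first, second = servers[:half], servers[half:]

    def live():
        return sum(s.in_rotation for s in servers)

    # Take the first half out of rotation and upgrade it.
    for s in first:
        s.in_rotation = False
    min_live = live()
    for s in first:
        upgrade(s, new_version)

    # Swap: upgraded servers go back in before the rest come out.
    for s in first:
        s.in_rotation = True
    for s in second:
        s.in_rotation = False
    min_live = min(min_live, live())
    for s in second:
        upgrade(s, new_version)

    # Bring the second half back online.
    for s in second:
        s.in_rotation = True
    return min_live


pool = [Server(f"web{i}", "v1") for i in range(4)]
min_live_cluster = rolling_upgrade(pool, "v2")

solo = [Server("only-box", "v1")]
min_live_single = rolling_upgrade(solo, "v2")

print(f"cluster of 4: at least {min_live_cluster} servers live throughout")
print(f"single machine: at least {min_live_single} servers live throughout")
```

With four servers the pool never drops below two live machines. With one server the count hits zero, which is the outage the cluster lets you avoid.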
OK, let's move on to #2. A big factor that I think you're missing is that keeping RAM on takes electricity. It probably isn't cost effective for Google to make their reports run faster at the cost of installing that much RAM. You're right that they could do that, but it doesn't make sense for them. However I'm sure it will for others - for instance biotech comes to mind.
And when you talk about AJAX, you've made some big assumptions that are mostly wrong (at least in today's world). Any thread of execution that is doing dynamic stuff takes up lots of resources, be it memory, database handles, or whatever. As a result standard high performance architectures go to lengths to make the heavyweight dynamic stuff move as fast as possible from client to client. (For example, they use reverse proxies so that people on slow modems don't tie up a valuable process.)
On to #3. I disagree about Perl's main failing. Perl's main failing here is not that Perl doesn't recognize that sometimes you want to be concurrent and sometimes not, it is that there are a lot of operations in Perl that have internal side effects that you wouldn't expect. For instance my $foo = 4; print $foo; will update $foo when you do the print. Why? Because Perl stringifies the variable, upgrades the scalar to say it can be either a number or a string, then stores the string. There is so much of this kind of stuff going on behind your back in Perl that it is unreasonable to expect programmers to realize how much they need locks. And attempts to provide the necessary locks behind the programmer's back turned out to be a disaster. (That's why the ithread model was created.)
Perl's duck typing is the problem here. A language like C++ is better for threading not because it is easier to write code whose semantics involve no side effects, but because it is easier in C++ to inspect code and figure out when there will be potential race conditions to worry about. (I'm not saying that C++ is great for writing threaded code, just that it is better than Perl.)
About #4, I wouldn't worry about the practical difficulties. By that I'm not saying that there aren't difficulties - there are. But the database vendors know what they are and are doing their best to produce solutions. (Incidentally I've heard, and believe, that the database that does the best job of running on clusters is actually MySQL. Yes, there is something at which MySQL is technically better than the big boys!)
For the application programmer, it really depends on what your application is. I agree that Google can't just apply the relational database secret sauce and wave their problems goodbye. However for ecommerce, using a database makes a lot of sense.
For ecommerce your priorities are remaining up, response time, and throughput. The economics of the situation say that as long as you have sufficiently good response time and throughput, the goal you really need to maximize is uptime. So that is the goal.
Here is a standard architecture. You have dual load balancers (set up for failover), talking to a cluster of machines (with networking set up so that everything fails over smoothly - there are no single points of failure here) and then those machines talk to a relational database. If you're big then you replicate this setup in multiple colocations so that you'll remain up even if a bomb goes off. Congratulations! Using off the shelf solutions, you've now reduced your single points of failure to one (the database) without your developers needing to do anything! Now you have to bulletproof your database, and that's it.
But it gets better. Database vendors are painfully aware that they tend to be a single point of failure, and if you're willing to pay there are plenty of high availability solutions for databases. (Again using mirroring, instant failover, etc. Bonus, in some configurations the standby databases can be queried. When a failover happens there is an interruption in service, but it is under a second and only affects the pages that are currently being served.)
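A rough way to see why the database is the piece worth bulletproofing is to model the architecture described above as tiers of redundant machines. The tier counts and the 99% per-machine availability below are assumed purely for illustration, and the model also assumes independent failures and perfect failover, which real systems only approximate.

```python
MACHINE_AVAILABILITY = 0.99   # ASSUMPTION: each box is up 99% of the time


def tier_availability(redundant_units, unit_availability=MACHINE_AVAILABILITY):
    """A tier is up as long as at least one of its units is up."""
    return 1.0 - (1.0 - unit_availability) ** redundant_units


def site_availability(tiers):
    """The site is up only if every tier is up."""
    result = 1.0
    for units in tiers.values():
        result *= tier_availability(units)
    return result


# Hypothetical tier sizes for the standard architecture.
tiers = {"load balancer": 2, "web server": 8, "database": 1}

# Any tier with a single unit is a single point of failure.
single_points_of_failure = [name for name, units in tiers.items() if units < 2]

single_db = site_availability(tiers)
mirrored_db = site_availability({**tiers, "database": 2})

print("single points of failure:", single_points_of_failure)
print(f"availability with one database:      {single_db:.4%}")
print(f"availability with mirrored database: {mirrored_db:.4%}")
```

In this toy model the lone database accounts for nearly all of the expected downtime, and mirroring it removes most of what is left.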
The result is that you can pretty much eliminate hardware as a cause of downtime by using a standard architecture which involves clusters and relational databases. Unfortunately, there are plenty of other potential causes of downtime. But you've gotten rid of a big one.