http://www.perlmonks.org?node_id=936804


in reply to Re^4: Multithreaded (or similar) access to a complex data structure
in thread Multithreaded (or similar) access to a complex data structure

Well, you've certainly reminded me why I hate and fear multithreading.

The problems I describe are not confined to "multi-threading"; they affect multi-processing of all forms.

For example, the typical approach to multiprocessing web applications is to use a pre-forking server to run the code and an RDBMS to store the data. The oft-misunderstood 'benefit' of this approach is that, as forks don't share state, they don't need locking. But this completely misses that all that has happened is that the shared state has been moved into the DB, which has to do the locking for you, and so suffers from all the same problems of exponential lock contention as the number of clients trying to access the same dataset increases.
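To make that concrete, here is a minimal sketch in Python rather than Perl. The assumptions are mine: SQLite stands in for the RDBMS, two connections stand in for two forked workers, and the names `second_writer_demo`, `worker_a`, `worker_b` and the `counter` table are made up for the illustration. The workers share no state in the application code, yet the second writer is still refused while the first holds the write lock.

```python
import os
import sqlite3
import tempfile


def second_writer_demo():
    """Return (blocked, final_count) for two 'workers' sharing one table."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.db")
        # timeout=0: fail at once instead of waiting for the lock.
        # isolation_level=None: we issue BEGIN/COMMIT ourselves.
        worker_a = sqlite3.connect(path, timeout=0, isolation_level=None)
        worker_b = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            worker_a.execute("CREATE TABLE counter (n INTEGER)")
            worker_a.execute("INSERT INTO counter VALUES (0)")

            # Worker A opens a write transaction: the DB takes the lock.
            worker_a.execute("BEGIN IMMEDIATE")
            worker_a.execute("UPDATE counter SET n = n + 1")

            # Worker B shares nothing with A in our code, but the DB
            # still refuses it a write transaction while A holds the lock.
            try:
                worker_b.execute("BEGIN IMMEDIATE")
                worker_b.execute("ROLLBACK")
                blocked = False
            except sqlite3.OperationalError:
                blocked = True

            worker_a.execute("COMMIT")

            # Once A has committed, B can proceed.
            worker_b.execute("BEGIN IMMEDIATE")
            worker_b.execute("UPDATE counter SET n = n + 1")
            worker_b.execute("COMMIT")

            final_count = worker_a.execute(
                "SELECT n FROM counter").fetchone()[0]
        finally:
            worker_a.close()
            worker_b.close()
    return blocked, final_count
```

Calling `second_writer_demo()` reports whether the second writer was turned away, along with the final counter value. A server RDBMS would typically queue the second writer rather than refuse it, and would lock at a finer granularity, but the waiting is still there; it has just moved out of your code.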

It is exactly this problem, data access through a central "database manager" failing to scale to hyper-scale web applications, that is the driving force behind the move away from RDBMSs in favour of the whole raft of distributed data stores broadly categorised under the title NoSQL. Hence you get Google's BigTable, CouchDB, MongoDB, Terrastore etc.

Back in the days when the biggest distributed apps were banks and credit cards, with a few tens of thousands of clients making a few million data accesses per day, routing all those accesses through a central DBM worked. It required BigIron, highly structured and indexed data, and very few, very well-defined queries, but it worked and worked well.

Then suddenly you get hyper-scale web applications where you have millions of concurrent clients and billions of transactions every day, asking a myriad of free-form queries against huge and broadly unstructured datasets. Then, having all your clients talking to one central DBM managing one huge data store is not just hugely expensive, it is quite simply impossible. BigIron cannot get that big. And so 'the cloud' was born.

The only way forward is to distribute your dataset. Note that distribute is not the same as replicate. Subset the dataset into manageable chunks and have different processors (or clusters of processors) managing those discrete chunks. But then, your clients can no longer talk directly to a single DBM because each DBM only has access to a small subset of the overall data. Instead, clients talk to lightweight front-ends that know enough to be able to break up the inbound query into sub-queries, which they route to the distributed DBMs as required. They then gather the various responses from those back-end DBMs and collate the results before finally wrapping them in the presentation layer and sending the reply back to the client.
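The front-end arrangement described above can be sketched roughly as follows, again in Python. Everything here is hypothetical and in-process: `Shard` stands in for one back-end DBM, `FrontEnd` for the lightweight router, and a CRC32 of the key decides which shard owns a row. A real system would put a network between the two and would need to cope with shards that fail or fall behind.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Stand-in for one back-end DBM: it sees only its own subset."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def query(self, predicate):
        # Sub-query: runs against this shard's rows only.
        return [(k, v) for k, v in self.rows.items() if predicate(k, v)]


class FrontEnd:
    """Lightweight router: knows which shard owns which key."""

    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def _shard_for(self, key):
        # Stable hash, so a key always lands on the same shard.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def put(self, key, value):
        # Distribute, don't replicate: each row lives on one shard.
        self._shard_for(key).put(key, value)

    def get(self, key):
        # A single-key lookup touches exactly one back end.
        return self._shard_for(key).rows.get(key)

    def query(self, predicate):
        # Scatter the sub-query to every shard, then gather and collate.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = pool.map(lambda s: s.query(predicate), self.shards)
            return sorted(row for part in partials for row in part)
```

For example, after `fe = FrontEnd(4)` and a run of `fe.put(...)` calls, `fe.get(key)` goes to one shard only, while `fe.query(lambda k, v: v > 10)` fans out to all four and merges the partial results. Single-key traffic no longer contends on one central lock; only the free-form queries pay the cost of visiting every shard.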

It requires new architectures and new thinking, but the result is that applications can be scaled by expanding width-ways -- adding more cheap, commodity boxes at the front or back as required -- rather than having to buy bigger and bigger individual boxes at both ends as used to be the case.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.