(Fwiw, in my original comment I meant that a problem that is as you describe is really IO bound.)
(I had some trouble parsing that sentence; I hope I got you right.)
The problems I describe, where concurrency does help IO-bound problems, are "really IO bound".
- With multiple physical drives:
A simple filter -- read, modify, write (e.g. grep) -- can be reading the next record from the source whilst the previous write to the destination completes. Runtime can be more than halved.
- Remote drives:
By overlapping the latencies in the reads and writes, throughput can be increased substantially.
The write time for SSDs is usually substantially greater than the read time. For tasks that write fewer records than they read -- e.g. line filters etc. -- overlapping the two can increase throughput.
- Temporal locality:
The point is that your statement implies that all IO bound processes cannot benefit from threading; where the truth is that only a few types cannot realise any benefit.
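The overlapped read/write pattern described above can be sketched as follows (Python stdlib threading; the function name and the choice of a bounded queue are mine, purely for illustration): a reader thread pulls the next record from the source while the main thread writes the previous one to the destination, so latency on the two devices overlaps.

```python
import threading
import queue

def overlapped_filter(src, dst, keep):
    """grep-like filter: a reader thread fills a bounded queue while
    the main thread filters and writes, overlapping read latency on
    one device with write latency on the other."""
    q = queue.Queue(maxsize=1024)          # bounded buffer between the two
    def reader():
        with open(src) as fh:
            for line in fh:
                q.put(line)                # blocks only if writer falls behind
        q.put(None)                        # sentinel: end of input
    t = threading.Thread(target=reader)
    t.start()
    with open(dst, "w") as out:
        while (line := q.get()) is not None:
            if keep(line):
                out.write(line)
    t.join()
```

With source and destination on separate physical drives, the reader can be fetching record N+1 while the writer is still flushing record N.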
I think you've misinterpreted The Problem with Threads. "Nondeterminism should be explicitly and judiciously introduced where needed, rather than removed where not needed."
Actually, no, I haven't. And your quote proves it.
The single biggest source of both bugs and design failures, and of unrealised performance, is synchronisation and locking. I.e. attempts to impose determinism.
People insist on trying to enforce the determinism that, when this thread has finished with this piece of data, that thread starts processing it. They'll use semaphores, or locks, or signals, or mutexes, and often some combination of them. And with them come all the nasties wrongly attributed to "threading": deadlocks, livelocks, and priority inversions; often transient, and always a nightmare to diagnose and fix.
And even when they get it right, all they've done is create a lock-stepped, and thus sequential, process. Almost, if not exactly, as if no threading had been done.
Except of course, they've carefully orchestrated that there be at least one, and often multiple, context swaps occurring between steps. And then they wonder why it's slowed everything down. By attempting to deterministically control the flow of data between threads, you're working against the system scheduler and guaranteeing that no thread ever uses its full timeslice. Indeed, many times you're actually forcing the scheduler to load a thread only to immediately remove it again because the other thread that you're trying to pass control to hasn't yet run.
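To make the lock-step problem concrete, here is a deliberately bad sketch (my own illustration, not anyone's real code) of two threads forced to hand control back and forth with events. Despite there being two threads, execution is strictly sequential, and every hand-off is a potential context switch:

```python
import threading

def lockstep_count(n):
    """Two threads forced into strict alternation via events.
    Each step blocks until the other thread hands control back,
    so the 'concurrent' code runs fully sequentially."""
    out = []
    a_turn, b_turn = threading.Event(), threading.Event()
    a_turn.set()                           # A goes first
    def worker(name, my_turn, their_turn):
        for i in range(n):
            my_turn.wait()                 # block until handed control
            my_turn.clear()
            out.append((name, i))          # the "work"
            their_turn.set()               # hand control to the other thread
    ta = threading.Thread(target=worker, args=("A", a_turn, b_turn))
    tb = threading.Thread(target=worker, args=("B", b_turn, a_turn))
    ta.start(); tb.start()
    ta.join(); tb.join()
    return out
```

The output is deterministic -- A, B, A, B, ... -- which is exactly the point: all the machinery of threading has been paid for, and a plain sequential loop would have done the same job faster.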
The archetypal example of this is the Global Interpreter Lock as seen in Python, Ruby et al.
Think of this like a hub airport would be if there were no buffer space -- no aprons or gates. Planes would have to circle until one of the connecting flights arrived; then both would land and exchange passengers; then both would take off and circle again until another connecting flight became available. A stupid, but surprisingly accurate, analogy.
And people forget -- and that includes academics, who tend to test and benchmark their algorithms in isolation from other workloads -- that the scheduler isn't just scheduling the threads of their process, but also the threads of every other process in the system. If you fail to utilise the full timeslice, the remainder doesn't automatically get transferred to another thread of your process; most times it is lost entirely, because (a thread of) another process, and often many other processes, get scheduled before your process gets another chance. And even when your process does get another slot, it may be the wrong thread, or even the same thread that just abandoned its last one.
To make best use of threading, you need to embrace non-determinism by eschewing locking and synchronisation and the conditional checks associated with them -- unless they are absolutely required, which, with appropriately considered algorithms, is very rare -- and allow things to free-run.
By providing low-overhead, flexibly sized buffers (queues) between sequential sections of processing, you allow the well-tried and proven system scheduler -- with its multiple classes, dynamic priorities, dynamic timeslices et al. -- to give timeslices to the threads that need them, in a timely manner.
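A minimal sketch of that queue-buffered, free-running pipeline (Python stdlib; in Perl5 terms the queues would be Thread::Queue objects -- the function names here are my own invention): each stage runs in its own thread at its own pace, and the only "synchronisation" is the queue itself blocking when empty or full. No explicit locks anywhere.

```python
import threading
import queue

def pipeline(items, stages, bufsize=64):
    """Chain worker threads with bounded queues between stages.
    Each thread free-runs at its own pace; the only coordination is
    the queue blocking its reader when empty and its writer when full."""
    qs = [queue.Queue(maxsize=bufsize) for _ in range(len(stages) + 1)]
    def run_stage(fn, qin, qout):
        while (item := qin.get()) is not None:
            qout.put(fn(item))
        qout.put(None)                     # propagate end-of-stream marker
    def feed():
        for item in items:
            qs[0].put(item)
        qs[0].put(None)
    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=run_stage, args=(fn, qs[i], qs[i + 1]))
                for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    results = []
    while (r := qs[-1].get()) is not None:
        results.append(r)
    for t in threads:
        t.join()
    return results
```

Note there is no attempt to control which thread runs when; the scheduler is free to run whichever stage has work and a timeslice, which is exactly the "free run" being argued for.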
Imagine the complexity of locking required by a shared hash with multiple reading and writing threads performing concurrent access. And imagine trying to test it; and then debug it.
Then, if you are really interested, go off and watch Cliff Click describe his lock-free hash table. Watch it all. See how he scales a hash table to be used concurrently from thousands of threads. See also the discussion of the bottlenecks with the standard Java ConcurrentHashMap.
Bottom line: Everything in that paper (The Problem with Threads) tackles the "problem" from the completely wrong direction. Had it not been dated 2006, I would have assumed it had been written circa 1995 or earlier, such is the level of dated thinking it employs.
I don't think your point is that far adrift from (Perl) "should not hide the existence of OS-level threads, or fail to provide access to lower level concurrency control constructs".
The sentence you quote -- "But equally, don't throw the baby out with the bath water. Flames are dangerous; but oh so very useful." -- is (in section 2 of my reply) aimed directly at the paper you cite.
However, your quote from the P6 spec. is also questionable if the resulting low-level constructs end up being anything like Perl5's.
Simply exposing the low-level system constructs -- especially if they are forced into being POSIX-compatible emulations over the underlying OS -- is arguably worse. You'd just be recreating all the bad bits of the Perl5 implementation.
And the addition of Futures as the only alternative will do little to make things better. See below.
Are you really saying that, when solving any concurrent problem correctly (data-wise), Futures are never (or seldom) simpler for the programmers writing, reading, and modifying the code than directly using the underlying low-level concurrency constructs?
No. I like Futures a lot. They are the best (cleanest; easiest to grasp) high-level threading construct I've yet seen. For a certain class of problems, a well-implemented Future -- and that means NOT the obvious, naive implementation -- can be a very simple and effective solution.
But I am saying that they move the problems somewhere else.
In effect, all (naive) Futures do is wrap over (hide) the mechanics of ThreadCreate() and ThreadJoin(). And all their problems still persist.
When a Future comes to be realised -- when the thread comes to be joined -- if that thread is not yet finished, it will block. And as soon as you have blocking, you have the potential for deadlocking, livelocking, and priority inversion. Except now the locking is hidden from, and inaccessible to, the programmer.
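To see how little a naive Future actually adds, here is a deliberately minimal one (my own illustration -- not Perl6's design, nor any real library's): construction is ThreadCreate, and result() is ThreadJoin wearing a nicer name. The blocking join is still there, just hidden inside the object.

```python
import threading

class NaiveFuture:
    """Illustrative naive Future: construction IS ThreadCreate(),
    result() IS ThreadJoin().  If the worker has not finished,
    result() blocks -- the join is merely hidden, not removed."""
    def __init__(self, fn, *args):
        self._value = None
        self._exc = None
        def run():
            try:
                self._value = fn(*args)
            except Exception as e:         # capture, re-raise at realisation
                self._exc = e
        self._thread = threading.Thread(target=run)
        self._thread.start()               # ThreadCreate, hidden
    def result(self):
        self._thread.join()                # ThreadJoin -- the hidden blocking point
        if self._exc is not None:
            raise self._exc
        return self._value
```

If the function passed to one NaiveFuture blocks waiting on the result of another, you have exactly the same deadlock potential as with raw create/join -- but now with no lock visible in the caller's code to reason about.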
And what's worse, the very simplicity of Futures is -- as happened with the synchronized keyword in Java -- going to encourage people to throw threads at everything without understanding or thinking through the consequences. The very act of providing a simple, high-level construct that hides much of the mechanics of threading and locking will encourage it to be used widely and inappropriately.
It is possible to implement the hidden lock such that it detects deadlocks and produces diagnostics, but in every implementation I have seen the runtime penalties add up to the point where, for any performance-critical application -- and why else are you using threading? -- programmers will need to revert to the low-level constructs in order to regain the level of control they need to realise the performance they are seeking.
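One common shape of such a deadlock-detecting lock is simply acquire-with-timeout: turn a suspected deadlock into a diagnostic instead of a hang. This sketch (class name, message text, and timeout value are all my own, for illustration) also shows where the per-acquire overhead comes from -- every single acquisition pays for the timeout machinery:

```python
import threading

class DiagnosticLock:
    """Sketch of a lock that reports a suspected deadlock rather
    than hanging forever.  The timed acquire is paid on EVERY
    acquisition -- the runtime penalty referred to above."""
    def __init__(self, name, timeout=5.0):
        self._lock = threading.Lock()
        self._name = name
        self._timeout = timeout
    def __enter__(self):
        # acquire(timeout=...) returns False on expiry instead of blocking forever
        if not self._lock.acquire(timeout=self._timeout):
            raise RuntimeError(
                f"possible deadlock: lock '{self._name}' not acquired "
                f"within {self._timeout}s")
        return self
    def __exit__(self, *exc):
        self._lock.release()
```

A production-quality version would also track ownership and wait-for graphs to report *which* threads are entangled, which costs considerably more than this -- hence the temptation to strip it all out again when chasing performance.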
And finally, there are a whole raft of classes of algorithm for which (naive) Futures are not just inappropriate, but completely unusable. For some of these classes, a sophisticated implementation can mitigate that and render them usable; but that is still the subject of research.
The bottom line is that to implement Futures such that they are reliable, easy to use, and sophisticated enough to allow their use for a wide range of problems is extremely hard. And testing them is even harder. If P6 shipped -- should that mythical event ever occur -- with a naive implementation, or a poorly tested sophisticated one, the damage to its reputation would be huge.
Now think how long (and how many re-writes) it has taken for Java threading to approach some level of real usability.
I'd appreciate a link to a decent recent technical discussion of the problem of the heaviness of shared memory in P5 and reasonable potential fixes. Failing that, a couple paragraphs that summarize the problem and the fix you're suggesting would be great. Or maybe you could write a meditation that invites monks to focus on the technical issues related to the problem and potential fixes?
I started to put something together in reply to Corion's post above, but RL took priority. Give me a week or two and I'll post my demonstration of the problems, highlight (some of) the source(s) of those problems; and offer some suggestions as to how they could be solved (for perl5).