in reply to Re^2: (OT) Programming languages for multicore computers
in thread (OT) Programming languages for multicore computers
Thank you. That's an interesting article to read.
But, like so many articles that set out to prove the point they want to make rather than looking at the evidence and seeing where it leads, you spend much of the time setting up straw men, and then proceed to knock them down.
- The first mistake you make (in the chronology of the article) is exactly the "big problem with threading" that my post above took special pains to refute--though you probably didn't read down that far--namely "synchronisation & locking".
The whole paragraph starting "There are some basic trade-offs to understand." is just an elaborately constructed straw man. Synchronisation and locking are only a problem if you design your algorithms to need synchronisation and locking. So, don't do that!
Without repeating all the carefully constructed explanations above: why would anyone architect an algorithm for 32k cores so that they all needed to access the same piece of memory? It beggars belief to think that anyone would, and the simple truth is "they" wouldn't. There is no possible algorithm that would require 32k threads to contend for a single piece of memory. None.
Let's just say for now(*) that we have a 32k-core processor on our desktop, and we want to filter a modest 1024x1024 image. Do we apply a lock to the entire image and set 32k threads going to contend for it? Or do we create one lock for each individual pixel? The answer to both is a resounding: No! Why would we? We simply partition the 1M pixels into 32k lots of 32, and give one partition to each of the 32k threads.
Voilà! The entire 1MP image filtered in the same time it takes one thread to filter a 32-pixel image. No locks. No contention. And the only "synchronisation" required is the master thread waiting for them all to finish. One load from disk. One store back to disk. As a certain annoying TV ad here in the UK has it: simples!
(*) I don't really believe that any of us will have anything like 32k cores on our desktops in the next 20 years, but I'm just going with your numbers. See below.
You simply cannot do that anywhere near as efficiently with processes; nor message passing models; nor write-once storage models.
- The second mistake you make is that you mutate Moore's Law: "the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years", into: "we're doubling the number of cores.". And that simply isn't the case.
You can do many things with those transistors. Doubling the number of cores is only one possibility--and one that no manufacturer has yet taken in any two successive Moore's cycles! Other things you can do are:
- Increase the width of the address bus.
8-bit; 16-bit; 32-bit; 64-bit. We probably won't need 128-bit addressing any time soon...
- Increase the width of the data paths.
We already have 128-bit data-paths in SIMD instruction sets, GPUs and other SPUs.
- Increase the size of on-die memory (cache).
We started with 2k caches; we now have 8 MB of L3 cache on some Nehalem processors.
- Increase the number of levels of on-die caching.
We started with one level; we now have three levels. What's the betting we'll see L4 caching at the next die-shrink 'tick'?
- Add on-die high-speed core-to-core data-paths to replace the Front Side Bus.
AMD did it a while back with HyperTransport. Intel has just played catch-up with Nehalem and its QuickPath Interconnect.
- Add/increase the number of register files. Duplicate sets of registers that can be switched with a single instruction.
They've been doing this on embedded processors for decades.
- Implement NUMA in hardware.
Look at the architecture of the Cell processor. By using one PPE to oversee 8 SPEs, and giving each SPE its own local memory as well as access to the PPE's larger on-chip cache, you effectively have an 8-way NUMA architecture on a chip.
In summary, relating modern CPU architectures to the OS techniques developed for the IO-bound supercomputer architectures of the 1980s and '90s just sets up another straw man that is easily knocked down. But it doesn't have any relevance to the short-term (2-5 year) or medium-term (5-10 year) futures, much less the long-term, 20-year time frame over which you projected your numbers game.
Making predictions for what will be in 20 years time in this industry is always fraught; but I'll have a go below(**).
- Your third mistake, albeit that it is just a variation on 2 above, is when you say: "On a NUMA system there is a big distinction between remote and local memory.".
Your assumption is that NUMA has to be implemented in software, at the OS level, and sit above discrete cores.
As I've pointed out above, IBM et al. (STI) have already implemented (a form of) NUMA in hardware, and it sits subordinate to the OS. In this way, the independence (contention-free status) of the cores running under NUMA is achieved, whilst they each operate upon sub-divisions of memory that come from a single, threaded process at the OS level. This effectively inverts the "local" and "remote" status of memory as compared with conventional NUMA clusters.
Whilst the local memory has to be kept in synchronisation with the remote memory, the contention is nominal, as each local memory simply caches a copy of a small section of the remote (main) memory--just as L3 caching does within current processors. And as each local cached copy holds an entirely different sub-division of the main memory, there is no contention.
(**) My prediction for the medium-term, 10-year future (I chickened out of the long term :) is that each application will be a single, multi-threaded process running within its own virtualised OS image. At the hypervisor level, the task will simply be to distribute the threads and physical memory amongst the cores; bringing more cores on-line, or switching them off, as the workloads vary. This means that the threads of a particular VM/process/application will tend to cluster over a limited number of the available cores, controlled by dynamic affinity algorithms, but with the potential to distribute any single task's threads over the full complement of available cores should the need arise, and an appropriate algorithm be chosen.
But that's about as far as I'm prepared to go. Time will tell for sure :)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.