This kind of parallelism is the kind that is already fairly well handle by dedicated vector processors (as the big iron guys call them) and by GPUs and DSPs. The kind where a single, identical, often CISC, opcode is performed on a large number of datapoints in parallel.
This kind is quite easily dealt with in hardware. The depth of the parellelisation is basically controlled by how much silicon you are prepared to dedicate to the operation.
A generalisation of this would be to allow a whole sequence of opcodes (an entire subroutine or block of code) to applied to a large number of datapoints simultaneuosly. Again, the datapoints would be loaded into "special memory", hardwired to operate each opcode on all of those registers in parallel.
The criteria for what constituted a suitable block of code (ie. reentrant code without side-effects or dependancies beyond it's parameters, and probably with a single result from each call), would be a software (compiler or interpreter) decision. It would also require good optimisers (whether compiler or interpreter) to make best use of such parellelism.
Sync point parellelism.
This is where different sections--usually sequential--of the source code can be run in parallel until a point is reached where they share a common dependancy. Although analogous to the pipelining that many modern processors do, to get best benefit, it needs to be done at a macro (source code) or function point) level rather than the micro (a few sequential opcodes) level as done is with pipelining.
I think (within the limitations of my sparse knowledge of silicon), that pipelining has gone just about as it can go. The economics of branch-point prediction and the costs of flushing the pipeline when the BPP goes wrong, severely limit the effectiveness of pipelining beyond a certain, rather low, limit.
In order for best use of syncpoint parellelism to made, OSs will need to radically alter the scheduling schemers they use. The current round-robin within priority groups, with starvation priority promotion, and processes (or threads as currently implemented) as atomic entities will have to be replaced by a much finer grained mechanism. And compilers and interpreters will have to get a lot cleverer to make good use of it.
In effect, the processor pool will be driven by a 'macro-processor' overseen, queue of 'units of code' that need processing. Each individual process will constitute a stream of VHL opcodes. The macro-processor will enqueue these, interleaving the VHL opcodes from different processes according to priorities etc. The pool of processors will pull the next available unit of code off of the central queue and execute it (without regard to what process it belongs to), and then go back and grab the next available. Reentrancy will be paramount. As will a capabilities based security mechanism.
An interesting, and site-topical, sideeffect is that interpreted code will probably carry a much smaller penalty relative to compiled code, as the compilers will need to produce streams or groups of small, self-contained units of code.
Once code is compiled/interpreted into these self-contained units, it then becomes possible to transmit these units to external processors or pools of cooperating machines to achieve massively parallel operations across peer groups of lan/net connected machines.
It's a hard concept to describe in words, and my attempts at an ASCII art diagram left much to be desired. There are probably no web references I can give either as it is very much a prediction of where I personally think things will go, rather than a recounting of any individual piece I have read.
A sort of mental mish-mash, reading between the lines of everything I have read.