Re^3: Parrot, threads & fears for the future.

Well, that is how (real) threading works. The problem is that Perl was not designed to support threading, and years of attempts to bolt it on to the side of Perl have produced nothing particularly good. The prior attempts at supporting threading in Perl were way too buggy and were eventually abandoned. The latest Perl threading tries to avoid the bugs by having each thread duplicate the entire Perl interpreter (it is less buggy, but in my experience still buggy enough that avoiding it is usually wise). This gives us a sort of "worst of both worlds" situation: none of the operating system protections we have with a real fork, yet even more copying, memory use, and slowness than a real fork. So I don't really consider it "threading" (it is more like a rather bizarre specialized use of threads, not a general-purpose threading implementation), nor an appropriate tool for the vast majority of cases.

If you had real threads, then what you describe would be completely natural. With Perl's current threading model, even shared data structures aren't simply shared (so someone said) so I doubt it would work well to try to share a large data structure using the current Perl threads.
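To make the "each thread duplicates the interpreter" point concrete, here is a minimal sketch, assuming a perl built with ithreads support. The sub name count_in_threads is made up for illustration. A plain lexical is cloned into each new thread, so increments made there never reach the parent; only a variable explicitly marked shared is seen by everyone.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical helper: returns the parent's view of a plain lexical and a
# shared scalar after $n threads have each incremented both.
sub count_in_threads {
    my ($n) = @_;
    my $private = 0;            # cloned into every new thread
    my $shared : shared = 0;    # one value, visible to all threads

    for ( 1 .. $n ) {
        threads->create( sub {
            $private++;                      # changes this thread's copy only
            { lock($shared); $shared++; }    # changes the single shared value
        } )->join;
    }
    return ( $private, $shared );
}

my ( $private, $shared ) = count_in_threads(3);
print "private=$private shared=$shared\n";    # private=0 shared=3
```

Anything bigger than a scalar has to be shared element by element (or via shared_clone), which is part of why sharing a large structure this way is unattractive.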

Another approach for something like this is to use shared memory. Both Unix and Win32 support very nice shared memory facilities. However, Perl nearly refuses to deal with memory other than what its chosen malloc() hands to it, so getting Perl to use shared memory is quite difficult and once you get it to use it, you still end up serializing between the shared memory (that Perl won't work with directly) and Perl scalars. So, again, it likely isn't an appropriate solution for your problem.
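As a rough illustration of the serializing step, the sketch below round-trips a structure through a System V shared memory segment using Perl's built-in shmget/shmwrite/shmread and the core Storable module. It assumes a Unix-like system with SysV IPC available (it will not run on Win32), and the sub name roundtrip_via_shm and the 4096-byte segment size are arbitrary choices for the example.

```perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_RMID S_IRUSR S_IWUSR);
use Storable qw(freeze thaw);

# Hypothetical helper: copy $data into a shared segment and back out again.
sub roundtrip_via_shm {
    my ($data) = @_;
    my $size = 4096;    # arbitrary segment size for the example

    # Perl cannot build its scalars inside the segment, so flatten first.
    my $frozen = freeze($data);
    die "structure too large for segment" if length($frozen) + 4 > $size;

    defined( my $id = shmget( IPC_PRIVATE, $size, S_IRUSR | S_IWUSR ) )
        or die "shmget: $!";

    # Length-prefix the frozen bytes so the reader knows where they end.
    shmwrite( $id, pack( 'N/a*', $frozen ), 0, $size ) or die "shmwrite: $!";

    my $buf;
    shmread( $id, $buf, 0, $size ) or die "shmread: $!";
    shmctl( $id, IPC_RMID, 0 ) or die "shmctl: $!";

    # Thawing builds a brand new copy in ordinary Perl memory.
    return thaw( unpack( 'N/a*', $buf ) );
}

my $copy = roundtrip_via_shm( { name => 'record', values => [ 1, 2, 3 ] } );
print "$copy->{name}: @{ $copy->{values} }\n";
```

Every access pays for a freeze on one side and a thaw on the other, so the data is never really shared, only copied through a shared buffer.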

- tye


Unfortunately, the cost of using iThreads shared memory, required for the read and write buffers, is so high that using iThreads to do overlapped IO is impractical:

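use threads;
use threads::shared;
use Benchmark qw[ cmpthese ];

# Increment a shared scalar and a plain lexical 1e6 times each.
cmpthese -1, {
    shared    => q[ my $x : shared; ++$x for 1 .. 1e6 ],
    nonshared => q[ my $x         ; ++$x for 1 .. 1e6 ],
};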
            (warning: too few iterations for a reliable count)

          s/iter    shared nonshared
shared      1.31        --      -89%
nonshared  0.141      834%        --

There are other problems too. Whilst thread == interpreter, each read and write means giving up that thread's timeslice, and a task switch, before the transform thread can do any work. But with interpreter == kernel thread, when the task switch occurs there is no guarantee (in fact, very little chance) that the transform thread will get the next timeslice, because the round robin covers all kernel threads: those of this process and of every other process in the system. The upshot is that it takes at least 3 task switches to read and transform a record, and at least 3 more to write one.

The idealised situation would be that as soon as the transform thread has got hold of the last record read, the read thread would issue the read for the next one--going straight in to the IO wait--and the transform thread would be able to continue the timeslice. You cannot arrange for that to happen using kernel threads. At least not on a single cpu processor where it would be of most benefit.

If thread != interpreter, i.e. if more than one thread could be run within a single interpreter, then you could use cooperative (user-space, user-dispatched) threads (fibres in Win32 terms, unbound threads in Solaris terms) to achieve this:

1. The transform thread copies the previously read record and transfers control to the read thread.
2. The read thread issues an asyncIO request for the next record and then transfers control back to the transform thread.
3. When the transform thread finishes with this record it gives it to the write thread, then loops back and transfers control back to the read thread.
4. The read thread then does its wait for IO completion, which normally will have already completed whilst the transform thread was running, so no wait occurs. So it transfers control back to the transform thread, which copies the new record, and we're back to step 1.

I've truncated the write thread participation but it is essentially a mirror image of the read thread. So, with 3 cooperatively dispatched user threads running in the same kernel thread, the process is able to fully utilise every timeslice allocated to it by the OS.
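The ordering described above can be sketched in plain Perl. This is only a simulation under stated assumptions: core Perl has no fibres, so ordinary closures stand in for the three cooperative threads, and no real asynchronous IO takes place ("issuing" a read just parks the next record in a variable). The name run_pipeline and the trace labels are invented for the example; uppercasing stands in for the transform.

```perl
use strict;
use warnings;

# Simulated pipeline: closures stand in for the read, transform and write
# threads, all running in one kernel thread. Nothing here does real IO.
sub run_pipeline {
    my @input = @_;
    my ( $pending, @trace, @output );

    # "Issue" the next read: returns at once, the record arrives later.
    my $issue_read = sub {
        $pending = shift @input;
        push @trace, 'read:issue';
    };
    # "Complete" the read: normally already finished, so no real wait.
    my $complete_read = sub {
        push @trace, 'read:complete';
        my $rec = $pending;
        undef $pending;
        return $rec;
    };
    my $transform = sub { push @trace, 'xform'; return uc $_[0] };
    my $issue_write = sub { push @trace, 'write:issue'; push @output, $_[0] };

    $issue_read->();    # prime the pipeline with the first read
    while ( defined( my $rec = $complete_read->() ) ) {
        $issue_read->();                       # next read is in flight...
        $issue_write->( $transform->($rec) );  # ...while this record is worked on
    }
    return ( \@output, \@trace );
}

my ( $out, $trace ) = run_pipeline( 'a', 'b' );
print "@$trace\n";
print "@$out\n";    # A B
```

The point of the trace is that every read is issued before the transform of the previous record starts, so the process never hands its timeslice back just to wait.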

Using 3 kernel threads, 2 out of every 3 timeslices allocated to the process have to be given up almost immediately due to IO waits. The time-line for each read-transform-write cycle (simplistically) looks something like:

 read       |   xform       |  write
thread      |  thread       |  thread
------------|---------------|---------------   
Issue read  | wait lock(in) | wait lock(out)
IO wait     |               |               
            |               |               
--------------------------------------------
            |               |               
            ~               ~              
 some unknown number of kernel task switches
            ~               ~              
            |               |               
----------------------------------------------
Read completes     "        |  wait lock(out)
signal record               |               
issue next read             |               
IO wait     |  wait lock(in)|               
--------------------------------------------
            |               |               
            ~               ~              
 some unknown number of kernel task switches
            ~               ~              
            |               |               
----------------------------------------------
 IOwait     |obtain lock(in)|  wait lock(out)
            | do stuff      |               
            | do stuff      |               
            | wait lock(out)|
            | signal write  |               
            | loop
--------------------------------------------
            |               |               
            ~               ~              
 some unknown number of kernel task switches
            ~               ~              
            |               |               
----------------------------------------------
            | wait lock(in) | obtain lock(out)
            |               | write out     
            |               | IO wait       
            |               |               
            |               |
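For comparison, a three-kernel-thread version of the same pipeline might look like the sketch below. It assumes a perl built with ithreads and swaps the explicit in/out locks of the timeline for core Thread::Queue objects, which block in dequeue in much the same way. The name run_kernel_pipeline is hypothetical, and in-memory lists stand in for the real reads and writes.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical three-thread pipeline: reader -> transform -> writer.
sub run_kernel_pipeline {
    my @records = @_;
    my $in  = Thread::Queue->new;    # reader    -> transform
    my $out = Thread::Queue->new;    # transform -> writer

    my $reader = threads->create( sub {
        $in->enqueue($_) for @records;    # stands in for real reads
        $in->enqueue(undef);              # end-of-input marker
    } );
    my $xform = threads->create( sub {
        # dequeue blocks until a record arrives, giving up the timeslice
        while ( defined( my $rec = $in->dequeue ) ) {
            $out->enqueue( uc $rec );
        }
        $out->enqueue(undef);
    } );
    my $writer = threads->create( sub {
        my @written;
        while ( defined( my $rec = $out->dequeue ) ) {
            push @written, $rec;          # stands in for real writes
        }
        return join ',', @written;        # scalar result, fetched by join
    } );

    $_->join for $reader, $xform;
    return split /,/, $writer->join;
}

print join( ' ', run_kernel_pipeline(qw(a b c)) ), "\n";    # A B C
```

Each hand-off here goes through the kernel scheduler, and every record is copied into and out of shared storage on the way, which is exactly the overhead measured in the benchmark earlier.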

Even better than the AIO/fibres mechanism above is overlapped IO combined with asynchronous procedure calls (APC), but that is "too Redmond" for serious consideration here.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.


tye

It's not necessarily as bad as two out of every three timeslices being given up immediately for I/O: the transform could potentially consume many timeslices. However, I take your point that the reader and writer will probably make poor use of their timeslices.



	PerlMonks