maa has asked for the wisdom of the Perl Monks concerning the following question:
Hi, folks
ActivePerl's fork emulation has been much improved in Perl 5.8.4, and I've been testing it on NT with some 'surprising' results.
My input file was 10_000 lines long, one 'computername' per line of the form computer00001, computer00002, etc. The script spawns (correctly) ten threads which each read from the file and then print to a results file with the PID of the thread. All 10_000 computernames are in the results file, in order, with no duplicates. What surprised me was that each thread appears to process 256 consecutive lines from the input file even though the threads all 'take a turn'. It's not important, but I wondered if anyone could shed any light on why it would do that? When the output is shown on STDOUT, each thread is taking its turn but jumping around the file rather than taking sequential lines... not what I expected to see at all.
#!C:/ActivePerl5.8.4/bin/perl.exe
use strict;
use warnings;
require 5.8.4; # We only run if we're using Perl v5.8.4
$|++;          # Unbuffer I/O
my @pids      = ();
my $parentpid = $$;
my $pid;
print "The parent's PID is $parentpid\n";
my $thread_counter = 0;
my $testinput  = "C:/Logchecks/test/forklist.txt";
my $testoutput = "C:/Logchecks/test/forkresults.csv";
open(IN,  "<$testinput")  or die("can't open input file $!\n");
open(OUT, ">$testoutput") or die("can't open output file $!\n");
OUTER:
for (1 .. 10) {
    $thread_counter++;
    $pid = fork();
    if ($pid == 0) {
        # child
        @pids      = ();
        $pid       = $$;
        $parentpid = 0;
        last OUTER;
    }
    else {
        push @pids, $pid;
        print "Parent has spawned child $pid\n";
    }
}
if ($parentpid == 0) {
    # kid
    my $items = 0;
    while (my $cname = <IN>) {
        chomp $cname;
        $items++;
        print "$cname checked by $$\n";
        print OUT "$cname,checked by $$\n";
        sleep 1;
    }
    print "Thread $$ processed $items items.\n";
    exit(0);
}
else {
    # parent
    print "$thread_counter threads started - waiting on completion.\n";
    Reaper();
    print "Parent: Goodbye(:-)\n";
    exit(0);
}

sub Reaper {
    while (my $kid = shift(@pids)) {
        #warn "$$ to reap $kid\n";
        my $reaped = waitpid($kid, 0);
        if ($reaped == -1) {
            warn "waitpid $reaped: $?\n";
        }
    }
}
Thanks in advance - Mark
Re: Unexpected output from fork (Win32)
by BrowserUk (Patriarch) on Aug 09, 2004 at 11:58 UTC
It comes down to buffering.
The first time you ask perl to read a single line from the input file with <IN>, perl reads a buffer-sized chunk from the file and then gives you back a single line. Subsequent calls to <IN> (for that kid), then give you the next line from the buffer until it is exhausted, at which point perl refills the buffer.
Each of your kids will have its own buffer. So, each kid will process a group of lines (as many as fill the internal buffer) sequentially, before reading the next buffer load. By the time that happens, each of the other kids has already filled its own buffer, so this kid gets a buffer load 10x buffersize further down the file.
As your lines are 16 bytes, and each kid is processing 256 lines per buffer load, that makes the buffer size 4096 bytes. Notionally, the first thread to run will process lines 1..256, then 2561..2816, then 5121..5376, etc.
However, it would be dangerous to make this assumption, as the order in which the kids will be run is non-deterministic. It only appears somewhat deterministic in your example because the sleep 1 is having the effect of tending to serialise them.
If you remove that sleep, you will see much greater variability in the results. Each thread will still tend to process blocks of 256 lines at a time, but the first 2 or 3 kids will tend to process the bulk of the file and the others will process little or nothing.
This effect is not (as is often assumed) a bug in the scheduler. It is simply that without the sleep, the first few kids use their full timeslice and by the time the 3rd or 4th kid is started, the whole file has been processed.
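A quick way to see the read-ahead in action is to compare tell() (the logical position within the handle's buffer) with sysseek() (the real OS file position, which bypasses buffering). A minimal sketch - the filename is made up, and the buffer size you see (4096 vs 8192) depends on how your perl was built:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);

# Build a throwaway input file (hypothetical name) of fixed-width lines.
my $file = 'buffer_demo.txt';
open my $out, '>', $file or die "can't create $file: $!\n";
printf $out "computer%05d\n", $_ for 1 .. 1000;
close $out;

open my $in, '<', $file or die "can't open $file: $!\n";
my $line = <$in>;    # ask perl for ONE line...

# tell() reports the logical position (end of the first line), but
# sysseek(FH, 0, SEEK_CUR) reports where the OS file pointer really is:
# a whole buffer load further on, because <$in> read ahead.
printf "logical position after one line: %d\n", tell($in);
printf "OS position after one line:      %d\n", sysseek($in, 0, SEEK_CUR);
close $in;
unlink $file;
```

The gap between the two numbers is exactly the read-ahead each kid gets to itself before it goes back to the shared seek pointer.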
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
As your lines are 16 bytes, and each kid is processing 256 lines per buffer load, that makes the buffer size 4096 bytes. Notionally, the first thread to run will process lines 1..256, then 2561..2816, then 5121..5376, etc.
How blindingly obvious :-) Thanks.
This was only a test program - I put the sleep statement in because the first thread processed all 10_000 without it and the operations that will eventually be in there will certainly take several seconds to complete.
Once again thanks for a crystal clear explanation! It all makes sense now. - Mark
Re: Unexpected output from fork (Win32)
by Jenda (Abbot) on Aug 09, 2004 at 12:51 UTC
$| only affects the output. That means input is still buffered. So when the first thread executes the my $cname = <IN> it reads not only the first line but the first 4KB, and next time it reads the line from its cache, not from the disk. It seems that in your case you were lucky and the 4KB chunks ended at the newlines, but I don't think you should take that for granted. If I try your script on a file generated by
open OUT, '>', 'forklist.txt';
print OUT "computer$_\n" for 1 .. 10000;
close OUT;
I do get results like:
...
r1835,checked by -2196
computer1836,checked by -2196
computer1837,checked by -2196
...
computer1965,checked by -2196
computer1966,checked by -2196
computer196ter1250,checked by -4140
computer1251,checked by -4140
computer1252,checked by -4140
...
computer1673,checked by -3928
computer1674,cheputer2713,checked by -3496
computer2714,checked by -3496
...
computer2843,checked by -3496
computer2844,checked by -3496
computeomputer2128,checked by -3120
computer2129,checked by -3120
computer2130,checked by -3120
...
Actually, the way you use $|, it only affects STDOUT! Even the OUT handle is buffered! You'd better
use FileHandle;
...
OUT->autoflush();
That way you know which handle is unbuffered. $| looks like something global, but it isn't: it affects only the currently select()ed output handle!
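For the record, the two spellings do the same thing; a minimal sketch (the output filename is made up):

```perl
use strict;
use warnings;
use IO::Handle;    # gives every filehandle an autoflush() method

open my $out, '>', 'results.csv' or die "can't open results.csv: $!\n";

# The $| way: it applies only to whichever handle is currently
# select()ed, so you have to juggle the selected handle to reach $out.
{
    my $previous = select $out;
    $| = 1;
    select $previous;
}

# The readable way: name exactly the handle you mean.
$out->autoflush(1);

print $out "unbuffered now\n";
close $out;
```

The method call is self-documenting, which matters once there's more than one output handle in the program.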
You need to change your code to
- read the input file only in one thread
- flock the output filehandle before writing to it (and set the autoflush correctly)
You may either read the first $no_of_chunks/$no_of_threads into an array, spawn the first child, empty the array in the parent, read the next chunk, ... or read the file in the main thread and send the server names to the children via pipes or Thread::Queue or shared variables or ...
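A sketch of the queue variant using Thread::Queue, with a made-up stand-in list instead of the real forklist.txt: only the main thread touches the input, so no two kids can ever see a torn or duplicated line.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# One reader, many workers.
my $queue   = Thread::Queue->new;
my @workers = map {
    threads->create(sub {
        # Block until a name (or the undef stop marker) arrives.
        while (defined(my $cname = $queue->dequeue)) {
            print "$cname checked by thread ", threads->tid, "\n";
        }
    });
} 1 .. 10;

# Stand-in for reading forklist.txt line by line in the main thread.
$queue->enqueue(sprintf "computer%05d", $_) for 1 .. 100;

# One undef per worker tells its dequeue loop to stop.
$queue->enqueue(undef) for @workers;
$_->join for @workers;
```

The workers' output will still interleave on STDOUT, but each name is handed to exactly one worker.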
Update (2 minutes after submit) : BrowserUk was quicker :-)
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
|
Thanks for your informative reply, Jenda.
Yes - I see why my $|++ shouldn't work after reading perlvar again (although adding it did get rid of duplicate entries). I couldn't, however, find any mention of the size of the input buffer used by <> or readline() - they both simply promise to return/read up to the next $/ (or EOF) when evaluated in scalar context. Can you point me to the relevant document, please? I (wrongly) assumed that, as the seek pointer is shared, I'd get whole records, but you've disproved that :-)
I tried using an array containing all the input already but that has its own problems when you use fork() - perhaps it's time I tried to use threads; :-) Then I can share the array.
- Mark
|
A couple of things.
First, as Win32 pseudo-forks are threads, you can (apparently) use threads::shared to share an array (or other data) between them:
From threads::shared POD:
DESCRIPTION
By default, variables are private to each thread, and each newly created thread gets a private copy of each existing variable. This module allows you to share variables across different threads (and pseudoforks on Win32). It is used together with the threads module.
Though I admit I've never actually tried this.
Second, doing the equivalent of your OP code using threads is much simpler.
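For what it's worth, a minimal sketch of what that might look like with real threads and a shared work list (I haven't tried this on 5.8.4 specifically; the names and counts are made up):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# A shared work list: every thread sees the same @names.
my @names : shared = map { sprintf "computer%05d", $_ } 1 .. 50;

my @workers = map {
    threads->create(sub {
        while (1) {
            my $cname;
            {
                lock @names;      # serialise access to the shared array
                $cname = pop @names;
            }
            last unless defined $cname;
            print "$cname checked by thread ", threads->tid, "\n";
        }
    });
} 1 .. 5;

$_->join for @workers;
```

Because the pop happens under lock, each name goes to exactly one thread, with no file buffers involved at all.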
|
It seems the seek pointer is shared, but the caches are not, which IMHO doesn't make sense. Either the handles should be completely separate or they should share the cache.
I did not mean to share the array. The main thread would read the first tenth of the computer names into an array and fork() off a child, the child would have a copy of the array and would start processing those servers. In the meantime the main thread would empty its copy of the array, read the next tenth and spawn another child. And so forth.
Of course this means that you will have the complete list of computer names in memory, which may or may not be the best thing to do.
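A sketch of that chunking scheme, with a made-up list standing in for the file contents:

```perl
use strict;
use warnings;

# Stand-in for the real computer list read from forklist.txt.
my @names = map { sprintf "computer%05d", $_ } 1 .. 100;

my $per_child = int(@names / 10) || 1;
my @pids;
while (my @batch = splice @names, 0, $per_child) {
    my $pid = fork;
    die "fork failed: $!\n" unless defined $pid;
    if ($pid == 0) {
        # The child works on its private copy of @batch ...
        print "$_ checked by $$\n" for @batch;
        exit 0;
    }
    # ... while the parent immediately moves on to the next chunk.
    push @pids, $pid;
}
waitpid $_, 0 for @pids;
```

Each child inherits only its own batch, so the shared-seek-pointer problem never arises.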
Jenda
Re: Unexpected output from fork (Win32)
by tachyon (Chancellor) on Aug 09, 2004 at 11:23 UTC
10 threads started - waiting on completion.
Computer000257 checked by -378
Computer000513 checked by -380
Computer000769 checked by -349
Computer001025 checked by -366
Computer001281 checked by -326
Computer001537 checked by -305
Computer001793 checked by -338
Computer002049 checked by -360
Computer002305 checked by -127
Computer000002 checked by -373
Computer000258 checked by -378
Computer000514 checked by -380
The input is sequential, Computer000000 - Computer10000. I just thought it was 'weird'.