PerlMonks  

Unexpected output from fork (Win32)

by maa (Pilgrim)
on Aug 09, 2004 at 09:49 UTC ( [id://381182]=perlquestion )

maa has asked for the wisdom of the Perl Monks concerning the following question:

Hi, folks

ActivePerl's fork emulation has much improved in perl 5.8.4 and I've been testing it on NT and have received some 'surprising' results.

My input file was 10_000 lines long, one 'computername' per line of the form computer00001, computer00002 etc. The script (correctly) spawns ten threads, each of which reads lines from the file and prints them to a results file along with the PID of the thread.

All 10_000 computernames are in the results file, in order with no duplicates. What surprised me was that each thread appears to process 256 consecutive lines from the input file even though the threads all 'take a turn'. It's not important but I wondered if anyone could shed any light on why it would do that?

When the output is shown on STDOUT each thread is taking its turn but jumping around the file, rather than taking sequential lines... not what I expected to see at all.

#!C:/ActivePerl5.8.4/bin/perl.exe
use strict;
use warnings;
require 5.8.4;   #We only run if we're using Perl v5.8.4
$|++;            #Unbuffer I/O

my @pids=();
my $parentpid=$$;
my $pid;
print "The parents PID -s $parentpid\n";
my $thread_counter=0;
my $testinput = "C:/Logchecks/test/forklist.txt";
my $testoutput = "C:/Logchecks/test/forkresults.csv";

open (IN,"<$testinput") or die("can't open input file $!\n");
open (OUT,">$testoutput") or die ("can't open output file $!\n");

OUTER:
for (1..10) {
    $thread_counter++;
    $pid=fork();
    if ($pid==0) { #child
        @pids=();
        $pid=$$;
        $parentpid=0;
        last OUTER;
    }else{
        push @pids,$pid;
        print "Parent has spawned child $pid\n";
    }
}

if ($parentpid == 0) { #kid
    my $items=0;
    while (my $cname= <IN> ) {
        chomp $cname;
        $items++;
        print "$cname checked by $$\n";
        print OUT "$cname,checked by $$\n";
        sleep 1;
    }
    print "Thread $$ processed $items items.\n";
    exit(0);
} else { #parent
    print "$thread_counter threads started - waiting on completion.\n";
    Reaper();
    print "Parent: Goodbye(:-)\n";
    exit(0);
}

sub Reaper {
    while (my $kid = shift(@pids)) {
        #warn "$$ to reap $kid\n" ;
        my $reaped = waitpid($kid,0);
        unless ($reaped != -1) {
            warn "waitpid $reaped: $?\n" ;
        }
    }
}

Thanks in advance - Mark

Replies are listed 'Best First'.
Re: Unexpected output from fork (Win32)
by BrowserUk (Patriarch) on Aug 09, 2004 at 11:58 UTC

    It comes down to buffering.

    The first time you ask perl to read a single line from the input file with <IN>, perl reads a buffer-sized chunk from the file and then gives you back a single line. Subsequent calls to <IN> (for that kid), then give you the next line from the buffer until it is exhausted, at which point perl refills the buffer.
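
    A minimal sketch of how to see this read-ahead for yourself, assuming a perl that uses the default buffered I/O layer. The temp file and the variable names are made up for the demo. It writes a throwaway file of 16-byte lines, reads a single line, then compares the logical position reported by tell with the operating system's file position reported by sysseek. The gap between the two is data that perl has already pulled into its buffer but not yet handed out; how large it is depends on how perl was built.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);
use File::Temp qw(tempfile);

# Build a throwaway input file: 10_000 lines of 16 bytes each
# (14 characters, then CR and LF), similar to the one in the question.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
printf {$tmp} "Computer%06d\r\n", $_ for 1 .. 10_000;
close $tmp or die "close failed: $!";

open my $in, '<', $path or die "can't open $path: $!";
binmode $in;
my $line = <$in>;                          # ask for a single line

my $logical  = tell $in;                   # where the program thinks it is
my $physical = sysseek $in, 0, SEEK_CUR;   # where the OS file pointer really is
close $in;

print "line length:      ", length($line), "\n";
print "logical position: $logical\n";
print "OS file position: $physical\n";
```

    If the OS position comes out well ahead of the logical one, that difference is the rest of the buffer load waiting to be consumed by later calls to <$in>.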

    Each of your kids will have its own buffer. So, each kid will process a group of lines (as many as fill the internal buffer) sequentially, before reading the next buffer load. By the time that happens, each of the other threads has already filled its buffer, so this kid gets a buffer load 10x buffersize further down the file.

    As your lines are 16 bytes, and each kid is processing 256 lines per buffer load, that makes the buffer size 4096 bytes. Notionally, the first thread to run will process lines 1..256 then 2561..2816 then 5121..5376 etc.

    However, it would be dangerous to make this assumption as the order in which the kids will be run is non-deterministic. It only appears somewhat deterministic in your example because the sleep 1 is having the effect of tending to serialise them.

    If you remove that sleep, you will see a much greater variability in the results. Each thread will still tend to process blocks of 256 lines at a time, but the first 2 or 3 kids will tend to process the bulk of the file and the others will process little or nothing.

    This effect is not (as is often assumed) a bug in the scheduler. It is simply that without the sleep, the first few kids use their full timeslice and by the time the 3rd or 4th kid is started, the whole file has been processed.
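
    The notional block arithmetic above can be sketched as follows. This assumes the kids take strict turns, which (as noted) they are not guaranteed to do; block_for is a hypothetical helper written just for this illustration.

```perl
use strict;
use warnings;

my $line_bytes   = 16;      # bytes per line, as stated above
my $buffer_bytes = 4096;    # buffer size inferred above
my $kids         = 10;      # children spawned in the question

my $lines_per_buffer = $buffer_bytes / $line_bytes;

# Hypothetical helper: the lines that kid number $kid (0-based) would get
# on its $round-th buffer refill (0-based) if the kids took strict turns.
sub block_for {
    my ($kid, $round) = @_;
    my $first = ($round * $kids + $kid) * $lines_per_buffer + 1;
    my $last  = $first + $lines_per_buffer - 1;
    return ($first, $last);
}

for my $round (0 .. 2) {
    my ($first, $last) = block_for(0, $round);
    print "kid 0, refill $round: lines $first..$last\n";
}
```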


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      As your lines are 16 bytes, and each kid is processing 256 lines per buffer load, that makes the buffer size 4096 bytes. Notionally, the first thread to run will process lines 1..256 then 2561..2816 then 5121..5376 etc.

      How blindingly obvious :-) Thanks.

      This was only a test program - I put the sleep statement in because the first thread processed all 10_000 without it and the operations that will eventually be in there will certainly take several seconds to complete.

      Once again thanks for a crystal clear explanation! It all makes sense now.

      - Mark

Re: Unexpected output from fork (Win32)
by Jenda (Abbot) on Aug 09, 2004 at 12:51 UTC

    $| only affects the output. That means input is still buffered. So when the first thread executes the my $cname= <IN> it reads not only the first line but the first 4KB, and next time it reads the line from its cache, not from the disk. It seems that in your case you were lucky and the 4KB chunks ended at the newlines, but I don't think you should take that for granted. If I try your script on a file generated by

    open OUT, '>', 'forklist.txt' or die "can't open forklist.txt: $!";
    print OUT "computer$_\n" for (1..10000);   # write to OUT, not STDOUT
    close OUT;
    I do get results like:
    ...
    r1835,checked by -2196
    computer1836,checked by -2196
    computer1837,checked by -2196
    ...
    computer1965,checked by -2196
    computer1966,checked by -2196
    computer196ter1250,checked by -4140
    computer1251,checked by -4140
    computer1252,checked by -4140
    ...
    computer1673,checked by -3928
    computer1674,cheputer2713,checked by -3496
    computer2714,checked by -3496
    ...
    computer2843,checked by -3496
    computer2844,checked by -3496
    computeomputer2128,checked by -3120
    computer2129,checked by -3120
    computer2130,checked by -3120
    ...
    

    Actually the way you use the $| it only affects STDOUT! Even the OUT handle is buffered! You'd better

    use FileHandle; ... OUT->autoflush();
    That way you know which handle is unbuffered. $| looks like something global, but it is not: it affects only the currently select()ed output handle!
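
    A small sketch of the difference, assuming a perl with the usual buffered output layer. The file names and variables are invented for the demo. Setting $| touches only STDOUT here, so the first file handle stays buffered, while the second has autoflush switched on explicitly. Checking the sizes on disk while both handles are still open shows which write has actually reached the file.

```perl
use strict;
use warnings;
use IO::Handle;                 # provides the autoflush method
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($buffered_path, $flushed_path) = ("$dir/buffered.csv", "$dir/flushed.csv");

open my $buffered, '>', $buffered_path or die "can't open $buffered_path: $!";
open my $flushed,  '>', $flushed_path  or die "can't open $flushed_path: $!";

$| = 1;                         # affects only the selected handle (STDOUT here)
$flushed->autoflush(1);         # affects only $flushed

print {$buffered} "computer00001,checked by $$\n";
print {$flushed}  "computer00001,checked by $$\n";

# Sizes on disk while both handles are still open.
my $buffered_size = (stat $buffered_path)[7];
my $flushed_size  = (stat $flushed_path)[7];
print "buffered handle:  $buffered_size bytes on disk\n";
print "autoflush handle: $flushed_size bytes on disk\n";

close $buffered or die "close failed: $!";
close $flushed  or die "close failed: $!";
```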

    You need to change your code to

    1. read the input file only in one thread
    2. flock the output filehandle before writing to it (and set the autoflush correctly)

    You may either read the first $no_of_chunks/$no_of_threads into an array, spawn the first child, empty the array in parent, read the next chunk, ... or read the file by the main thread and send the server names to the threads via pipes or Thread::Queue or shared variables or ...
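
    One possible shape for this, as a rough sketch rather than a drop-in replacement. It assumes a platform where fork and flock behave as documented (they may differ under the Win32 emulation), and the paths, counts and variable names are invented for the demo. As a simplification of the incremental reading described above, the parent reads the whole list and closes the input handle before forking, so no child ever holds a copy of a half-consumed input buffer. Each child gets its own copy of one chunk and appends to the results file under an exclusive lock, with autoflush on so each line is written before the lock is released.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);
use IO::Handle;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $input    = "$dir/forklist.txt";       # hypothetical paths
my $output   = "$dir/forkresults.csv";
my $children = 4;                         # kept small for the example
my $total    = 200;

# Create a small input file to work on, and an empty results file.
open my $list, '>', $input or die "can't write $input: $!";
printf {$list} "computer%05d\n", $_ for 1 .. $total;
close $list or die "close failed: $!";
open my $empty, '>', $output or die "can't create $output: $!";
close $empty or die "close failed: $!";

# Only the parent reads the input, and it finishes before any fork.
open my $in, '<', $input or die "can't open $input: $!";
chomp(my @names = <$in>);
close $in;

my $per_child = int((@names + $children - 1) / $children);
my @pids;

while (my @chunk = splice @names, 0, $per_child) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                      # child: works on its own copy of @chunk
        open my $out, '>>', $output or die "can't open $output: $!";
        $out->autoflush(1);
        for my $name (@chunk) {
            flock($out, LOCK_EX) or die "can't lock $output: $!";
            seek $out, 0, SEEK_END;       # in case another child appended meanwhile
            print {$out} "$name,checked by $$\n";
            flock($out, LOCK_UN) or die "can't unlock $output: $!";
        }
        close $out;
        exit 0;
    }
    push @pids, $pid;                     # parent: go round and hand out the next chunk
}

waitpid $_, 0 for @pids;

# Check the result: every name present, no torn lines.
open my $check, '<', $output or die "can't open $output: $!";
my @lines = <$check>;
close $check;

my %seen;
for my $line (@lines) {
    $seen{$1}++ if $line =~ /^(computer\d{5}),checked by -?\d+$/;
}
print scalar(@lines), " lines written, ", scalar(keys %seen), " distinct names\n";
```

    The order of lines in the results file will still vary from run to run; the lock only keeps individual lines whole.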

    Update (2 minutes after submit) : BrowserUk was quicker :-)

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

      thanks for your informative reply, Jenda

      Yes - I see why my $|++ shouldn't work after reading perlvar again (although adding it did get rid of duplicate entries). I couldn't, however, find any mention of the size of the input buffer used by <> or readline() - they both simply promise to return/read up to the next $/ (or EOF) when evaluated in scalar context. Can you point me to the apt document, please? I (wrongly) assumed that, as the seek pointer is shared, I'd get whole records, but you've disproved that :-)

      I tried using an array containing all the input already but that has its own problems when you use fork() - perhaps it's time I tried to use threads; :-) Then I can share the array.

      - Mark

        A couple of things.

        First, as Win32 pseudo-forks are threads, you can (apparently) use threads::shared to share an array (or other data) between them:

        From threads::shared POD:

        DESCRIPTION

        By default, variables are private to each thread, and each newly created thread gets a private copy of each existing variable. This module allows you to share variables across different threads (and pseudoforks on Win32). It is used together with the threads module.

        Though I admit I've never actually tried this.

        Second. Doing the equivalent of your OP code using threads is much simpler.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

        It seems the seek point is shared, but the caches are not. Which IMHO doesn't make sense. Either the handles should be completely separate or they should share the cache.

        I did not mean to share the array. The main thread would read the first tenth of the computer names into an array and fork() off a child, the child would have a copy of the array and would start processing those servers. In the meantime the main thread would empty its copy of the array, read the next tenth and spawn another child. And so forth.

        Of course this means that you will have the complete list of computer names in memory, which may or may not be the best thing to do.

        Jenda
        Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
           -- Rick Osborne

Re: Unexpected output from fork (Win32)
by tachyon (Chancellor) on Aug 09, 2004 at 11:23 UTC

      there's no 'problem' with the results I just found it odd that each thread had reportedly processed 256 consecutive items even though they weren't actually done consecutively.

      Sample output:

      
      10 threads started - waiting on completion.
      Computer000257 checked by -378
      Computer000513 checked by -380
      Computer000769 checked by -349
      Computer001025 checked by -366
      Computer001281 checked by -326
      Computer001537 checked by -305
      Computer001793 checked by -338
      Computer002049 checked by -360
      Computer002305 checked by -127
      Computer000002 checked by -373
      Computer000258 checked by -378
      Computer000514 checked by -380
      
      The input is sequential, Computer000000 - Computer10000. I just thought it was 'weird'.
