PerlMonks  

Using Perl to run a Windows command-line utility many times with ordered, parallel execution

by Jim (Curate)
on Jan 26, 2014 at 20:42 UTC ( [id://1072144]=perlquestion )

Jim has asked for the wisdom of the Perl Monks concerning the following question:

I want to use Perl 5.14 (ActiveState ActivePerl) to manage many thousands of executions of an external command-line utility under Microsoft Windows, each with a different argument. If I run them in sequence in a simple loop, they'll take too long to finish, so I want to run them in parallel as well as in an ordered sequence, n at a time (where the optimum value of n is probably going to be 20). I've never done this kind of job control before using Perl, so I need help getting started. Before now, I've used both tricks with the start command in funky systems of batch files and the nifty Unix xargs utility under both the MKS Toolkit and Cygwin. Now, for several important reasons, I need to use Perl instead.

So let's say I want to run a command-line utility named doit.exe 10,000 times on the arguments toit0000, toit0001, toit0002, …, toit9999. I want to invoke the jobs generally in that order, but I want to run them in parallel, 20 at a time. How do I do this in Perl? Assume I have a simple array of the ordered arguments; for example:  my @ordered_arguments = <DATA>; chomp @ordered_arguments;.

Jim

Replies are listed 'Best First'.
Re: Using Perl to run a Windows command-line utility many times with ordered, parallel execution
by BrowserUk (Patriarch) on Jan 26, 2014 at 21:08 UTC

    Just add the DATA :)

    #! perl -slw
    use strict;
    use threads stack_size => 4096;
    use threads::Q;

    sub thread {
        my $Q = shift;
        while( my $item = $Q->dq ) {
            system qq[ doit.exe $item ];
        }
    }

    our $T //= 20;

    my @ordered_arguments = <DATA>;
    chomp @ordered_arguments;

    my $Q = threads::Q->new( $T * 2 );

    my @threads = map async( \&thread, $Q ), 1 .. $T;

    $Q->nq( $_ ) for @ordered_arguments;
    $Q->nq( (undef) x $T );
    $_->join for @threads;

    __DATA__
    ...

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

        Thread::Queue would work just as well, though it would consume a little more memory without some additional code to limit the size of the queue.

        But for the OP's 10k items and 20 threads that would still come in at less than 40 MB.
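For readers who want to try the Thread::Queue route mentioned above, here is a minimal sketch using only core modules. It is not BrowserUk's code: build_command() is a hypothetical helper, the placeholder command simply runs the current perl ($^X) so the sketch can be tried anywhere, and the whole argument list is enqueued at once, so the queue is not size-limited.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $WORKERS = 20;    # how many jobs to run at once (n in the question)

# Hypothetical helper: turn one argument into the command to run.
# Placeholder only; for the real job it might return ('doit.exe', $arg).
sub build_command {
    my ($arg) = @_;
    return ($^X, '-e', '1', $arg);
}

# Each worker takes the next argument off the shared queue until it
# dequeues undef, which this sketch uses as the "no more work" marker.
sub worker {
    my ($queue) = @_;
    my @failed;
    while (defined(my $arg = $queue->dequeue)) {
        system(build_command($arg)) == 0 or push @failed, $arg;
    }
    return @failed;
}

# Sample arguments standing in for the real, ordered list.
my @ordered_arguments = map { sprintf 'toit%04d', $_ } 0 .. 39;

my $queue   = Thread::Queue->new;
my @threads = map { threads->create({ context => 'list' }, \&worker, $queue) }
              1 .. $WORKERS;

$queue->enqueue(@ordered_arguments);     # handed out in this order
$queue->enqueue((undef) x $WORKERS);     # one stop marker per worker

my @failed = map { $_->join } @threads;
printf "%d jobs run, %d failed\n", scalar @ordered_arguments, scalar @failed;
```

Jobs are started in queue order, but they finish in whatever order the external program allows.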


Re: Using Perl to run a command-line utility many times with ordered, parallel execution
by Anonymous Monk on Jan 26, 2014 at 20:50 UTC
Re: Using Perl to run a Windows command-line utility many times with ordered, parallel execution
by ambrus (Abbot) on Jan 27, 2014 at 19:38 UTC

    Just write something simple yourself, like

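    while (@command) {
        my @pid;
        # Start one batch without waiting (system(1, ...) is Win32-specific).
        for my $cmd (splice @command, 0, $how_many_parallel) {
            say $cmd;
            push @pid, system(1, $cmd);
        }
        # Wait for every job in the batch before starting the next one.
        for my $pid (@pid) {
            $pid == waitpid($pid, 0) or die;
            $? and die;
        }
    }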
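The loop above works in batches, so each batch waits for its slowest job before the next one starts. If that matters, a rolling variant can keep the slots full. What follows is only a sketch under stated assumptions: spawn() is a hypothetical helper (system(1, ...) on Windows, fork/exec elsewhere so it can be tried on other systems), and the commands are placeholders that run the current perl.

```perl
use strict;
use warnings;
use POSIX ();

my $MAX_RUNNING = 20;    # how many jobs may run at once

# Hypothetical helper: start a command without waiting, return its pid.
sub spawn {
    my @cmd = @_;
    return system(1, @cmd) if $^O eq 'MSWin32';    # see perlport
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    return $pid if $pid;                           # parent
    exec { $cmd[0] } @cmd or POSIX::_exit(127);    # child
}

# Placeholder commands in the order they should be started.
my @queue = map { [ $^X, '-e', '1', sprintf('toit%04d', $_) ] } 0 .. 99;

my (%running, @started, @failed);

while (@queue or %running) {
    # Top up to the limit, always taking the next command in order.
    while (@queue and keys(%running) < $MAX_RUNNING) {
        my $cmd = shift @queue;
        my $pid = spawn(@$cmd);
        $running{$pid} = $cmd;
        push @started, $cmd->[-1];
    }

    # Block until any one child finishes, then free its slot.
    my $pid = wait();
    last if $pid == -1;                        # no children left to wait for
    my $cmd = delete $running{$pid} or next;   # not one of ours
    push @failed, $cmd->[-1] if $? != 0;
}

printf "%d started, %d failed\n", scalar @started, scalar @failed;
```

Jobs still start in list order; only the waiting changes, since wait() returns as soon as any child ends. On Windows, perlport notes that wait and waitpid only apply to processes started with system(1, ...) or to fork() pseudo-processes.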

      Thank you, ambrus. This is precisely the example I needed to help me get started.

      It wasn't clear enough from my original post that my problem isn't just that I don't understand how to do Windows process control using Perl. My problem is that I don't understand process control well at all. And when I read about it in documentation—not just Perl documentation, any documentation—my head explodes. I struggle with the unfamiliar lingo. If there's a good tutorial for absolute beginners, I haven't found it yet. But with the help of your straightforward Perl code snippet, I was able to make a good start.

      So here's the script I cobbled together based on your example. It has extra junk in it that's only there for self-educational purposes. Also, there are actually thousands of lines of DATA (i.e., external commands to be run), not just these few.

      use strict;
      use warnings;

      use English qw( -no_match_vars );  # For $CHILD_ERROR
      use POSIX ();

      my $BATCH_SIZE = 8;

      my @commands;

      LINE:
      while (<DATA>) {
          next LINE if m/^\s*#/;
          chomp;
          my ($txt_file, $tab_file, $total_documents) = split m/,/, $_, 3;
          my $command = "doit $txt_file > $tab_file";
          push @commands, [ $command, $txt_file, $total_documents ];
      }

      while (@commands) {
          my @pids;
          my %txt_file_by;

          for my $cmd (splice @commands, 0, $BATCH_SIZE) {
              my ($command, $txt_file, $total_documents) = @$cmd;
              my $pid = system(1, $command);
              push @pids, $pid;
              my $timestamp = POSIX::strftime('%H:%M:%S', localtime);
              print "$timestamp\t$pid\t$command\n";
              $txt_file_by{$pid} = $txt_file;
          }

          for my $pid (@pids) {
              $pid == waitpid($pid, 0) or die;
              die if $CHILD_ERROR;
              my $timestamp = POSIX::strftime('%H:%M:%S', localtime);
              print "$timestamp\t$pid\t$txt_file_by{$pid}\n";
          }
      }

      exit 0;

      __DATA__
      D000349000.txt,D000349000.tab,564530
      Z0000042.txt,Z0000042.tab,457277
      Z0000013336.txt,Z0000013336.tab,457277
      Z0000013426.txt,Z0000013426.tab,382292
      D000250000.txt,D000250000.tab,382014
      C000004770.txt,C000004770.tab,356580
      Z000003462.txt,Z000003462.tab,356580
      Z000004770.txt,Z000004770.tab,356580
      Z0000012073.txt,Z0000012073.tab,349325
      D000303000.txt,D000303000.tab,347852
      Z0000013787.txt,Z0000013787.tab,347852
      Z0000014288.txt,Z0000014288.tab,289025
      D004607000.txt,D004607000.tab,268763
      D000245000.txt,D000245000.tab,258363
      Z0000012214.txt,Z0000012214.tab,257861
      Z0000013342.txt,Z0000013342.tab,257861
      Z0000015322.txt,Z0000015322.tab,243612
      D000275000.txt,D000275000.tab,242962
      D000272000.txt,D000272000.tab,224791
      D000271000.txt,D000271000.tab,223537
      D000717000.txt,D000717000.tab,216624
      Z0000015315.txt,Z0000015315.tab,215390
      D004457000.txt,D004457000.tab,211271
      Z0000012004.txt,Z0000012004.tab,211271

      Until I implemented this, ran it, and watched it closely in action, I couldn't figure out either system() or waitpid(). I don't grok them, but I more-or-less understand what they're accomplishing. It's still unclear to me what the first argument of system(), 1, is for, and I also don't understand what the second argument of waitpid(), 0, is intended to do. An explanation of these mysterious arguments would be helpful.

      What are examples of appropriate messages to use with the two calls to die()? I don't fully understand what's being tested and could fail at those points in the script. More generally, how might I flesh out the error handling in the script to make it more robust?

      What's the difference between a process and a thread? When and why would I choose to use multiple processes rather than multiple threads and vice versa? I'm running Microsoft Windows, not Unix or Linux. How much does this matter?

      If there's an easier or slicker way to compute a timestamp than how I did it here using POSIX::strftime() and localtime(), I'd appreciate a tip.

      Thank you again for your help.

      Jim
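On the timestamp question, one possibly slicker option is the core Time::Piece module (in core since Perl 5.10), which replaces localtime with a version that returns an object. This is only a sketch; the "starting" label on the output line is made up for illustration.

```perl
use strict;
use warnings;
use Time::Piece;    # overrides localtime to return a Time::Piece object

# hms() formats the time of day as HH:MM:SS.
my $timestamp = localtime->hms;
print "$timestamp\tstarting\n";

# strftime() is also available as a method when another layout is wanted.
my $with_date = localtime->strftime('%Y-%m-%d %H:%M:%S');
print "$with_date\tstarting\n";
```

The POSIX::strftime() call in the script above is fine too; this just saves passing the localtime list around.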

        For the first argument of system being 1, please see perldoc perlport.
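To spell that out a little: perlport says that on Win32, system(1, @args) spawns the process and returns its process id immediately instead of waiting for it. The 0 passed to waitpid() means "no flags", so the call blocks until that particular child has ended, and the child's status lands in $? ($CHILD_ERROR). The first die therefore fires when waitpid() did not reap the expected process, and the second when the child reported a non-zero status. A sketch of how the status might be turned into a readable message follows; status_problem() is a hypothetical helper, and the demo uses plain system() with the current perl only because it sets $? the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: describe what a wait status ($?) says went wrong,
# or return undef if the child ended normally with exit status 0.
sub status_problem {
    my ($label, $status) = @_;
    return "$label: could not be started or waited for" if $status == -1;
    return sprintf('%s: killed by signal %d', $label, $status & 127)
        if $status & 127;
    return sprintf('%s: exited with status %d', $label, $status >> 8)
        if $status >> 8;
    return;
}

# Demo: plain system() waits for the child and sets $? as waitpid() does.
for my $code (0, 3) {
    system($^X, '-e', "exit $code");
    my $problem = status_problem("demo exit $code", $?);
    print defined $problem ? "$problem\n" : "demo exit $code: ok\n";
}

# In the batch loop, the two checks could then read (sketch):
#   my $reaped = waitpid($pid, 0);
#   $reaped == $pid or die "waitpid($pid) returned $reaped\n";
#   my $problem = status_problem($command, $?);
#   die "$problem\n" if defined $problem;
```

Whether to die on the first failure or collect the failures and carry on is a design choice; with thousands of jobs, collecting them and reporting at the end may be kinder.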
