Parallel running the process

by ariesb2b (Initiate)
on Mar 06, 2012 at 18:52 UTC
ariesb2b has asked for the wisdom of the Perl Monks concerning the following question:

I have created a Perl script to find the number of total/active/current users in a database. I provide a file containing a list of databases as an input argument (perl_script -i <filename>).

After querying each database, I print the results to a file in a tabular format. The complete script takes around 2-3 hours to run, depending on database usage. Currently the script runs against each database one by one, from the first to the last.

I want to know if I can run the script against all databases in parallel so that it takes less time to complete.

Re: Parallel running the process
by roboticus (Canon) on Mar 06, 2012 at 19:01 UTC

    ariesb2b:

    One simple way is to:

    1. Split the file listing the databases into chunks
    2. Run the script multiple times, in parallel
    3. Concatenate the output files into your report

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks roboticus. Yes, that is definitely one way to do it, but it involves a lot of manual work. I have to run the script monthly and will be putting it inside a cron job.

      Is there any way I can do this in the script itself, with the processes running in the background and each writing its output to the desired file as it completes?

        Sure. You may want to read about fork, or threads if you feel more comfortable with them, or the classic solution: a select loop. But I assume you're using DBI for database access, and I don't think DBI can hand control back to the program while it is waiting for results.

        I would probably use fork, but that's because I understand the concept well, and a simple solution seems to be good enough to solve your problem.
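
        For example, here is a minimal fork sketch (untested): one child per database, each with its own DBI connection, since a handle must not be shared across a fork. The DSN, credentials, query, and file names are all placeholders for illustration.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my @databases = ('db1', 'db2', 'db3');     # placeholder names

        my @pids;
        for my $db (@databases) {
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {                       # child: query one database
                my $dbh = DBI->connect("dbi:mysql:database=$db",
                                       'user', 'password', { RaiseError => 1 });
                my ($count) = $dbh->selectrow_array(
                    'SELECT COUNT(*) FROM users'); # placeholder query
                $dbh->disconnect;
                open my $out, '>', "result.$db" or die $!;
                print {$out} "$db\t$count\n";      # each child writes its own file
                close $out;
                exit 0;
            }
            push @pids, $pid;                      # parent: remember the child
        }
        waitpid $_, 0 for @pids;                   # then merge the result.* files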

        ariesb2b:

        It doesn't have to be hard. I was thinking that if that approach worked for you, you could write a simple script to do the overhead for you. Something like (untested):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use autodie;

        my $num_jobs = 4;

        # Split the work up: deal input lines round-robin into $num_jobs files
        open my $IFH, '<', 'data.inp';
        my @OFH;
        open $OFH[$_ - 1], '>', "data.$_" for 1 .. $num_jobs;

        my $cnt = 0;
        while (<$IFH>) {
            ++$cnt;
            my $FH = $OFH[$cnt % $num_jobs];
            print $FH $_;
        }
        close $OFH[$_ - 1] for 1 .. $num_jobs;

        # Do the work: launch one child per chunk, then wait for all of them
        my @pids;
        for my $j (1 .. $num_jobs) {
            my $pid = fork();
            if ($pid == 0) {    # child: run one chunk of the original script
                exec 'perl', 'orig_do_job',
                     "--infile=data.$j", "--outfile=data.out.$j";
            }
            push @pids, $pid;
        }
        waitpid $_, 0 for @pids;

        # Collect the results
        `cat data.out.* > data.out`;

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Parallel running the process
by BrowserUk (Pope) on Mar 06, 2012 at 19:45 UTC
    provide a file which contains a list of databases as an input argument (perl_script -i <filename>)

    Are the databases all on the same server or different servers?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.


Re: Parallel running the process
by sundialsvc4 (Monsignor) on Mar 06, 2012 at 21:33 UTC

    Could you also please give us some idea of how your query or queries are constructed, and a brief idea of what indexes (if any) might be in play? My intuitive sense is that “2 to 3 hour” run-times might be avoidable, and if that is the case it would make all the difference.

    Otherwise, I think, the operative question is how well your database servers can handle whatever it is you are doing. Particularly if your queries are resource-intensive (and let us for the moment presume that they unfortunately must be), the answer is most likely going to be determined by just how many such queries your hardware and software are able to handle, not by which database is the specified target. (“This or that database” might merely boil down to a choice between directories... functionally irrelevant.)

      It seems to me that he's polling several databases sequentially. Assuming the servers aren't all sharing the same CPU, disks, and network cables, doing this in parallel seems an obvious and worthwhile win.

      Regardless of whether the queries are optimal or not.

Re: Parallel running the process
by i5513 (Monk) on Mar 06, 2012 at 23:40 UTC
    Hi,
    I recommend using pdsh.
    In your case it would be:
    • modify your script to receive only a database name as its parameter (and to do only the work needed for that database); the databases file contains all the databases, one per line (see the sketch after this list)
    • $ pdsh -w^databases -R exec perl-script %h | tee outputs
    • then process the outputs (see dshbak), probably with a script which collects the info
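
    As a minimal sketch of such a per-database script (untested; the DBI driver, credentials, and query are placeholders):

    #!/usr/bin/perl
    # Hypothetical single-database version: pdsh replaces %h with one
    # name from the 'databases' file and runs this once per database.
    use strict;
    use warnings;
    use DBI;

    my $db = shift @ARGV or die "usage: $0 <database>\n";

    my $dbh = DBI->connect("dbi:mysql:database=$db", 'user', 'password',
                           { RaiseError => 1 });
    my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM users');
    $dbh->disconnect;

    print "$db\t$count\n";    # dshbak can then group the collected output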
    I hope that helps.
Re: Parallel running the process
by marioroy (Acolyte) on Nov 25, 2012 at 06:59 UTC

    MCE is a new Perl module recently added to CPAN. This is how one may use MCE to process in parallel. MCE is both a chunking and a parallel engine. In this case, chunk_size is set to 1. That option is not strictly needed, as calling the foreach method sets it to 1 anyway.

    The sendto method can be used to serialize data from workers to a file. MCE also provides a do method to pass data to a callback function which runs in the main process.

    $chunk_ref is a reference to an array. MCE provides both foreach and forchunk methods. In this case, the array contains only 1 entry due to chunk_size being set to 1.

    The main page at http://code.google.com/p/many-core-engine-perl/ contains three images. The 2nd one shows the bank queuing model used in MCE with chunking applied to it.

    use MCE;

    ## Parse command line argument for $database_list

    my $mce = MCE->new( max_workers => 4, chunk_size => 1 );

    $mce->foreach("$database_list", sub {
        my ($self, $chunk_ref, $chunk_id) = @_;
        my $database = $chunk_ref->[0];
        my @result = ();

        ## Query the database

        $self->sendto('file:/path/to/result.out', @result);
    });
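
    As a minimal sketch of the do-method variant mentioned above (untested; the gather_result callback name and the printed format are made up for illustration):

    use MCE;

    ## Parse command line argument for $database_list, as above

    ## Runs in the main process; receives whatever each worker sends
    sub gather_result {
        my ($database, @result) = @_;
        print "$database: @result\n";
    }

    my $mce = MCE->new( max_workers => 4, chunk_size => 1 );

    $mce->foreach("$database_list", sub {
        my ($self, $chunk_ref, $chunk_id) = @_;
        my $database = $chunk_ref->[0];
        my @result = ();    ## query the database here
        $self->do('gather_result', $database, @result);
    });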
