
Parallel processing with ForkManager

by iotarho (Initiate)
on Sep 20, 2012 at 07:36 UTC (#994602)
iotarho has asked for the wisdom of the Perl Monks concerning the following question:

I'm fairly new to Perl, so apologies if my question is poorly phrased... I'm trying to take advantage of a 64-CPU cluster plus Perl's text-grabbing/matching capabilities. I have a fairly large (~100MB) file full of names (a few names per line) that I need to look up individually, retrieving information for each one from an even larger (~5 GB) database. Would ForkManager be a good way to implement a "divide and conquer" approach in Perl? I haven't been able to find good examples of using ForkManager to read a file a line at a time and then do something with those lines.

Replies are listed 'Best First'.
Re: Parallel processing with ForkManager
by DrHyde (Prior) on Sep 20, 2012 at 10:42 UTC

    Parallel::ForkManager is certainly a good tool for managing a bunch of processes all under the control of a single "master" process which, in your case, would be the one that reads the 100MB file. However, you need to be careful.

    Things to consider include:

    • How many parallel clients can the database handle before it becomes a significant bottleneck?
    • What is the overhead of forking? It's almost certainly too high to naively fork a new process for each line in the file.
    • What do you need to do with the data retrieved from the db? Parallel::ForkManager can return data from each forked process, but it does so by serializing the data to a temporary file on disk. Will this turn into an I/O bottleneck?
    • What is the overhead of connecting to the DB, and how can you reduce that?
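A minimal sketch of how those points might shape the code, assuming names on a line are separated by whitespace: the parent reads the file once, then forks one child per batch of lines rather than per line, so that fork and connection costs are paid once per batch. The `lookup_batch` placeholder, the batch size of 10,000 and the limit of 8 children are illustrative assumptions, not values from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder lookup (an assumption): returns one result string per name.
# A real version would query the database through a handle that the
# child opens once per batch.
sub lookup_batch {
    my ($lines) = @_;
    my @results;
    for my $line (@$lines) {
        push @results, "$_\tfound" for split ' ', $line;
    }
    return @results;
}

# Hand the lines of $file to forked children in batches of $batch_size,
# keeping at most $max_procs children alive at once.
sub run {
    my ($file, $max_procs, $batch_size) = @_;
    require Parallel::ForkManager;    # CPAN module, loaded on first use

    # Slurp and close the file before forking so that no child inherits
    # a half-read filehandle. This assumes the file fits in memory.
    open my $in, '<', $file or die "Cannot open $file: $!";
    my @lines = <$in>;
    close $in;

    my $pm = Parallel::ForkManager->new($max_procs);
    while (my @batch = splice @lines, 0, $batch_size) {
        $pm->start and next;          # parent: go on to the next batch
        # child: open one DB connection here, then work through the batch
        print "$_\n" for lookup_batch(\@batch);
        $pm->finish;                  # child exits here
    }
    $pm->wait_all_children;
}

# Runs only when a file name is given on the command line.
run($ARGV[0], 8, 10_000) if @ARGV && !caller;
```

Each child prints its own results, so nothing is passed back through the parent. Output from different children can interleave, so writing one output file per child may be safer if the results must stay grouped.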
Re: Parallel processing with ForkManager
by zentara (Archbishop) on Sep 20, 2012 at 11:27 UTC
    "I haven't been able to find good examples of using ForkManager to read a file a line-at-a-time and then doing something with those lines."

    This isn't exactly your scenario, but it may help get you going.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Parallel::ForkManager;

    my $dir       = shift || '.';
    my @dirs      = get_sub_dirs($dir);
    my $max_tasks = 3;
    my $pm        = Parallel::ForkManager->new($max_tasks);
    $|++;
    my $start = time();

    for my $dir (@dirs) {
        $pm->start and next;    # parent: move on to the next directory
        printf "Begin processing %s at %d secs.....\n", $dir, time() - $start;

        # push all the $dir/files into @ARGV and search through them
        # line by line (an empty @ARGV would make <ARGV> read STDIN)
        if (@ARGV = grep { -f } glob("$dir/*")) {
            while (<ARGV>) {
                # find some search term, using "perl" for example,
                # and stop at the first match
                if (/perl/) {
                    print "$ARGV: $. :$_\n";
                    last;
                }
                close ARGV if eof;    # reset $. for the next file
            }
        }

        printf ".... %s done at %d secs!\n", $dir, time() - $start;
        $pm->finish;    # child exits here
    }
    $pm->wait_all_children;
    print " all done\n";
    exit;

    ##########################################################
    sub get_sub_dirs {
        my $dir = shift;
        opendir my $dh, $dir or die "Error: $!";
        # test and return full paths, so this works from any directory
        my @dirs = map { "$dir/$_" }
                   grep { !/^\.\.?$/ && -d "$dir/$_" } readdir $dh;
        closedir $dh;
        return @dirs;
    }

Re: Parallel processing with ForkManager
by BrowserUk (Pope) on Sep 20, 2012 at 15:09 UTC

    You'd almost certainly be better off parsing the names from the file and bulk-loading them into a temporary table within the DB. Then issue a single join to select the information you need into another temporary table; and finally dump that to a CSV for further processing if needed.

    The database is likely to make far more effective use of the threading available to it that way, than having to serialise thousands of concurrent (effectively identical) queries from 64 different clients.
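A rough sketch of that approach with DBI, under several assumptions: the permanent table `people(name, info)`, the temporary table `wanted`, and whitespace-separated names are all hypothetical, since the thread does not describe the schema or the file layout. Prepared inserts inside one transaction stand in for the database's native bulk loader, and the join result is printed tab-separated rather than being selected into a second temporary table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: names on a line are separated by whitespace.
sub names_from_line {
    my ($line) = @_;
    return split ' ', $line;
}

# Hypothetical schema: a permanent table people(name, info).
my $CREATE = 'CREATE TEMPORARY TABLE wanted (name VARCHAR(255))';
my $INSERT = 'INSERT INTO wanted (name) VALUES (?)';
my $JOIN   = 'SELECT p.name, p.info FROM people p '
           . 'JOIN (SELECT DISTINCT name FROM wanted) w ON w.name = p.name';

sub run {
    my ($dsn, $file) = @_;
    require DBI;                      # CPAN module, loaded on first use
    my $dbh = DBI->connect($dsn, undef, undef,
                           { RaiseError => 1, AutoCommit => 0 });

    # Load every name into the temporary table in a single transaction.
    $dbh->do($CREATE);
    my $sth = $dbh->prepare($INSERT);
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        $sth->execute($_) for names_from_line($line);
    }
    close $in;
    $dbh->commit;

    # One join does all the lookups inside the database.
    my $rows = $dbh->prepare($JOIN);
    $rows->execute;
    while (my $row = $rows->fetchrow_arrayref) {
        print join("\t", map { $_ // '' } @$row), "\n";
    }
    $dbh->commit;
    $dbh->disconnect;
}

# Runs only when a DSN and a file name are given on the command line.
run(@ARGV) if @ARGV == 2 && !caller;
```

Only one client connects, so the database sees a single load and a single query instead of thousands of small ones.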


Re: Parallel processing with ForkManager
by sundialsvc4 (Abbot) on Sep 20, 2012 at 13:34 UTC

    I tend to agree that it probably would be stoppered-up by the capacity of the database server.   And let’s face it ... neither a 100MB text-file nor a 5GB database is, by today’s standards, that large.   Maybe you could make some read-only copies of the database at various places.   Maybe you could optimize the search process in the database in some useful way.   In general, I just think that trying to cluster this thing is going to be a lot of trouble, for doubtful benefit.

    Clustering works really well when the workload is primarily CPU-bound and when there is no resource contention.   Here, neither is the case.

    Edit:   BrowserUk’s subsequent recommendation in this thread, to use temporary tables and a join query, is in my view unquestionably the best approach to take in this case.   Now, nothing but the bulk move-in and the bulk move-out is “happening over the wire.”   The computer gets the essential job done in one step, and strictly within its own optimized world.

Re: Parallel processing with ForkManager
by bennymack (Pilgrim) on Sep 21, 2012 at 13:03 UTC

    While I don't completely understand the work you're trying to do, I suggest checking out GNU Parallel. It's usually a good fit for this type of stuff. It supports the Unix pipeline philosophy quite well and lets you, the programmer, worry about the algorithm while keeping it separate from the burden of scheduling tasks, distributing work, etc.

    So, for example, create a simple program that does the lookups line by line, then call parallel with the --pipe option. It will chunk up your input file and run your program on all available cores.
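A sketch of what such a worker could look like, assuming whitespace-separated names. The `lookup` placeholder, the script name `lookup.pl`, the file names and the DSN argument are all hypothetical. The worker only reads lines from its input and writes results to its output, leaving the chunking and scheduling to parallel.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder lookup (an assumption); a real version would query the
# database named by $dsn through a connection opened once per chunk.
sub lookup {
    my ($dsn, $name) = @_;
    return "$name\t(looked up via $dsn)";
}

# Read lines from $in, look up every name on each line, and write one
# result line per name to $out.
sub filter {
    my ($in, $out, $dsn) = @_;
    while (my $line = <$in>) {
        print {$out} lookup($dsn, $_), "\n" for split ' ', $line;
    }
}

# Intended invocation (script and file names are hypothetical):
#   parallel --pipe --block 10M ./lookup.pl DSN < names.txt > results.txt
# Runs only when the DSN is given on the command line.
filter(\*STDIN, \*STDOUT, $ARGV[0]) if @ARGV == 1 && !caller;
```

With --pipe, parallel splits standard input on line boundaries into blocks of roughly the --block size and feeds each block to a separate copy of the worker, so the script itself needs no forking logic.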
