UPDATED: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations

by ImJustAFriend (Scribe)
on Aug 08, 2014 at 14:28 UTC ( [id://1096766] )

ImJustAFriend has asked for the wisdom of the Perl Monks concerning the following question:

Good morning Monks. I'm working on a concept script right now to replace a KSH script we use (concept = not final var names, debug statements, no real comments, etc.). The script parses through an input file that can be up to 5 million lines long. It takes that data and does two things with it: builds an array out of all the values in one location via "split", and takes a substring of those same values to use as hash keys. What I then want it to do is loop through the hash keys and, for each one, fork a new process in parallel to grep through the array for the value and get a count. My code is below. The issues I am asking for help on are threefold:

1.) The forking works, but it seems to get done serially. I would like it to happen in parallel for speed purposes.
2.) I am having issues passing the hash into the sub and/or returning the data back out. I get the following errors every run (multiple instances of each), and my "out" files don't get written successfully:
a. Use of uninitialized value in concatenation (.) or string at ./GetNPANXXCount.pl line 98.
b. Use of uninitialized value in numeric comparison (<=>) at ./GetNPANXXCount.pl line 101.
c. Use of uninitialized value in printf at ./GetNPANXXCount.pl line 102.
3.) It seems like there would be some room for improvement on this code for efficiency to speed things along, but I can't seem to find it. Any suggestions would be very appreciated!!!

Here's my code thus far:

#Sample Line From BIGFILE
#{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}

my %minhash = ();
my %npanxxhash = ();
my @npanxxarray;
my $key;
my $value;
my $npanxx;
my $npanxxcnt;

my $in      = "BIGFILE.out.gz";
my $out_min = "npanxx_minsort.out";
my $out_cnt = "npanxx_cntsort.out";

open IN, "/bin/gunzip -c $in |" or die "IN: $!\n";
open OUT_MIN, ">", "$out_min" or die "OUT_MIN: $!\n";
open OUT_CNT, ">", "$out_cnt" or die "OUT_CNT: $!\n";

print "Time: " . time . "\n";
print "Processing $in...\n";

while (<IN>) {
    if ( $_ =~ m/^{.*$/ ) {
        #Grab 9999991234 from line above
        my ($a,$MIN,$c,$d,$e,$f) = split( /,/ );
        $minhash{$MIN} = undef;
    }
}
close IN;

print "Time: " . time . "\n";
print "Massaging Data...\n";

while ( ($key, $value) = each(%minhash) ) {
    #Get just 999999 from above
    $npanxx = substr($key, 0, 6);
    push(@npanxxarray, $npanxx);
    $npanxxhash{$npanxx} = undef;
}
undef $key;
undef $value;

print "Time: " . time . "\n";
print "Getting Counts...\n";

foreach $key (sort keys %npanxxhash) {
    &CountAndHash($key,\@npanxxarray,\%npanxxhash);
#    $npanxxcnt = grep (/$key/, @npanxxarray);
#    $npanxxhash{$key} = $npanxxcnt;
}

print "Time: " . time . "\n";
print "Generating Flat Files...\n";

foreach $key (sort keys %npanxxhash) {
    print OUT_MIN "$key $npanxxhash{$key}\n";
}

foreach $key (sort { $npanxxhash{$a} <=> $npanxxhash{$b} } keys %npanxxhash) {
    printf OUT_CNT "%-7s %s\n", $key, $npanxxhash{$key};
}

print "Time: " . time . "\n";
print "Complete...\n";

sub CountAndHash {
    my ($key, $arrayref, $hashref) = @_;
    my %hashref;

    if (!defined(my $pid = fork())) {
        die "Cannot fork to child: $!\n";
    } elsif ($pid == 0) {
        #print "Launching child process...\n";
        $npanxxcnt = grep (/$key/, $arrayref);
        $hashref{$key} = $npanxxcnt;
        exit;
    } else {
        my $ret = waitpid($pid,0);
        print "PID $ret completed...\n";
    }
    return ($npanxxcnt, $hashref);
}

Thanks in advance for your help!!

UPDATE UPDATE UPDATE 2014-08-09

Thanks to one and all for your assistance. I am going to abandon this question as I have totally redone my code per aitap's suggestion below - but now I have a question related to the new code that's not pertinent here.

Thanks again for the help, monks!!


Replies are listed 'Best First'.
Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by aitap (Curate) on Aug 08, 2014 at 15:32 UTC

    3.) It seems like there would be some room for improvement on this code for efficiency to speed things along, but I can't seem to find it. Any suggestions would be very appreciated!!!
    Unless I misunderstood your code, you don't use %minhash for anything except creating @npanxxarray and %npanxxhash. Maybe it's possible to replace these three variables with only one? Let's see. Processing large flat files is usually done within a single loop which reads the file and does the computations, caching as little data as possible. Why not trim and count your values as you get them? Like this:
    my %counts;
    while (<IN>) {
        next unless /^{.*$/;              # skip non-matching lines
        my $min = (split ',')[1];         # get the value
        $counts{substr($min,0,6)} += 1;   # found another one!
    }
    # at this point %counts is like your %npanxxhash, but without all the temporary variables
    (code is untested, sorry)
    Counting values like this will take much less time than the I/O required to read the file, so there is no need for multi-threading here.
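    For completeness, the two flat files from the original script could then be written straight from %counts; a possible (untested) sketch, reusing the OUT_MIN and OUT_CNT handles opened in the question's code:

    # key-sorted output, like the original OUT_MIN loop
    for my $npanxx (sort keys %counts) {
        print OUT_MIN "$npanxx $counts{$npanxx}\n";
    }

    # count-sorted output, like the original OUT_CNT loop
    for my $npanxx (sort { $counts{$a} <=> $counts{$b} } keys %counts) {
        printf OUT_CNT "%-7s %s\n", $npanxx, $counts{$npanxx};
    }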

    Aside from that, a child process can't modify variables in its parent; you would need to use threads and shared variables instead.
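    If you did go that way, here is a minimal, untested sketch of the shared-variable approach, assuming the @npanxxarray and %npanxxhash names from the question (a real script would probably also cap the number of threads):

    use threads;
    use threads::shared;

    my %shared_counts :shared;              # one hash visible to every thread

    my @workers;
    for my $key (sort keys %npanxxhash) {
        push @workers, threads->create(sub {
            # each thread counts the matches for its own key
            my $n = grep { /$key/ } @npanxxarray;
            lock(%shared_counts);
            $shared_counts{$key} = $n;
        });
    }
    $_->join for @workers;                  # wait for every thread to finish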

      aitap, you are a GENIUS!! It never occurred to me to go this route... consider me slapping my forehead. The "counts" hash line was the key, and something I had not thought of! I am testing now, but it looks REAL promising!!

      Thank you SO much!!

Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by atcroft (Abbot) on Aug 08, 2014 at 14:52 UTC

    I have not looked through all of the code, but my thoughts at first glance were:

    1. look at Parallel::ForkManager for handling the forking, as it allows you to limit the number of forks, and newer versions allow you to pass data back to the parent easily (see the sketch below), and
    2. as you appear to be looking initially at a CSV file, perhaps Text::CSV might make the field handling easier.

    Hope that helps.
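    To illustrate point 1, a rough, untested sketch of how Parallel::ForkManager's data-passing could look for this problem, borrowing the @npanxxarray and %npanxxhash names from the question (the limit of 4 children and the hash sent back via finish() are just illustrative choices, and the data-passing needs a reasonably recent version of the module):

    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);     # at most 4 children at once

    # collect each child's result as it exits
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $signal, $core, $data) = @_;
        $npanxxhash{ $data->{key} } = $data->{count} if $data;
    });

    for my $key (sort keys %npanxxhash) {
        $pm->start and next;                    # parent: launch the next child
        my $count = grep { /$key/ } @npanxxarray;
        $pm->finish(0, { key => $key, count => $count });   # child: send data back
    }
    $pm->wait_all_children;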

Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by GotToBTru (Prior) on Aug 08, 2014 at 15:20 UTC

    You need to dereference the hashref in your subroutine.

    $hashref{$key} = $npanxxcnt;

    needs to be:

    $$hashref{$key} = $npanxxcnt;

    or

    ${$hashref}{$key} = $npanxxcnt;

    Also, your subroutine returns values but they are not stored anywhere in the calling program.
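    For instance, the calling loop could capture and store what the sub returns (variable names taken from the question); note that with the fork still in place the count is only ever computed in the child process, so this change alone won't populate the hash:

    foreach my $key (sort keys %npanxxhash) {
        my ($count, $href) = CountAndHash($key, \@npanxxarray, \%npanxxhash);
        $npanxxhash{$key} = $count if defined $count;   # store the returned count
    }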

    1 Peter 4:10
      Personally, I find
      $hashref->{$key} = $npanxxcnt;
      to be a more palatable form.
Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by graff (Chancellor) on Aug 09, 2014 at 15:46 UTC
    Apart from the points mentioned by others, the reason the script isn't really doing anything in parallel is that on each iteration of this loop:
    foreach $key (sort keys %npanxxhash) {
        &CountAndHash($key,\@npanxxarray,\%npanxxhash);
    }
    The parent process in the subroutine is calling "waitpid" on its child process, and so it doesn't return until the child process is done. I don't do parallel stuff much - I hope the suggestion above about Parallel::ForkManager will be useful, but short of using that, I think the thing you might want to try is to have the subroutine return the pid of the child; push that onto an array or hash in the foreach loop, and then after that loop is done (while children are still running), call waitpid repeatedly until there are no more children pending. (Or something to that effect… again, I'm not an expert on this.)
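    A rough, untested sketch of that launch-then-reap pattern, keeping the question's variable names (it leaves aside the separate problem of getting the counts back from the children, which still needs some form of IPC or a module like Parallel::ForkManager):

    my @pids;
    for my $key (sort keys %npanxxhash) {
        my $pid = fork();
        die "Cannot fork: $!\n" unless defined $pid;
        if ($pid == 0) {
            # child: do the counting for this key, then exit
            my $count = grep { /$key/ } @npanxxarray;
            exit 0;
        }
        push @pids, $pid;        # parent: remember the child, but don't wait yet
    }
    # only after every child has been launched do we reap them
    waitpid($_, 0) for @pids;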

    Also, this is a minor point, but on 5 million lines of input the difference could be noticeable - instead of this:

    while (<IN>) {
        if ( $_ =~ m/^{.*$/ ) {
            #Grab 9999991234 from line above
            my ($a,$MIN,$c,$d,$e,$f) = split( /,/ );
            $minhash{$MIN} = undef;
        }
    }
    Try this -- note the difference in the regex and split (the syntax changes are just style preferences):
    while (<IN>) {
        next unless ( /^{/ );   ## we only need to check the first character
        #Grab 9999991234 from line above
        my $MIN = ( split /,/ )[1];   ## we only need to assign one variable
        $minhash{$MIN} = undef;
    }
