PerlMonks  

Parallel-processing the code

by rajaman (Novice)
on May 16, 2018 at 18:36 UTC
rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am processing a text file (code shown below) chunk by chunk. For each text chunk, I do the processing shown in the while loop in the code below, collecting the results in a hash (%hashunique). How can I do parallel processing on this code? For example, run 10 instances of the while loop in parallel, each processing different chunks of text from the input file, so that at the end all results are saved in %hashunique.

I checked some modules, but could not figure out how to apply them to my code below.

Thanks a lot!

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use re::engine::RE2;
use List::MoreUtils qw(uniq);
use Sort::Naturally;

#This program reads abstract sentence file and produces output with the following format:
#
if ($#ARGV != 1) {
    print "usage: program arguments\n";
}

my $inputfile1=$ARGV[0];
my $outputfile = $ARGV[1];
my %hashunique=();

open(RF, "$inputfile1") or die "Can't open < $inputfile1: $!";
open(WF, ">$outputfile"); #open for output

$/ = ''; #this sets the delimiter for an empty line

while (<RF>) {
    my @one = split /\n/ , $_;
    my ( $indexofdashinarray ) = grep { $one[$_] =~ /\-\-/ } 0..$#one;
    for (my $i=0;$i<=$#one;$i++) {
        next if $i==0;
        next if $one[$i] =~ /^\-\-$/;
        while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
            unless ($hashunique{"D$2"}) {
                $hashunique{"D$2"}="$3";
            } else {
                $hashunique{"D$2"}=$hashunique{"D$2"}.'|'."$3";
            }
        }
    }
}

foreach my $i (nsort keys %hashunique) {
    $hashunique{$i} = join ( "\|", uniq split /\|/ , $hashunique{$i});
    print WF "$i=>$hashunique{$i}\n";
}

close (RF);
close (WF);
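
For illustration only (my own sketch, not from any of the replies, and not necessarily faster; see the notes below about this job being I/O-bound): one common fork-and-merge pattern using Parallel::ForkManager, in which each worker fills a private hash and the parent merges whatever the workers hand back. The file name, worker count and per-chunk body are placeholders.

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $inputfile = 'input.txt';    # hypothetical path; substitute your own
local $/ = '';                  # paragraph mode, chunks separated by a blank line

open my $fh, '<', $inputfile or die "Can't open $inputfile: $!";
my @chunks = <$fh>;             # fine for a sketch; a huge file would need chunked reads
close $fh;

my %hashunique;
my $pm = Parallel::ForkManager->new(10);

# The parent merges each child's private hash when that child exits.
$pm->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data_ref) = @_;
    for my $k (keys %{ $data_ref || {} }) {
        $hashunique{$k} = exists $hashunique{$k}
            ? $hashunique{$k} . '|' . $data_ref->{$k}
            : $data_ref->{$k};
    }
});

# Give each of the 10 workers roughly a tenth of the chunks.
my $per = int(@chunks / 10) + 1;
while (my @batch = splice @chunks, 0, $per) {
    $pm->start and next;        # parent loops on; the child runs the block below
    my %local;
    for (@batch) {
        # ... the body of the original while loop goes here, writing to %local ...
    }
    $pm->finish(0, \%local);    # ship the child's hash back to the parent
}
$pm->wait_all_children;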

Replies are listed 'Best First'.
Re: Parallel-processing the code
by ikegami (Pope) on May 17, 2018 at 00:52 UTC

    As previously mentioned, multi-tasking won't help. This program is I/O-bound, and merging the results of the threads would be as expensive as building the results in the first place.

    I just wanted to provide a cleaned up version of your code (with a few micro-optimizations).

    #!/usr/bin/perl
    use strict;
    use warnings;

    use feature qw( say );

    use Sort::Naturally qw( nsort );

    # Reads records from the file(s) named on the command line (or STDIN)
    # and writes the results to STDOUT.

    local $/ = '';  # Paragraph mode reads until a blank line.

    my %grouped;
    while (<>) {
        my @lines = split /\n/, $_;
        for (@lines[1..$#lines]) {
            next if $_ eq '--';  # Omit if rarely true.
            ++$grouped{"D$1"}{$2} while /\bD\*([^*]*)\*([^*]*)\*D\b/g;
        }
    }

    for my $k (nsort keys %grouped) {
        say "$k=>" . join("|", keys %{ $grouped{$k} });
    }
      Thank you Ikegami. New thing learned there.
Re: Parallel-processing the code
by marioroy (Priest) on May 17, 2018 at 04:23 UTC

    Hi rajaman,

    Hello :) Unfortunately, life is getting shorter and I have learned to skip threads like this one whenever test data is omitted, simply for lack of time. Sorry. That said, the demonstration that follows is not tested.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Data::Dumper qw(Dumper);
    use re::engine::RE2;
    use List::MoreUtils qw(uniq);
    use Sort::Naturally qw(nsort);
    use MCE;

    # This program reads an abstract sentence file and produces
    # output with the following format ...

    if ($#ARGV != 1) {
        print "usage: $0 <inputfile> <outputfile>\n";
    }

    my $inputfile1 = $ARGV[0];
    my $outputfile = $ARGV[1];

    unless (-e $inputfile1) {
        die "Can't open $inputfile1: No such file or directory";
    }

    # Make gather routine for the manager process. It returns a
    # closure block for preserving append-order as if processing
    # serially.

    my %hashunique;

    sub make_gather {
        my ($order_id, %tmp) = (1);

        return sub {
            my ($chunk_id, $hashref) = @_;
            $tmp{$chunk_id} = $hashref;

            while (exists $tmp{$order_id}) {
                $hashref = delete $tmp{$order_id};

                for my $k (keys %{ $hashref }) {
                    unless (exists $hashunique{$k}) {
                        $hashunique{$k} = $hashref->{$k};
                    } else {
                        $hashunique{$k} = $hashunique{$k}.'|'.$hashref->{$k};
                    }
                }

                $order_id++;
            }
        }
    }

    # The user function for MCE workers. Workers open a file handle to
    # a scalar ref due to using MCE option use_slurpio => 1.

    sub user_func {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my %localunique;

        open RF, '<', $slurp_ref;

        # A shared-hash is not necessary. The gist of it all is batching
        # to a local hash. Otherwise, a shared-hash inside a loop involves
        # high IPC overhead.

        local $/ = '';  # blank line, paragraph break

        # in the event worker receives 2 or more records
        while (<RF>) {
            my @one = split /\n/, $_;
            my ($indexofdashinarray) = grep { $one[$_] =~ /\-\-/ } 0..$#one;

            for my $i (1..$#one) {
                next if $one[$i] =~ /^\-\-$/;
                while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
                    unless (exists $localunique{"D$2"}) {
                        $localunique{"D$2"} = "$3";
                    } else {
                        $localunique{"D$2"} = $localunique{"D$2"}.'|'."$3";
                    }
                }
            }
        }

        close RF;

        # Each worker must call gather one time when preserving order
        # is desired which is the case for this demonstration.

        MCE->gather($chunk_id, \%localunique);
    }

    # Am using the core MCE API. Workers read the input file directly and
    # sequentially, one worker at a time.

    my $mce = MCE->new(
        max_workers => 3,
        input_data  => $inputfile1,
        chunk_size  => 2 * 1024 * 1024,  # 2 MiB
        RS          => '',               # important, blank line, paragraph break
        gather      => make_gather(),
        user_func   => \&user_func,
        use_slurpio => 1
    );

    $mce->run();

    # Results.

    open WF, ">", $outputfile or die "Can't open $outputfile: $!";

    foreach my $k (nsort keys %hashunique) {
        $hashunique{$k} = join ("\|", uniq split /\|/ , $hashunique{$k});
        print WF "$k=>$hashunique{$k}\n";
    }

    close WF;

    Regards, Mario

      Thanks very much Mario and others for your valuable input.

      I tried running your code, but it is generating blank output.

      I am appending below the input and output file formats. The input file contains over 1,000,000 chunks of sentences (e.g. user reviews), with the chunks separated by a blank line (shown below). I am trying to extract some pre-tagged patterns from the sentences: for example, extract D*ID1*Spore1 game*D from a sentence and then separate the ID of the game from its name; all names for an ID are later concatenated, as shown in the output format below.

      Please let me know how your MCE-based code needs to be modified.

      Thanks once again.

      
      Input file format:
      1
      --
      A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.
      D*ID2*Spore2*D is not that type of game.
      That is why I gave D*ID1*Spore1*D a 3 star.
      
      2
      --
      D*ID2*Spore2*D is a wonderful game.
      A new DVD with both the PC and Mac release for EA's D*ID1*Spore1*D.
      
      3
      --
      Once you get the D*ID1*spore1*D cursor on your screen, click command-Q.
      .
      .
      
      Output format:
      ID1=>Spore1 game|Spore1|spore1 #case sensitive unique names only in hash value
      ID2=>Spore2
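
      For what it's worth, here is a small self-contained check (my own sketch, using the same D*...*...*D tag pattern as in the replies) of what the extraction regex captures from one of the sample sentences above:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $sentence = "A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.";

      my %names;
      while ($sentence =~ /\bD\*([^*]*)\*([^*]*)\*D\b/g) {
          my ($id, $name) = ($1, $2);    # $1 = "ID1", $2 = "Spore1 game"
          $names{$id}{$name} = undef;    # keep case-sensitive unique names per ID
      }

      for my $id (sort keys %names) {
          print "$id=>", join('|', sort keys %{ $names{$id} }), "\n";
      }
      # Prints: ID1=>Spore1 game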
      
      

        Hi rajaman,

        I am appending below the input and output file formats...

        Great! I made two demonstrations, both entirely hash-key driven (two levels). The serial code, based on ikegami's demonstration, may be fast enough for your use case. The parallel demonstration may run two times faster or more. Preserving gather order is not necessary here. Be sure to have Sereal installed for maximum performance.

        Both demonstrations produce the same output.

        Serial Code

        #!/usr/bin/perl

        use strict;
        use warnings;

        use Sort::Naturally qw(nsort);

        # This program reads an abstract sentence file and produces
        # output with the following format ...

        if ($#ARGV != 1) {
            print "usage: $0 <inputfile> <outputfile>\n";
        }

        my $inputfile1 = $ARGV[0];
        my $outputfile = $ARGV[1];

        my %hashunique;

        open RF, "<", $inputfile1 or die "Can't open $inputfile1: $!";

        local $/ = '';  # blank line, paragraph break

        while (<RF>) {
            my @lines = split /\n/, $_;
          # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

            for my $i (1..$#lines) {
                next if $lines[$i] eq '--';
                while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                    $hashunique{"D$1"}{$2} = undef;
                }
            }
        }

        close RF;

        # Results.

        open WF, ">", $outputfile or die "Can't open $outputfile: $!";

        foreach my $k (nsort keys %hashunique) {
            $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
            print WF "$k=>$hashunique{$k}\n";
        }

        close WF;

        Parallel Code

        #!/usr/bin/perl

        use strict;
        use warnings;

        use Sort::Naturally qw(nsort);
        use MCE;

        # This program reads an abstract sentence file and produces
        # output with the following format ...

        if ($#ARGV != 1) {
            print "usage: $0 <inputfile> <outputfile>\n";
        }

        my $inputfile1 = $ARGV[0];
        my $outputfile = $ARGV[1];

        unless (-e $inputfile1) {
            die "Can't open $inputfile1: No such file or directory";
        }

        # Gather routine for the manager process.

        my %hashunique;

        sub gather {
            my ($hashref) = @_;
            for my $k1 (keys %{$hashref}) {
                for my $k2 (keys %{$hashref->{$k1}}) {
                    $hashunique{$k1}{$k2} = undef;
                }
            }
        }

        # The user function for MCE workers. Workers open a file handle to
        # a scalar ref due to using MCE option use_slurpio => 1.

        sub user_func {
            my ($mce, $slurp_ref, $chunk_id) = @_;
            my %localunique;

            open RF, '<', $slurp_ref;

            # A shared-hash is not necessary. The gist of it all is batching
            # to a local hash. Otherwise, a shared-hash inside a loop involves
            # high IPC overhead.

            local $/ = '';  # blank line, paragraph break

            # in the event worker receives 2 or more records
            while (<RF>) {
                my @lines = split /\n/, $_;
              # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

                for my $i (1..$#lines) {
                    next if $lines[$i] eq '--';
                    while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                        $localunique{"D$1"}{$2} = undef;
                    }
                }
            }

            close RF;

            # Call gather outside the loop.
            MCE->gather(\%localunique);
        }

        # Am using the core MCE API. Workers read the input file directly and
        # sequentially, one worker at a time.

        my $mce = MCE->new(
            max_workers => 4,
            input_data  => $inputfile1,
            chunk_size  => 1 * 1024 * 1024,  # 1 MiB
            RS          => '',               # important, blank line, paragraph break
            gather      => \&gather,
            user_func   => \&user_func,
            use_slurpio => 1
        );

        $mce->run();

        # Results.

        open WF, ">", $outputfile or die "Can't open $outputfile: $!";

        foreach my $k (nsort keys %hashunique) {
            $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
            print WF "$k=>$hashunique{$k}\n";
        }

        close WF;

        Regards, Mario

Re: Parallel-processing the code
by Marshall (Abbot) on May 17, 2018 at 00:17 UTC
    Your application doesn't appear to be well suited for parallel processing. That's because you have a single input stream, what appears to be minimal processing, and a single output DB that each thread or process would have to interact with heavily.

    I get from your question that the root problem is "I want my application to run faster". You assume that multi-processing is the answer for that and you are asking us how to do it.

    Let's back up a bit. Without sample data, it is a bit hard for me to figure out exactly what you are doing. How long does this app take? How big is the input file? What triggers a "new run"? It could be that if, say, 90% of the data is the same between runs and 10% is new, some strategy that saves the results from the 90% that was the same last time would produce the complete results faster. Anyway, before jumping to a "solution", I'd like to understand a bit more about what you are doing...

    It could also be that some relatively minor tweaks to your code could provide some performance increase, although I doubt anything truly dramatic.

    But, for example, I have one application that calculates some results hourly in a very efficient manner. At the end of the fiscal year, we recalculate everything, and it takes about 6 hours. Nobody cares that a full year of data takes 6 hours to process, because that work is spread out in very small increments on an hourly basis throughout the year. Update: we do the complete reprocessing as a "double check" on the incremental process; in theory the results are the same.

Re: Parallel-processing the code
by sundialsvc4 (Abbot) on May 16, 2018 at 21:03 UTC

    As appealing as such a notion might be at first blush, in my experience the results turn out to be disappointing. Basically, this sort of algorithm is I/O-bound, limited in its execution time by the speed of the disk drive and associated drivers and nothing else. The CPU is loafing. If you now inject multiple processes or threads into the mix, you can actually make things worse, because the disk drive is now faced with a much more unpredictable situation ... odds are that it will now be moving the read/write head back and forth much more frequently than if it were servicing requests from only one worker. A CPU that thinks in nanoseconds is now waiting for multiple milliseconds: one or more Ferraris, all stuck in traffic right next to a Yugo who might be moving along faster than they are.

    The fact that you are now updating a single shared data structure is another pinch-point. Although of course Perl can do this reliably, the workers are now obliged to wait, not only for the disk drive, but also for one another.

    The best approach, very recently discussed here, is to leverage the operating system's built-in file buffering mechanisms as aggressively as possible, so that data is read in from the disk “in great big gulps,” and to devise the entire algorithm to minimize the need to move the read/write mechanism to some other cylinder on the platter. “When I/O is necessary, make it count.”
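
    As a rough illustration of the “great big gulps” idea (my own sketch, not tested against the poster's data): read the file in large fixed-size slabs and re-split on blank lines, so the operating system services a few large sequential reads instead of many small ones. The 4 MiB slab size, file name and process_record() body are placeholders; note that Perl's paragraph mode ($/ = '') already reads through the buffered I/O layer, so this only makes the batching explicit.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = 'input.txt';    # hypothetical path
    open my $fh, '<', $file or die "Can't open $file: $!";

    my $pending = '';
    while (read($fh, my $slab, 4 * 1024 * 1024)) {    # one big gulp at a time
        $pending .= $slab;
        my @records = split /\n{2,}/, $pending, -1;
        $pending = pop @records;                      # possibly incomplete last record
        process_record($_) for grep { length } @records;
    }
    process_record($pending) if length $pending;
    close $fh;

    sub process_record {
        my ($record) = @_;
        # ... the per-chunk work from the original while loop would go here ...
    }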

    Just my two cents ...

      Hi sundialsvc4,

      What may have been true years ago may be irrelevant today. Batching minimizes IPC overhead, and MCE workers do not read the input data simultaneously. Instead, MCE workers handle parallel and serial processing automatically. Sometimes there is no involvement of the manager process at all. That said, there is a possibility that parallel may work here.

      Kind regards, Mario

      It's not usually so simple. Modern physical disks are intelligent devices, and giving them more parallel requests can allow them to optimize the head movement and lead to more global throughput at the expense of response time for an individual thread. Storage arrays and RAID levels can make a huge difference here, even if it is a single filesystem.

      That said, rajaman didn't give any information on the hardware he is using, so anything we say about how to maximize his I/O throughput is just speculation and generalizations.

      PS: My issue isn't with whether or not parallelism will help with this particular problem, but rather with the generalization that I/O-bound processes can't generally benefit from parallelism. Storage manufacturers, OS developers and systems administrators put a lot of effort into making storage work better for different workloads, so you can sometimes be surprised by what your storage can do if you put in a little effort.

        Modern physical disks are intelligent devices, and giving them more parallel requests can allow them to optimize the head movement and lead to more global throughput at the expense of response time for an individual thread. Storage arrays and RAID levels can make a huge difference here, even if it is a single filesystem.

        SSDs don't have any heads that need to be mechanically moved around, and they don't have to wait for the data on the disk to appear at the read head, so the seek times reduce to (nearly) zero. Therefore, you gain even more from parallel requests to SSDs.

        Also, your application rarely talks directly to the disk. The operating system is also trying to optimize disk access, by caching and reordering requests.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        I don't agree with many (most) of sundialsvc4's posts, but in this specific case at least some of it makes sense; please don't throw the baby out with the bathwater. In particular, the distinction between I/O-bound and CPU-bound processes does make sense, even though modern CPUs, with their multiple cores and advanced multi-layered caching strategies among other things, make it very difficult to predict performance. Only benchmarking can really sort out these things.

        But, yes, the difference between I/O-bound and CPU-bound processes is still a good starting point for trying to understand what is going on.
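
        As a rough sketch of that kind of measurement (my own, with run_one_pass() standing in for the serial processing loop): compare wall-clock time against CPU time from times(). Wall time far above CPU time means the pass is mostly waiting on I/O; the two being close means it is CPU-bound and parallelism is more likely to pay off.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        sub run_one_pass { }    # placeholder for the real serial pass over the file

        my $t0 = [gettimeofday];
        my ($u0, $s0) = (times)[0, 1];    # user and system CPU seconds so far

        run_one_pass();

        my $wall = tv_interval($t0);
        my ($u1, $s1) = (times)[0, 1];
        my $cpu  = ($u1 - $u0) + ($s1 - $s0);

        printf "wall %.2fs, cpu %.2fs (%.0f%% of wall)\n",
            $wall, $cpu, $wall ? 100 * $cpu / $wall : 0;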

        “Of course you are entitled to your Anonymous opinion,” but I daresay that my opinion has merit. All computer data-processing scenarios are either “fundamentally I/O-bound” or “fundamentally CPU-bound,” and you’d better decide very early “which is which” in your particular case. Because, if you guess wrongly, you make it worse. Perhaps much worse. Perhaps(!), extremely much worse.

        We as programmers tend to conveniently forget that there really is a piece of physical hardware out there, lurking just beyond the boundaries of our precious CPU. Yet, uncomfortably too often, that is what is actually making us stay up past our bedtime as we once again curse the dreaded pager.
