PerlMonks  

Re: Parallel-processing the code

by marioroy (Prior)
on May 17, 2018 at 04:23 UTC ( [id://1214706]=note )


in reply to Parallel-processing the code

Hi rajaman,

Hello :) Unfortunately, life is getting shorter and I have learned to skip threads like this one whenever test data is omitted, simply for lack of time. Sorry. That said, the demonstration that follows is untested.

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper qw(Dumper);
use re::engine::RE2;
use List::MoreUtils qw(uniq);
use Sort::Naturally qw(nsort);
use MCE;

# This program reads an abstract sentence file and produces
# output with the following format ...

if ($#ARGV != 1) {
    print "usage: $0 <inputfile> <outputfile>\n";
    exit 1;   # stop here rather than continue with missing arguments
}

my $inputfile1 = $ARGV[0];
my $outputfile = $ARGV[1];

unless (-e $inputfile1) {
    die "Can't open $inputfile1: No such file or directory";
}

# Make gather routine for the manager process. It returns a
# closure block for preserving append-order as if processing
# serially.

my %hashunique;

sub make_gather {
    my ($order_id, %tmp) = (1);

    return sub {
        my ($chunk_id, $hashref) = @_;
        $tmp{$chunk_id} = $hashref;

        while (exists $tmp{$order_id}) {
            $hashref = delete $tmp{$order_id};
            for my $k (keys %{ $hashref }) {
                unless (exists $hashunique{$k}) {
                    $hashunique{$k} = $hashref->{$k};
                } else {
                    $hashunique{$k} = $hashunique{$k}.'|'.$hashref->{$k};
                }
            }
            $order_id++;
        }
    }
}

# The user function for MCE workers. Workers open a file handle to
# a scalar ref due to using MCE option use_slurpio => 1.

sub user_func {
    my ($mce, $slurp_ref, $chunk_id) = @_;
    my %localunique;

    open RF, '<', $slurp_ref;

    # A shared-hash is not necessary. The gist of it all is batching
    # to a local hash. Otherwise, a shared-hash inside a loop involves
    # high IPC overhead.

    local $/ = ''; # blank line, paragraph break

    # in the event worker receives 2 or more records
    while (<RF>) {
        my @one = split /\n/, $_;
        my ($indexofdashinarray) = grep { $one[$_] =~ /\-\-/ } 0..$#one;

        for my $i (1..$#one) {
            next if $one[$i] =~ /^\-\-$/;
            while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
                unless (exists $localunique{"D$2"}) {
                    $localunique{"D$2"} = "$3";
                } else {
                    $localunique{"D$2"} = $localunique{"D$2"}.'|'."$3";
                }
            }
        }
    }

    close RF;

    # Each worker must call gather one time when preserving order
    # is desired which is the case for this demonstration.

    MCE->gather($chunk_id, \%localunique);
}

# Am using the core MCE API. Workers read the input file directly and
# sequentially, one worker at a time.

my $mce = MCE->new(
    max_workers => 3,
    input_data  => $inputfile1,
    chunk_size  => 2 * 1024 * 1024,  # 2 MiB
    RS          => '',               # important, blank line, paragraph break
    gather      => make_gather(),
    user_func   => \&user_func,
    use_slurpio => 1
);

$mce->run();

# Results.

open WF, ">", $outputfile or die "Can't open $outputfile: $!";

foreach my $k (nsort keys %hashunique) {
    $hashunique{$k} = join ("\|", uniq split /\|/ , $hashunique{$k});
    print WF "$k=>$hashunique{$k}\n";
}

close WF;
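The order-preserving closure in make_gather can be tried on its own, without MCE. What follows is only a minimal sketch of that idea: the name make_ordered_gather, the @merged array, and the simulated arrival order are made up for illustration and are not part of the demonstration above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @merged;   # payloads in the order they were released

# Returns a closure that buffers chunks arriving out of order and
# releases them strictly by ascending chunk id, starting at 1.
sub make_ordered_gather {
    my ($order_id, %pending) = (1);
    return sub {
        my ($chunk_id, $payload) = @_;
        $pending{$chunk_id} = $payload;
        while (exists $pending{$order_id}) {
            push @merged, delete $pending{$order_id};
            $order_id++;
        }
    };
}

my $gather = make_ordered_gather();

# Simulated arrival order (hypothetical): chunk 1 finishes last.
$gather->(2, 'second');
$gather->(3, 'third');
$gather->(1, 'first');

print join(',', @merged), "\n";   # prints first,second,third
```

Nothing is released until chunk 1 arrives, which is why each chunk has to report to gather exactly once, even when its local hash is empty.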

Regards, Mario

Re^2: Parallel-processing the code
by rajaman (Sexton) on May 17, 2018 at 19:34 UTC
    Thanks very much Mario and others for your valuable input.

    I tried running your code, but it is generating blank output.

    I am appending below input and output file formats. The input file has over 1,000,000 chunks of sentences (e.g. user reviews), with chunks separated by a blank line (shown below). I am trying to extract some pre-tagged patterns from the sentences. For example, extract D*ID1*Spore1 game*D from a sentence and then separate the ID of the game from its name; all names for an ID are later concatenated, as shown in the output format below.

    Please let me know how your MCE-based code needs to be modified.

    Thanks once again.

    
    Input file format:
    1
    --
    A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.
    D*ID2*Spore2*D is not that type of game.
    That is why I gave D*ID1*Spore1*D a 3 star.
    
    2
    --
    D*ID2*Spore2*D is a wonderful game.
    A new DVD with both the PC and Mac release for EA's D*ID1*Spore1*D.
    
    3
    --
    Once you get the D*ID1*spore1*D cursor on your screen, click command-Q.
    .
    .
    
    Output format:
    ID1=>Spore1 game|Spore1|spore1 #case sensitive unique names only in hash value
    ID2=>Spore2
    
    

      Hi rajaman,

      I am appending below input and output file formats...

      Great! I made two demonstrations entirely hash-key driven (2-levels). The serial code, based on ikegami's demonstration, may be fast enough for your use case. The parallel demonstration may run two times faster or more. Gather order is not necessary. Be sure to have Sereal installed for maximum performance.

      Both demonstrations produce the same output.
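To make "hash-key driven (2-levels)" concrete, here is a small sketch run against a few sentences from the posted sample. It is an illustration under assumptions, not the demonstration itself: the names %seen and @report are hypothetical, and it keys on the bare ID (rather than "D$1") so that the result lines resemble the posted output format.

```perl
use strict;
use warnings;

# Sample sentences copied from the posted input format.
my @lines = (
    "A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.",
    "D*ID2*Spore2*D is not that type of game.",
    "That is why I gave D*ID1*Spore1*D a 3 star.",
    "Once you get the D*ID1*spore1*D cursor on your screen, click command-Q.",
);

my %seen;   # first level: tag ID, second level: name (value is unused)

for my $line (@lines) {
    # Every D*<id>*<name>*D tag on the line; storing the name as a
    # hash key removes duplicates without a separate uniq pass.
    while ($line =~ /\bD\*(.*?)\*(.*?)\*D\b/g) {
        $seen{$1}{$2} = undef;
    }
}

my @report = map { "$_=>" . join('|', sort keys %{ $seen{$_} }) } sort keys %seen;
print "$_\n" for @report;
```

Because the names are sorted, they come out in string order (Spore1, Spore1 game, spore1) rather than in the first-seen order shown in the question.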

      Serial Code

      #!/usr/bin/perl

      use strict;
      use warnings;

      use Sort::Naturally qw(nsort);

      # This program reads an abstract sentence file and produces
      # output with the following format ...

      if ($#ARGV != 1) {
          print "usage: $0 <inputfile> <outputfile>\n";
          exit 1;   # stop here rather than continue with missing arguments
      }

      my $inputfile1 = $ARGV[0];
      my $outputfile = $ARGV[1];

      my %hashunique;

      open RF, "<", $inputfile1 or die "Can't open $inputfile1: $!";

      local $/ = ''; # blank line, paragraph break

      while (<RF>) {
          my @lines = split /\n/, $_;
          # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

          for my $i (1..$#lines) {
              next if $lines[$i] eq '--';
              while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                  $hashunique{"D$1"}{$2} = undef;
              }
          }
      }

      close RF;

      # Results.

      open WF, ">", $outputfile or die "Can't open $outputfile: $!";

      foreach my $k (nsort keys %hashunique) {
          $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
          print WF "$k=>$hashunique{$k}\n";
      }

      close WF;

      Parallel Code

      #!/usr/bin/perl

      use strict;
      use warnings;

      use Sort::Naturally qw(nsort);
      use MCE;

      # This program reads an abstract sentence file and produces
      # output with the following format ...

      if ($#ARGV != 1) {
          print "usage: $0 <inputfile> <outputfile>\n";
          exit 1;   # stop here rather than continue with missing arguments
      }

      my $inputfile1 = $ARGV[0];
      my $outputfile = $ARGV[1];

      unless (-e $inputfile1) {
          die "Can't open $inputfile1: No such file or directory";
      }

      # Gather routine for the manager process.

      my %hashunique;

      sub gather {
          my ($hashref) = @_;
          for my $k1 (keys %{$hashref}) {
              for my $k2 (keys %{$hashref->{$k1}}) {
                  $hashunique{$k1}{$k2} = undef;
              }
          }
      }

      # The user function for MCE workers. Workers open a file handle to
      # a scalar ref due to using MCE option use_slurpio => 1.

      sub user_func {
          my ($mce, $slurp_ref, $chunk_id) = @_;
          my %localunique;

          open RF, '<', $slurp_ref;

          # A shared-hash is not necessary. The gist of it all is batching
          # to a local hash. Otherwise, a shared-hash inside a loop involves
          # high IPC overhead.

          local $/ = ''; # blank line, paragraph break

          # in the event worker receives 2 or more records
          while (<RF>) {
              my @lines = split /\n/, $_;
              # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

              for my $i (1..$#lines) {
                  next if $lines[$i] eq '--';
                  while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                      $localunique{"D$1"}{$2} = undef;
                  }
              }
          }

          close RF;

          # Call gather outside the loop.

          MCE->gather(\%localunique);
      }

      # Am using the core MCE API. Workers read the input file directly and
      # sequentially, one worker at a time.

      my $mce = MCE->new(
          max_workers => 4,
          input_data  => $inputfile1,
          chunk_size  => 1 * 1024 * 1024,  # 1 MiB
          RS          => '',               # important, blank line, paragraph break
          gather      => \&gather,
          user_func   => \&user_func,
          use_slurpio => 1
      );

      $mce->run();

      # Results.

      open WF, ">", $outputfile or die "Can't open $outputfile: $!";

      foreach my $k (nsort keys %hashunique) {
          $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
          print WF "$k=>$hashunique{$k}\n";
      }

      close WF;
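A short sketch of why gather order does not matter with the two-level hash: merging partial results only sets keys, so the combined hash comes out the same whichever worker reports first. The helper names merge_partial and render and the two worker hashes below are hypothetical, chosen only to illustrate the point without running MCE.

```perl
use strict;
use warnings;

# Merge one worker's partial result into the combined two-level hash.
sub merge_partial {
    my ($combined, $partial) = @_;
    for my $id (keys %{$partial}) {
        $combined->{$id}{$_} = undef for keys %{ $partial->{$id} };
    }
    return $combined;
}

# Flatten a two-level hash into one comparable string.
sub render {
    my ($h) = @_;
    return join ';',
        map { "$_=>" . join('|', sort keys %{ $h->{$_} }) } sort keys %{$h};
}

# Hypothetical partial results from two workers.
my $worker_a = { ID1 => { 'Spore1 game' => undef, 'Spore1' => undef } };
my $worker_b = { ID1 => { 'spore1' => undef }, ID2 => { 'Spore2' => undef } };

my $ab = render(merge_partial(merge_partial({}, $worker_a), $worker_b));
my $ba = render(merge_partial(merge_partial({}, $worker_b), $worker_a));

print "$ab\n";
print $ab eq $ba ? "same result in either order\n" : "results differ\n";
```

The earlier string-append version depended on arrival order, which is what the ordered gather closure was compensating for; with keys-only merging and a final sort, that bookkeeping can be dropped.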

      Regards, Mario

        That's very helpful Mario. Thanks a lot!
