
Parallel::Loops and HoAs

by drmrgd (Beadle)
on Sep 29, 2013 at 21:50 UTC (#1056260=perlquestion)
drmrgd has asked for the wisdom of the Perl Monks concerning the following question:

Although still very much a perl newbie, I'm becoming more and more comfortable with the language, and have started experimenting with parallel processing. For fun I previously put together this simple script, which reads in a file of search terms (one per line) and a bunch of .pdf files, searches the .pdf files for the terms, and prints a csv list of each term and the .pdf in which it was found. Now I'd like to parallelize the script to speed it up, as a way to understand how this kind of thing works, and the Parallel::Loops module seemed like a good place to start. However, I'm not getting the same results from the single-core trial as from the multicore trial, and I think it's because I'm storing the results in a HoA.

From the CPAN page: "Also, if two different children set a value for the same key, a random one of them will be seen by the parent." So it might be that this is why the results are incorrect. Does anyone have a suggestion on how I can do this? I can't seem to figure out how to store all of the results in a master results HoA that I can print out at the very end.
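To make the clobbering concrete, here is a minimal core-Perl sketch (no Parallel::Loops needed) of what effectively happens when two children write to the same shared key. The %child1/%child2 hashes are hypothetical stand-ins for each child's writes, which the parent then merges back key by key:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-child result hashes: each child greps its own .pdf
# and records term => [ pdfs ] in its copy of the shared hash.
my %child1 = ( this => ['a.pdf'], command => ['a.pdf'] );
my %child2 = ( this => ['b.pdf'] );

# The parent merges whole keys back, one child at a time -- which is
# effectively what happens with a shared hash when two children set
# the same key:
my %merged;
$merged{$_} = $child1{$_} for keys %child1;
$merged{$_} = $child2{$_} for keys %child2;   # clobbers child1's 'this' entry

# 'this' now holds only b.pdf; a.pdf has been silently lost.
print scalar @{ $merged{this} }, "\n";   # prints 1, not the expected 2
```

Running the single-core version is equivalent to pushing onto one hash throughout, so it keeps both pdfs under 'this'; the merge above is where the multicore run diverges.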

Here's the script, which I've set up so you can swap between running it in single-core mode and multicore mode. I'm not including my test data, but I'm just using a generic set of terms (literally stuff like "this", "and", "command", etc.) and any old .pdf files I have lying around. If you need more info or data, let me know.

#!/usr/bin/perl
# Read in a file with search terms and a list of .pdf files, and output
# the search term and the .pdf file the term was found in as a .csv
# output.  Can be expanded to potentially also list
use warnings;
use strict;
use Parallel::Loops;
use Data::Dumper;

my $lookup_file = shift;
my $par         = shift;
my @pdf_files   = @ARGV;
my %results;

# Read in the lookup file and store query terms in hash (just in case
# can be useful later)
open( my $lookup_fh, "<", $lookup_file )
    || die "Can't open the lookup file '$lookup_file': $!";
my %lookup = map { chomp; $_ => 1 } <$lookup_fh>;
close( $lookup_fh );

if ( $par == 1 ) {
    &par_proc( \%lookup, \@pdf_files, \%results );
}
else {
    &std_proc( \%lookup, \@pdf_files );
}

# Print out the results
for my $search_term ( sort keys %results ) {
    print join( ",", $search_term, $_ ), "\n"
        for ( sort { $a cmp $b } @{ $results{$search_term} } );
}

sub std_proc {
    # Iterate over search terms to look up files and spit results out
    # to a new hash
    my $lookup = shift;
    my $pdfs   = shift;

    foreach ( @$pdfs ) {
        my $pdf = $_;
        open( my $pdf_fh, "-|", "pdftotext $pdf -" )
            || die "Error converting file: $pdf";
        my @data = <$pdf_fh>;
        for my $search_term ( keys %$lookup ) {
            push( @{ $results{$search_term} }, $pdf )
                if ( grep { $_ =~ /$search_term/ } @data );
        }
    }
}

sub par_proc {
    # Parallel iteration method
    my $lookup  = shift;
    my $pdfs    = shift;
    my $results = shift;

    # Set up parallel loop processing.
    my $maxProcs = 12;
    my $pl = Parallel::Loops->new($maxProcs);
    $pl->share($results);

    $pl->foreach( $pdfs, sub {
        my $pdf = $_;
        open( my $pdf_fh, "-|", "pdftotext $pdf -" )
            || die "Error converting file: $pdf";
        my @data = <$pdf_fh>;
        for my $search_term ( keys %$lookup ) {    # was: keys $lookup
            push( @{ $results{$search_term} }, $pdf )
                if ( grep { $_ =~ /$search_term/ } @data );
        }
    } );
}

Replies are listed 'Best First'.
Re: Parallel::Loops and HoAs
by Athanasius (Chancellor) on Sep 30, 2013 at 14:21 UTC

    Hello drmrgd,

    Your diagnosis is correct: the same hash key is being assigned to by two or more children, but only one such assignment makes it back to the parent.

    Assuming your pdf files are unique — and noticing that each pdf file is processed exactly once — one strategy for tackling this problem is to reverse the order of keys and values when assigning to the %results hash, and then invert the hash once all the processing has completed. (Update: This avoids the multiple-assignment problem, because each new hash entry’s key is guaranteed to be unique to the thread in which it is added.)

    Here is a proof-of-concept implementation:
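    (The original proof-of-concept code appears to have dropped out of this copy of the thread; the sketch below is a reconstruction of the inverted-hash idea described above, using made-up data, not Athanasius's exact code. Inside the parallel loop, each child would write only to its own pdf's key, e.g. `$by_pdf{$pdf} = [ grep { ... } keys %$lookup ]`, so no two children ever touch the same key.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical %by_pdf, as the children would leave it after the
# parallel loop: one key per pdf, holding the terms found in it.
my %by_pdf = (
    'a.pdf' => [ 'this', 'command' ],
    'b.pdf' => [ 'this' ],
);

# Once all processing has completed, the parent inverts the hash back
# into the desired term => [ pdfs ] shape:
my %results;
for my $pdf ( keys %by_pdf ) {
    push @{ $results{$_} }, $pdf for @{ $by_pdf{$pdf} };
}

# Same csv output as the original script:
for my $term ( sort keys %results ) {
    print join( ",", $term, $_ ), "\n" for sort @{ $results{$term} };
}
```

    Because every hash key a child creates is unique to that child, nothing is clobbered in the merge, and the single inversion pass at the end restores the term-keyed HoA the script wants to print.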


    • Your problem has nothing to do with reading the text from pdf files. You would probably have received a faster response by presenting a minimal script demonstrating the problem you are seeing (see How do I post a question effectively?).
    • Don’t call subroutines with a prepended & unless you have a good reason to circumvent prototypes (and you don’t!). See perlsub.
    • $_ =~ /$search_term/ can be written more succinctly as just /$search_term/.

    Hope that helps,

    Athanasius <°(((>< contra mundum

      Ahhh! That works perfectly! I thought I had tried that without success originally, but I clearly didn't. Thanks so much for the help in understanding this. This is definitely encouraging me to dabble more with multi-core processing of my work (which will be a big help and improvement!).

      Also thanks for the tips and notes. As a burgeoning perl neophyte, I'm still learning my way around, and your notes are certainly helpful. Thank you again!

Approved by mtmcc