
Although still very much a Perl newbie, I'm becoming more and more comfortable with the language, and have started experimenting with parallel processing. For fun, I previously put together a simple script that reads a list of search terms from a file (one per line) and a set of .pdf files, searches the .pdf files for the terms, and prints a CSV list of each term and the .pdf in which it was found. Now I'd like to parallelize the script to speed it up, as a way to learn how to do this kind of thing, and the Parallel::Loops module seemed like a good place to start. However, I'm not getting the same results from the multicore run as from the single-core run, and I think it's because I'm storing the results in a HoA (hash of arrays).

From the CPAN page: "Also, if two different children set a value for the same key, a random one of them will be seen by the parent." So this might be why the results are not correct. Does anyone have a suggestion for how I can do this? I can't figure out how to collect all of the results into a master results HoA that I can print out at the very end.

Here's the script, written so that you can switch between single-core and multicore mode. I'm not including my test data, but I'm just using a generic set of terms (literally stuff like "this", "and", "command", etc.) and any old .pdf files I have lying around. If you need more info or data, let me know.

#!/usr/bin/perl
# Read in a file with search terms and a list of .pdf files, and output the search term and the .pdf file the term
# was found in as a .csv output. Can be expanded to potentially also list
use warnings;
use strict;
use Parallel::Loops;
use Data::Dumper;

my $lookup_file = shift;
my $par         = shift;
my @pdf_files   = @ARGV;
my %results;

# Read in the lookup file and store query terms in hash (just in case can be useful later)
open( my $lookup_fh, "<", $lookup_file ) || die "Can't open the lookup file '$lookup_file': $!";
my %lookup = map { chomp; $_ => 1 } <$lookup_fh>;
close( $lookup_fh );

if ( $par == 1 ) {
    &par_proc( \%lookup, \@pdf_files, \%results );
} else {
    &std_proc( \%lookup, \@pdf_files );
}

# Print out the results
for my $search_term ( sort keys %results ) {
    print join( ",", $search_term, $_ ), "\n" for ( sort { $a cmp $b } @{$results{$search_term}} );
}

sub std_proc {
    # Iterate over search terms to look up files and spit results out to a new hash
    my $lookup = shift;
    my $pdfs   = shift;

    foreach ( @$pdfs ) {
        my $pdf = $_;
        open( my $pdf_fh, "-|", "pdftotext $pdf -" ) || die "Error converting file: $pdf";
        my @data = <$pdf_fh>;
        for my $search_term ( keys %$lookup ) {
            push( @{$results{$search_term}}, $pdf ) if ( grep { $_ =~ /$search_term/ } @data );
        }
    }
}

sub par_proc {
    # Parallel iteration method
    my $lookup  = shift;
    my $pdfs    = shift;
    my $results = shift;

    # Set up parallel loop processing.
    my $maxProcs = 12;
    my $pl = Parallel::Loops->new($maxProcs);
    $pl->share($results);

    $pl->foreach ( $pdfs, sub {
        my $pdf = $_;
        open( my $pdf_fh, "-|", "pdftotext $pdf -" ) || die "Error converting file: $pdf";
        my @data = <$pdf_fh>;
        for my $search_term ( keys $lookup ) {
            push( @{$results{$search_term}}, $pdf ) if ( grep { $_ =~ /$search_term/ } @data );
        }
    });
}
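
One possible way around the same-key limitation quoted from CPAN above is to have every child write under a key that only it uses (the .pdf file name), and to build the term-keyed HoA in the parent once all children have finished. The sketch below shows only that final merge step, in core Perl. The names %per_pdf and invert_results and the sample data are made up for illustration, and it assumes each child could store an array ref of matched terms under its own .pdf name in the shared hash.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what the children would produce: one key per PDF,
# so no two children ever set a value for the same key. In the parallel
# version each child would do something like: $per_pdf{$pdf} = [ @matched ];
my %per_pdf = (
    'a.pdf' => [ 'this', 'and' ],
    'b.pdf' => [ 'and',  'command' ],
    'c.pdf' => [],
);

# Invert { pdf => [terms] } into the HoA { term => [pdfs] } in the parent.
sub invert_results {
    my ($per_pdf) = @_;
    my %by_term;
    for my $pdf ( sort keys %$per_pdf ) {
        push @{ $by_term{$_} }, $pdf for @{ $per_pdf->{$pdf} };
    }
    return \%by_term;
}

my $results = invert_results( \%per_pdf );

# Same CSV output format as the original script.
for my $term ( sort keys %$results ) {
    print join( ",", $term, $_ ), "\n" for @{ $results->{$term} };
}
```

Because the .pdf names are iterated in sorted order, each term's list of files comes out already sorted, so the print loop needs no second sort.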

In reply to Parallel::Loops and HoAs by drmrgd
