Adding parallel processing to a working Perl script

by Jim (Curate)
on Apr 21, 2014 at 07:59 UTC

Jim has asked for the wisdom of the Perl Monks concerning the following question:

I'm struggling mightily to add parallel processing to an otherwise working script using Parallel::ForkManager. Here's the working script with the forking code commented out:

#!perl
#
# CountFilesRecords.pl

use strict;
use warnings;

use Capture::Tiny qw( capture_stdout );
use English qw( -no_match_vars );
use File::Glob qw( bsd_glob );
#use Parallel::ForkManager;
use Text::CSV_XS;

@ARGV or die "Usage: perl $PROGRAM_NAME <export volume folder> ...\n";

# Expand globs...
local @ARGV = map { $ARG =~ tr{\\}{/}; bsd_glob($ARG) } @ARGV;

local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_AUTOFLUSH        = 1;

#my $MAXIMUM_BATCH_SIZE = 4;

my @CSV_FIELD_LABELS = qw(
    ExportVolumeFolder
    TotalDATRecords
    TotalTextFiles
    TotalLFPRecords
    TotalImageFiles
);

for my $volume_folder (@ARGV) {
    -d $volume_folder
        or die "Export volume folder $volume_folder doesn't exist\n";
}

my @volume_folders;
my %stuff_by;

VOLUME_FOLDER:
for my $volume_folder (@ARGV) {
    my $volume_name   = (split m{/}, $volume_folder)[-1];
    my $text_folder   = "$volume_folder/TEXT";
    my $images_folder = "$volume_folder/IMAGES";
    my $dat_file      = "$volume_folder/$volume_name.dat";
    my $lfp_file      = "$volume_folder/$volume_name.lfp";

    # Check for completed export volumes, report incomplete ones...
    unless (-d $text_folder && -d $images_folder && -f $dat_file && -f $lfp_file) {
        select STDERR;
        print $volume_folder;
        select STDOUT;
        next VOLUME_FOLDER;
    }

    push @volume_folders, $volume_folder;

    $stuff_by{$volume_folder} = {
        FOLDER_NAME => $volume_folder,
        TEXT_FILES  => {
            COMMAND => qq( find "$text_folder" -type f -name "*.txt" | wc -l ),
            COUNT   => 0,
        },
        IMAGE_FILES => {
            COMMAND => qq( find "$images_folder" -type f ! -name Thumbs.db | wc -l ),
            COUNT   => 0,
        },
        DAT_RECORDS => {
            COMMAND => qq( wc -l "$dat_file" ),
            COUNT   => 0,
        },
        LFP_RECORDS => {
            COMMAND => qq( wc -l "$lfp_file" ),
            COUNT   => 0,
        },
    };
}

# Quit if there are no completed export volume folders...
exit 1 unless @volume_folders;

my $csv = Text::CSV_XS->new();

# Print CSV header...
$csv->print(\*STDOUT, \@CSV_FIELD_LABELS);

#my $manager = Parallel::ForkManager->new($MAXIMUM_BATCH_SIZE);

VOLUME_PROBE:
for my $volume_folder (@volume_folders) {
    #$manager->start() and next VOLUME_PROBE;
    probe_volume($stuff_by{$volume_folder});
    #$manager->finish();
}

#$manager->wait_all_children();

exit 0;

sub probe_volume {
    my $vol = shift;

    for my $stuff (qw( TEXT_FILES IMAGE_FILES DAT_RECORDS LFP_RECORDS )) {
        (undef, $vol->{$stuff}{COUNT})
            = capture_stdout { count_stuff($vol->{$stuff}{COMMAND}) };
    }

    # The first line of every DAT file is a header
    $vol->{DAT_RECORDS}{COUNT}--;

    my @results = (
        $vol->{FOLDER_NAME},
        $vol->{DAT_RECORDS}{COUNT},
        $vol->{TEXT_FILES}{COUNT},
        $vol->{LFP_RECORDS}{COUNT},
        $vol->{IMAGE_FILES}{COUNT},
    );

    # Print CSV record...
    $csv->print(\*STDOUT, \@results);

    return;
}

sub count_stuff {
    my $command = shift;
    my $output  = qx( $command );
    my ($count) = $output =~ m/(\d+)/;
    return $count;
}

I'm hoping some kind PerlMonk with experience using Parallel::ForkManager on Windows can spot the problem at a glance.

Thanks in advance for your gracious help.

UPDATE:  OK, I'm not wedded to Parallel::ForkManager. Is there a better way to manage parallel external processes (i.e., system calls to find and wc, capturing their standard output streams) without suffering this problem? Or is there a simple way to resolve the STDOUT problem using Parallel::ForkManager? I don't want to have to rewrite the whole script, which otherwise works fine, and I don't want to abandon using Capture::Tiny.

Replies are listed 'Best First'.
Re: Adding parallel processing to a working Perl script
by BrowserUk (Patriarch) on Apr 21, 2014 at 13:10 UTC

    Are you aware that under Windows, fork, including that used by Parallel::ForkManager, is emulated (not well) using threads?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Adding parallel processing to a working Perl script
by zentara (Archbishop) on Apr 21, 2014 at 12:02 UTC
    What error messages, if any, do you get when you try to incorporate the forking? What do you mean, it doesn't work?

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh

      Thank you for your reply, zentara.

      Ah, this is my dilemma. I get profoundly odd, inconsistent behavior.

      First, here's a successful run with the forking code commented out:

      C:\>perl CountFilesRecords.pl ABCD\???\Z*
      ExportVolumeFolder,TotalDATRecords,TotalTextFiles,TotalLFPRecords,TotalImageFiles
      ABCD/001/Z000003014V001,6,6,88,88
      ABCD/002/Z000003015V001,66,66,201,201
      ABCD/003/Z000003079V001,1,1,27,27
      ABCD/004/Z000003080V001,1,1,32767,32767
      ABCD/005/Z000003081V001,1,1,14297,14297
      ABCD/006/Z000002503V001,9,9,45,45
      ABCD/007/Z000002780V001,2106,2106,2907,2907
      ABCD/008/Z000003020V001,49,49,51,51
      ABCD/009/Z000003021V001,5,5,6,6
      ABCD/010/Z000003069V001,2851,2851,4576,4576
      ABCD/011/Z000003071V001,1259,1259,3242,3242
      ABCD/012/Z000005594V001,439,439,708,708
      ABCD/013/Z000003140V001,1,1,25,25
      ABCD/014/Z000003141V001,1,1,275,275
      ABCD/015/Z000003142V001,2,2,14,14
      ABCD/016/Z000003143V001,1,1,36,36
      ABCD/017/Z000003144V001,10,10,316,316
      ABCD/018/Z000003145V001,2,2,835,835

      C:\>

      Now, here's a sequence of runs, one immediately after the other, with the forking code restored:

      C:\>perl CountFilesRecords.pl ABCD\???\Z*
      ExportVolumeFolder,TotalDATRecords,TotalTextFiles,TotalLFPRecords,TotalImageFiles
      ABCD/001/Z000003014V001,6,,88,
      ABCD/005/Z000003081V001,1,,14297,
      ABCD/004/Z000003080V001,1,1,32767,32767

      C:\>perl CountFilesRecords.pl ABCD\???\Z*
      ExportVolumeFolder,TotalDATRecords,TotalTextFiles,TotalLFPRecords,TotalImageFiles
      ABCD/004/Z000003080V001,1,1,32767,32767

      C:\>perl CountFilesRecords.pl ABCD\???\Z*
      ExportVolumeFolder,TotalDATRecords,TotalTextFiles,TotalLFPRecords,TotalImageFiles
      ABCD/002/Z000003015V001,-1,,201,201
      ABCD/003/Z000003079V001,1,1,,27
      ABCD/001/Z000003014V001,6,6,88,88
      ABCD/006/Z000002503V001,9,9,45,45
      ABCD/008/Z000003020V001,-1,49,51,51
      ABCD/009/Z000003021V001,5,5,6,6
      ABCD/007/Z000002780V001,2106,2106,2907,2907
      ABCD/011/Z000003071V001,1259,1259,,3242
      ABCD/012/Z000005594V001,439,439,,708
      Error from open(IO::Handle=GLOB(0x2acc2e8), <&STDIN): Bad file descriptor at C:/Perl64/site/lib/Capture/Tiny.pm line 99
              Capture::Tiny::_open('IO::Handle=GLOB(0x2acc2e8)', '<&STDIN') called at C:/Perl64/site/lib/Capture/Tiny.pm line 176
              Capture::Tiny::_copy_std() called at C:/Perl64/site/lib/Capture/Tiny.pm line 346
              Capture::Tiny::_capture_tee(1, 0, 0, 0, 'CODE(0x2a1e7a8)') called at CountFilesRecords.pl line 113
              main::probe_volume('HASH(0x2962728)') called at CountFilesRecords.pl line 99
      ABCD/004/Z000003080V001,1,,32767,

      C:\>perl CountFilesRecords.pl ABCD\???\Z*
      ExportVolumeFolder,TotalDATRecords,TotalTextFiles,TotalLFPRecords,TotalImageFiles
      ABCD/001/Z000003014V001,6,6,88,88

      You can plainly see the bizarre, inconsistent output from one run to the next. The last run stalled. In fact, it's still running as I type this, neither finishing nor producing more output.

      This is why I'd hoped some kind brother on PerlMonks—one who has much more experience using Parallel::ForkManager than I do—might look at this script and immediately recognize what's wrong with it.

        Error from open(IO::Handle=GLOB(0x2acc2e8), <&STDIN): Bad file descriptor at C:/Perl64/site/lib/Capture/Tiny.pm line 99

        The problem is that each of the 'fork's is re-opening the same glob for STDIN. Under unix, this works because each fork is a new process, so the re-used glob is unique within its own process space.

        But under Windows, each 'fork' is actually just a separate thread within the same process space, so the re-used glob -- despite being cloned at the Perl level -- is trying to concurrently reuse the same underlying per-process OS buffers and data structures; with the inevitable consequences.

        When it works, it is by pure chance. Mostly it won't.
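
        Note, too, that qx() inside count_stuff() already captures each command's output, so the capture_stdout wrapper in probe_volume() is doing nothing essential there. Dropping it at least keeps Capture::Tiny's handle juggling out of the 'forked' children; whether the rest survives the fork emulation is another question. An untested sketch of the loop body:

            for my $stuff (qw( TEXT_FILES IMAGE_FILES DAT_RECORDS LFP_RECORDS )) {
                # count_stuff() runs the command with qx(), which captures
                # the command's STDOUT itself; no Capture::Tiny needed.
                $vol->{$stuff}{COUNT} = count_stuff( $vol->{$stuff}{COMMAND} );
            }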


Re: Adding parallel processing to a working Perl script
by ww (Archbishop) on Apr 21, 2014 at 12:13 UTC

    Tell us more: what DOES happen when you don't comment out the code? Tell us, as zentara asked, exactly what error messages you get; to which I add, "AND DESCRIBE WHAT HAPPENS."


    Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
    1. code
    2. verbatim error and/or warning messages
    3. a coherent explanation of what "doesn't work" actually means.

    check Ln42!

      Thank you for your reply, ww.

      Please see my follow up reply to zentara above. I'm hoping one of you can spot the root cause of the problem.

Re: Adding parallel processing to a working Perl script
by Preceptor (Deacon) on Apr 22, 2014 at 10:16 UTC

    Personally, I would be VERY wary of trying to retrofit threading. There are many big scary bugs lurking within, waiting to bite you. I would strongly recommend that you assume you'll need a rewrite from scratch, and then borrow from your original source.

    It at least looks like you're dealing with an implicitly parallel problem, so I would suggest:

    • Redraft your code so that you have a 'worker' subroutine which handles one thing at a time. (Multiple worker subs are OK, if you've different cases to handle and you want to parallelise.)
    • Use Thread::Queue, and 'feed' your worker with a queue (still unthreaded at this point).
    • Then treat your 'worker' sub as a thread, and spawn multiple.

    Here's a really basic template for what I mean: A basic 'worker' threading example.
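
    In outline, it's something like this (an untested sketch; the details are mine, not the linked node verbatim):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $work_q = Thread::Queue->new();

        sub worker {
            # Block until an item arrives; a dequeued undef is the
            # signal to shut this worker down.
            while ( defined( my $item = $work_q->dequeue() ) ) {
                # Output order is non-deterministic across workers.
                print "processing $item\n";
            }
            return;
        }

        # Spawn the workers first, then feed the queue.
        my @workers = map { threads->create( \&worker ) } 1 .. 4;

        $work_q->enqueue($_) for qw( alpha beta gamma );
        $work_q->enqueue(undef) for @workers;    # one stop marker per worker

        $_->join() for @workers;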

    (Not to denigrate the perfectly sound advice other Monks have offered. This is purely my opinion as to how I would approach your problem)

      This is a terrific response. Thank you very much, Preceptor. Your post titled A basic 'worker' threading example is exactly the kind of beginning Perl threads tutorial I was looking for. I'll study it this weekend and then try to apply its lessons to my application.

      Redraft your code such that you have a 'worker' subroutine, which handles one thing at a time.

      Here's my refactored code. My intention was to make it readily adaptable to threading. The intended 'worker' subroutine is probe_volume(). I've probably missed the mark entirely, but with guidance from you and other kind monks, I'm hoping I can finally write my first truly useful parallel program.

        I think you may still be trying to pass a bit too much back and forth. Thread::Queue is a lovely way of handling queuing, but it works best with single values. You're passing a hash into probe_volume, which works single-threaded but can get quite complicated when multithreading.

        I think you need to step back a little and consider the design: threading increases throughput through parallelism, but as a result each of your threads runs asynchronously and non-deterministically; you will never know in which order your threads will complete their tasks. You therefore can't do something like 'print probe_volume'; you'll have to collate your data and (potentially) reorder it first.

        You will also need to think about sharing variables - you pass a hash into probe_volume, and return a list. This will probably cause you pain. Sharing variables between threads is potentially quite complicated and a source of some really annoying bugs. Try to avoid doing it.

        I would therefore suggest that what you want is a 'standalone' probe_volume subroutine that takes _just_ a volume name (passed via sub call, but ideally 'fed' through a Thread::Queue) and outputs the results (again, via return value or a Thread::Queue), without using anything from the global namespace. (Read-only access to, e.g., command definitions would be OK.)
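
        Roughly this shape, as an untested sketch (all names mine): workers dequeue bare volume folder names, do their probing locally, and enqueue finished rows; only the main thread ever prints.

            use strict;
            use warnings;
            use threads;
            use Thread::Queue;

            my $volume_q = Thread::Queue->new();
            my $result_q = Thread::Queue->new();

            sub probe_volume {
                # Everything is derived from the name alone; no shared
                # or global state is touched.
                my $volume_folder = shift;
                my $count = () = qx( dir /b "$volume_folder" );    # stand-in probe
                return "$volume_folder,$count";
            }

            sub probe_worker {
                while ( defined( my $volume_folder = $volume_q->dequeue() ) ) {
                    $result_q->enqueue( probe_volume($volume_folder) );
                }
                return;
            }

            my @workers = map { threads->create( \&probe_worker ) } 1 .. 4;

            $volume_q->enqueue($_) for @ARGV;
            $volume_q->enqueue(undef) for @workers;    # stop markers

            $_->join() for @workers;

            # Collate, and reorder if needed, once all workers are done.
            my @rows;
            while ( defined( my $row = $result_q->dequeue_nb() ) ) {
                push @rows, $row;
            }
            print "$_\n" for sort @rows;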

Re: Adding parallel processing to a working Perl script
by sundialsvc4 (Abbot) on Apr 21, 2014 at 13:57 UTC

    And, what if you simply opened up several Terminal windows and ran a single command-line command, one per volume (or whatever), in each of those windows?   Instead of taking an already-working script and “parallelizing it,” could you not by other means simply run instances of that unmodified script “in parallel,” using command line parameters to distinguish them?

      Thank you for your reply, sundialsvc4.

      Yes, this is the sort of thing I'm doing now.

      start "01" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"A*/Z* 1^> 0 +1.csv 2^> 01.txt start "02" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"B*/Z* 1^> 0 +2.csv 2^> 02.txt start "03" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"C*/Z* 1^> 0 +3.csv 2^> 03.txt start "04" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"D*/Z* 1^> 0 +4.csv 2^> 04.txt start "05" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"E*/Z* 1^> 0 +5.csv 2^> 05.txt start "06" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"F*/Z* 1^> 0 +6.csv 2^> 06.txt start "07" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"G*/Z* 1^> 0 +7.csv 2^> 07.txt start "08" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"H*/Z* 1^> 0 +8.csv 2^> 08.txt start "09" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"I*/Z* 1^> 0 +9.csv 2^> 09.txt start "10" cmd /c perl CountFilesRecords.pl "ABCD/foo bar/"J*/Z* 1^> 1 +0.csv 2^> 10.txt

      Blech!

      It's precisely what I want to automate using Perl. It seems to me the best way to solve the problem is to add parallel processing of the folders and files inside the counting Perl script itself.
