PerlMonks  

creating unknown number of threads and then join results

by rudds_perl_habit (Novice)
on Jul 29, 2013 at 22:35 UTC (#1046925)
rudds_perl_habit has asked for the wisdom of the Perl Monks concerning the following question:

I have some successful scripts that use threads, but they all create a fixed number of threads. Now I am trying to write a script that creates X threads, where X is determined by the number of search directories. In this particular case, each thread runs a "cleartool find" command on a directory and returns an array of results. For this example I am just using the Unix find command; the "cleartool find" command in ClearCase is similar but takes a lot longer to run.

What I am finding is that on small data it seems to work fine, and I get consistent results. But with those really long-running ClearCase commands, I don't always get all the data I expect in the @Final array. There is probably a way to do this better... maybe locking the variable before I update it? I was thinking that each thread is updating a different hash key of the variable, so it should be safe to update this way? Or does it need to be locked before each join statement? Any suggestions on how to do this better?

#!/usr/local/bin/perl
use Cwd;
use threads;
use Data::Dumper;

my $use_cc = 0;
my @dirs = ();
if ( $use_cc ) {
    @dirs = split(/\s+/, $ENV{CLEARCASE_AVOBS});
} else {
    @dirs = qw(/bin /sbin /usr/local/bin /usr/sfw/bin /usr/bin);
}

# it's a clearcase thing
my $branch = "v4.0.0_gxp_patch";

# hash of dir names with thread values
my %threads = ();
# hash of dir names with arrays of found items
my %Found = ();
# large array to hold all results
my @Final = ();

foreach my $dir ( sort @dirs ) {
    chomp($dir);
    # add dir name to hash
    $Found{$dir} = ();
    # create thread and add it to threads hash
    $threads{$dir} = threads->create({'context' => 'list'}, 'find_thread', $dir, $use_cc, $branch);
}

foreach my $dir ( sort keys %threads ) {
    # cycle through threads hash and join up results, put them in hash-of-arrays
    @{ $Found{$dir} } = $threads{$dir}->join();
}

# stuff all the smaller hash-of-arrays into a large array for easier processing later on
foreach my $dir ( sort keys %Found ) {
    foreach my $item ( sort @{ $Found{$dir} } ) {
        push(@Final, $item);
    }
}

print Dumper(@Final);
print "SIZE: " . scalar(@Final) . "\n";

sub find_thread {
    my $dir = shift;
    my $cc_flag = shift;
    my $branch = shift;
    my @results;

    chdir $dir or die "Cannot change to $dir\n";
    print "Finding all files in dir: $dir\n";
    if ( $cc_flag ) {
        @results = `cleartool find -all -version 'brtype($branch)' -print 2>&1`;
    } else {
        @results = `find $dir -print 2>&1`;
    }
    return @results;
}

Re: creating unknown number of threads and then join results
by BrowserUk (Pope) on Jul 29, 2013 at 23:42 UTC
    maybe locking the variable before I update it? I was thinking that each thread is updating a different hash key of the variable, so it should be safe to update this way? Or does it need to be locked before each join statement?

    All your updates to %Found are done within the same thread, so there is no need to lock anything. Besides which, %Found isn't a shared variable, so you couldn't lock it if you tried.
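
For contrast, here is a minimal sketch of the case where locking would matter. It assumes, unlike the original script, that %Found is declared :shared and that the worker threads write into it directly. The worker sub and its stand-in data are hypothetical and only illustrate the mechanics.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical variant: %Found is shared, so every thread sees the same hash.
my %Found :shared;

sub worker {
    my ($dir) = @_;
    # Stand-in for the output of the real external command.
    my @results = map { "$dir/item$_" } 1 .. 3;

    lock(%Found);                            # lock is released when the sub returns
    $Found{$dir} = shared_clone(\@results);  # nested data must itself be shared
    return;
}

my @workers = map { threads->create(\&worker, $_) } qw(/a /b /c);
$_->join() for @workers;

print "$_: ", scalar @{ $Found{$_} }, " items\n" for sort keys %Found;
```

In the script as posted none of this is needed, because the results travel back through join() and only the main thread touches %Found.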

    Any suggestions on how to do this better?

    Apart from this:

    foreach my $dir ( sort keys %Found ) {
        foreach my $item ( sort @{ $Found{$dir} } ) {
            push(@Final, $item);
        }
    }

    Could be more efficiently written as:

    foreach my $dir ( sort keys %Found ) {
        push(@Final, sort @{ $Found{$dir} });
    }

    Not really. It is hard to see any scope for you not getting all the results produced by the external commands.

    Perhaps you could print out the size of @results before returning and then sum those and compare it with the size of @Final?
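
One way to sketch that check: have each thread return its own count ahead of the data, then compare the sum of those counts with the size of @Final in the main thread. The directory list and the shallow find call are placeholders (and -maxdepth is not available in every find implementation); the count-first return convention is an assumption made for this example, not part of the original script.

```perl
use strict;
use warnings;
use threads;

sub find_thread {
    my ($dir) = @_;
    # Placeholder command; -maxdepth 1 just keeps the example quick.
    my @results = `find $dir -maxdepth 1 -print 2>&1`;
    # Return the count as seen inside the thread, followed by the data.
    return (scalar(@results), @results);
}

my @dirs = qw(/bin /usr /tmp);   # placeholder directories
my %threads = map {
    $_ => threads->create({ 'context' => 'list' }, \&find_thread, $_)
} @dirs;

my (%Found, @Final);
my $reported = 0;
foreach my $dir ( sort keys %threads ) {
    my ($count, @items) = $threads{$dir}->join();
    $reported += $count;
    warn "$dir: thread saw $count items, join returned " . scalar(@items) . "\n"
        if $count != @items;
    $Found{$dir} = \@items;
    push(@Final, sort @items);
}

printf "threads reported %d items, \@Final holds %d\n", $reported, scalar(@Final);
```

If the two numbers always agree, nothing is being lost between the threads and @Final, and the place to look is the output of the external command itself.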


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks for the suggestion about dumping the @results array. That did help. Well, sort of. It made me more confused, actually. In the find_thread routine, I added a line before the return:

      print "find_thread dump $dir: " . Dumper(@results) . "\n";

      I then ran my script 10 times, dumping the results to 10 files. What I am finding is that the dump of @results can sometimes contain output that is from another directory entirely. For example:

      find_thread dump /vobs/doc: $VAR1 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/0';
      $VAR2 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/1';
      $VAR3 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/2';
      $VAR4 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/3';
      SIZE /vobs/doc: 4
      find_thread dump /vobs/drs: $VAR1 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/0';
      $VAR2 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/1';
      $VAR3 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/2';
      $VAR4 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/3';
      SIZE /vobs/drs: 4

      Which is totally confusing. @results is local to find_thread and shouldn't know anything about the other threads. What is even weirder is that when I switch to not use ClearCase find, and just find directories on the system, it seems to all work fine. So at this point, I am thinking that spawning multiple ClearCase find commands at once is causing an issue. I'll take it up with IBM.

        So at this point, I am thinking that spawning multiple ClearCase find commands at once is causing an issue. I'll take it up with IBM.

        I concur. There is nothing in your code that could account for the symptoms you are seeing, so their source can only lie with the commands you are calling.
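
One further possibility that may be worth ruling out: according to the threads documentation, on all platforms except MSWin32 the current working directory is shared by every thread in the process, so a chdir in one thread changes it for all of them. If the cleartool command picks its VOB from the current directory (which the chdir in find_thread suggests, but is an assumption here), two threads could end up searching the same place, whereas `find $dir` names its directory explicitly and would be unaffected. A minimal sketch that avoids chdir in the Perl process by changing directory inside the child shell instead; run_in_dir is a hypothetical helper, and pwd stands in for the real command:

```perl
use strict;
use warnings;
use threads;

# Hypothetical helper: run $cmd with $dir as its working directory.
# The cd happens in the child shell, so the Perl process (and every
# other thread in it) keeps its own working directory untouched.
sub run_in_dir {
    my ($dir, $cmd) = @_;
    (my $quoted = $dir) =~ s/'/'\\''/g;   # escape single quotes for sh
    my @out = `cd '$quoted' && $cmd 2>&1`;
    return @out;
}

my @dirs = qw(/bin /usr /tmp);   # placeholder directories
my %threads = map {
    $_ => threads->create({ 'context' => 'list' }, \&run_in_dir, $_, 'pwd')
} @dirs;

foreach my $dir ( sort keys %threads ) {
    my @out = $threads{$dir}->join();
    chomp(@out);
    print "$dir: command ran in @out\n";
}
```

If the mixed-up results disappear with this change, the cause was the shared working directory rather than ClearCase running several finds at once.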


