Perl threads - parallel run not working

by aaron.schweitzer (Novice)
on Sep 15, 2015 at 19:19 UTC [id://1142124]

aaron.schweitzer has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need some help with this problem. I am working on a Perl script built around File::Find. The idea is to spawn multiple threads, each running find() on a different mountpoint/directory to locate files owned by a specific uid; with several threads running in parallel I hope to cut the time taken to search the whole system.
However, when I spawn only one thread (i.e. $thr1), it finds all the files under that directory. When I enable both threads, thread 1 stops partway through.

For example, with only $thr1 enabled and $thr2 commented out:

    tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
    3948

With only $thr2 enabled and $thr1 commented out:

    tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
    1436

With both enabled (code below):

    tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
    1753

Each thread updates a shared global array. The findFiles function was generated by the find2perl command.
    use File::Find;
    use threads;
    use threads::shared;

    my $user      :shared = $cargs{u};
    my $uidNum    :shared = getpwnam($user);
    my $user_file :shared = "/var/tmp/test_$user.file.list";
    my @array_to_write :shared;

    my $thr1 = threads->create(\&find, \&findFiles, "/var/tmp");
    my $thr2 = threads->create(\&find, \&findFiles, "/tmp");
    $thr1->join();
    $thr2->join();

    print "Saving the details to $user_file\n";
    saveFile($user_file, @array_to_write);
    exit;

    sub findFiles {
        my ($dev,$ino,$mode,$nlink,$uid,$gid,$file,$fstype);
        $file = $File::Find::name;
        lock(@array_to_write);
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($file)) &&
        !($File::Find::prune |= ($dev != $File::Find::topdev)) &&
        ($uid == "$uidNum") &&
        push(@array_to_write, "$file:$uidNum:$gid:$mode:$fstype");
    }

    sub saveFile {
        my ($file, @data) = @_;
        open(F, ">$file") or die("Error: Failed to open $file for write: $!\n");
        print F (join("\n", @data) . "\n");
        close(F);
    }
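For reference, here is a minimal sketch that sidesteps both the shared array and (as the replies below discuss) File::Find's thread-unsafety: each thread walks its tree with plain opendir/readdir into a lexical array and returns its matches through join(). The command-line handling is illustrative, not part of the original script:

    use strict;
    use warnings;
    use threads;
    use Fcntl ':mode';
    use File::Spec::Functions qw(catfile);

    my $user   = shift // die "usage: $0 <user>\n";
    my $uidNum = getpwnam($user) // die "no such user: $user\n";

    sub scan_tree {
        my @todo = @_;       # this thread's private work list
        my @found;           # lexical, so no locking is needed
        while (defined(my $dir = shift @todo)) {
            opendir my $dh, $dir or next;
            for my $entry (readdir $dh) {
                next if $entry eq '.' || $entry eq '..';
                my $path = catfile($dir, $entry);
                my @st = lstat $path or next;
                if (S_ISDIR($st[2])) { push @todo, $path; next }
                push @found, $path if $st[4] == $uidNum;
            }
        }
        return @found;       # handed back to join() in the parent
    }

    my @thr = map { threads->create(\&scan_tree, $_) } '/var/tmp', '/tmp';
    print "$_\n" for map { $_->join } @thr;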

======================================================================

Update: I moved the parallel execution of find into multiple processes instead of threads; each process handles one of the mount points available on the system.

The parent loop:

    foreach my $mnt (@mounts) {
        $script = "file_collections.pl -u $user -m $mnt";
        my $pid = fork;
        if (not defined $pid) {
            die 'Could not fork child';
        }
        if ($pid) {
            $forks++;
            Log("Parent process PID ($$) starting child PID $pid for $mnt.\n"
              . "Num of forked child processes: $forks\n");
        }
        else {
            Log("Child PID ($$) --> executing '$script'\n");
            exec("$scr_loc/$script");
            exit;
        }
    }

    for (my $i = 0; $i < $forks; $i++) {
        my $pid = wait();
        Log("$0 : $pid Completed\n");
    }

And file_collections.pl:

    Log("Start Time \t\t: $datestring \n\n");
    Log("Finding the files for $user with uid $uidNum on $mount\n");

    $path = $mount;
    $path =~ s/\//_/g;
    $user_file     = "$log_dir/$user$path.list";
    $user_log_file = "$log_dir/$user$path.log";

    find(\&findFiles, "$mount");

    $datestring = getDateTime();
    Log("Total number of files parsed on $mount = $count\n");
    Log("Saving the details to $user_file\n\n");
    Log("End Time \t\t: $datestring\n");

    saveFile($user_file, @array_to_write);
    saveFile($user_log_file, @log);

    sub findFiles {
        my ($dev,$ino,$mode,$nlink,$uid,$gid,$file,$fstype);
        $file   = $File::Find::name;
        $fstype = findFstype("$file");
        $count++;
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($file)) &&
        !($File::Find::prune |= ($dev != $File::Find::topdev)) &&
        ($uid == "$uidNum") &&
        push(@array_to_write, "$file:$uidNum:$gid:$mode:$fstype");
    }
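As an aside, Parallel::ForkManager wraps this same fork/exec/reap pattern; a rough sketch, with stand-ins for the variables the real script defines elsewhere:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Stand-ins for variables defined elsewhere in the real script
    my $user    = $ENV{USER};
    my $scr_loc = '.';
    my @mounts  = @ARGV;

    my $pm = Parallel::ForkManager->new(scalar @mounts);  # one child per mount

    for my $mnt (@mounts) {
        $pm->start and next;    # parent gets the child PID and moves on
        # child: replace ourselves with the per-mount collector
        exec("$scr_loc/file_collections.pl", '-u', $user, '-m', $mnt)
            or die "exec failed: $!";
    }
    $pm->wait_all_children;     # block until every worker has been reaped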

Replies are listed 'Best First'.
Re: Perl threads - parallel run not working
by BrowserUk (Patriarch) on Sep 15, 2015 at 19:41 UTC

    File::Find isn't threadsafe.

    From the source code:

    # Should ideally be my() not our() but local() currently
    # refuses to operate on lexicals
    our %SLnkSeen;
    our ($wanted_callback, $avoid_nlink, $bydepth, $no_chdir, $follow,
         $follow_skip, $full_check, $untaint, $untaint_skip, $untaint_pat,
         $pre_process, $post_process, $dangling_symlinks);

      Surely File::Find would be unsafe due to shared OS interaction (perhaps a shared stat buffer (for "-f _" etc.) or a shared directory read buffer; I'm speculating here rather than looking at the perl source), rather than because it uses lexical or package vars (both of which are private to each thread)?

      Dave.
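      (For readers who haven't met the _ handle Dave refers to: a filetest on _ reuses the stat structure filled by the most recent stat, lstat or filetest, instead of issuing a fresh system call. A tiny illustration:)

          use strict;
          use warnings;

          my $path = shift // '.';
          if (lstat $path) {                 # one syscall fills the stat buffer
              print "directory\n" if -d _;   # filetests on _ reuse that buffer
              print "symlink\n"   if -l _;   # -l _ is only legal after an lstat
          }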

        Surely File::Find would be unsafe due to shared OS interaction (perhaps a shared stat buffer (for "-f _" etc.) or a shared directory read buffer

        You are probably right; you usually are.

        I tried File::Find from threads once, about 7 or 8 years ago, and found it not to be thread-safe. I took a brief look inside, found it incredibly complex, and decided that, given its reliance upon global variables and similar machinery, it wasn't worth the effort to look any further. I've never used or looked at it since.


Re: Perl threads - parallel run not working
by salva (Canon) on Sep 16, 2015 at 06:41 UTC
    When walking the file system, the bottleneck is usually not the CPU but I/O, especially if you have mechanical disks. Using threads may actually degrade performance, as it increases disk thrashing.

    So using threads only makes sense if you have several physical drives attached to your system and you use one thread for each of them.
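    A hypothetical sketch of that idea: group the starting directories by the device they sit on (the dev field of stat) and fork one worker per physical device, so each drive sees a single sequential reader. Here scan() is a placeholder for the real traversal:

        use strict;
        use warnings;

        sub scan { print "worker $$ scanning $_[0]\n" }   # placeholder traversal

        # Bucket the starting directories by device number
        my %by_dev;
        for my $dir (@ARGV) {
            my ($dev) = stat $dir or next;    # stat()[0] is the device number
            push @{ $by_dev{$dev} }, $dir;
        }

        # One child per device
        my @kids;
        for my $dirs (values %by_dev) {
            my $pid = fork // die "fork: $!";
            if ($pid == 0) {                  # child: scan this device's dirs
                scan($_) for @$dirs;
                exit 0;
            }
            push @kids, $pid;
        }
        waitpid $_, 0 for @kids;              # reap all workers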

Re: Perl threads - parallel run not working
by marioroy (Prior) on Sep 16, 2015 at 08:50 UTC

    Some time ago, Yary and I exchanged emails about walking directories in parallel. The following is our collaborative effort, modified for the OP's task.

        use strict;
        use warnings;

        # usage: ./collect_files.pl <user> /var/tmp /tmp
        # output is saved under /var/tmp/test_<user>.file.list

        package Number;

        sub new  { my ($class, $self) = (shift, shift || 0); bless \$self, $class; }
        sub decr { lock $_[0]; --${ $_[0] } }
        sub incr { lock $_[0]; ++${ $_[0] } }
        sub get  { lock $_[0];   ${ $_[0] } }

        package Main;

        use threads;
        use threads::shared;
        use Thread::Queue;
        use File::Spec::Functions qw(catfile);
        use MCE;

        # Get user info and directories to start with
        my $user   = shift;
        my $uidNum = getpwnam($user);

        my @start_dirs = @ARGV
           or die "Please specify one or more starting directories, stopped";

        -d or die "No such directory: $_, stopped" for @start_dirs;

        # Shared queue and counter
        my $shared_work = Thread::Queue->new( @start_dirs );
        my $free_count  = shared_clone( Number->new(0) );

        # Open output file
        open my $user_file, ">", "/var/tmp/test_$user.file.list"
           or die "cannot open file: $!";

        sub traverse {
           $free_count->decr;

           my ( $dev, $ino, $mode, $nlink, $uid, $gid );
           my @work = $_;

           while ( $_ = shift @work, defined ) {
              my ( $dir, $path, @paths ) = ( $_ );
              opendir DH, $dir or next;

              for ( readdir DH ) {
                 next if $_ eq '.' || $_ eq '..';
                 if ( -d ( $path = catfile($dir, $_) ) ) {
                    ( @work < 15 )
                       ? push @work, $path
                       : $shared_work->enqueue($path);
                    next;
                 }
                 push @paths, $path;
              }

              for my $file ( @paths ) {
                 if ( -f $file ) {
                    ( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file );
                    MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" )
                       if $uid == $uidNum;
                 }
              }
           }

           # Done with our work, let everyone know we're free
           $shared_work->enqueue( (undef) x MCE->max_workers )
              if $free_count->incr == MCE->max_workers && !$shared_work->pending;
        }

        my $mce = MCE->new(
           max_workers => 8,
           user_begin  => sub { $free_count->incr },
           user_func   => sub { traverse() while ($_ = $shared_work->dequeue, defined) },
        );

        $mce->run;

        close $user_file;

    Kind regards, Mario

      Updated September 5, 2016: Modified code to run with MCE 1.8 / MCE::Shared 1.8.

      The following also runs on a Perl built without threads support.

          use strict;
          use warnings;

          # usage: ./collect_files.pl <user> /var/tmp /tmp
          # output is saved under /var/tmp/test_<user>.file.list

          use File::Spec::Functions qw(catfile);
          use MCE;
          use MCE::Queue;
          use MCE::Shared;

          # Get user info and directories to start with
          my $user   = shift;
          my $uidNum = getpwnam($user);

          my @start_dirs = @ARGV
             or die "Please specify one or more starting directories, stopped";

          -d or die "No such directory: $_, stopped" for @start_dirs;

          # Shared queue and counter
          my $shared_work = MCE::Queue->new( fast => 1, queue => \@start_dirs );
          my $free_count  = MCE::Shared->scalar( 0 );

          # Open output file
          open my $user_file, ">", "/var/tmp/test_$user.file.list"
             or die "cannot open file: $!";

          sub traverse {
             $free_count->decr;

             my ( $dev, $ino, $mode, $nlink, $uid, $gid );
             my @work = $_;

             while ( $_ = shift @work, defined ) {
                my ( $dir, $path, @paths ) = ( $_ );
                opendir DH, $dir or next;

                for ( readdir DH ) {
                   next if $_ eq '.' || $_ eq '..';
                   if ( -d ( $path = catfile($dir, $_) ) ) {
                      ( @work < 15 )
                         ? push @work, $path
                         : $shared_work->enqueue($path);
                      next;
                   }
                   push @paths, $path;
                }

                for my $file ( @paths ) {
                   if ( -f $file ) {
                      ( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file );
                      MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" )
                         if $uid == $uidNum;
                   }
                }
             }

             # Done with our work, let everyone know we're free
             $shared_work->enqueue( (undef) x MCE->max_workers )
                if $free_count->incr == MCE->max_workers && !$shared_work->pending;
          }

          my $mce = MCE->new(
             max_workers => 8,
             user_begin  => sub { $free_count->incr },
             user_func   => sub { traverse() while ($_ = $shared_work->dequeue, defined) },
          );

          $mce->run;

          close $user_file;

      Kind regards, Mario

      Processing large directory trees in parallel is an interesting problem, since "a directory with a huge number of files" may need to be split up differently than "a huge number of branches", depending on one's approach. The method above handles either (and is even more general than GNU parallel).

      File::Find can be made to work in parallel, but it is ugly; it is cleaner to start fresh as we did here. I keep meaning to blog about this alternate solution (I have a post half-written) and turn it into a CPAN module...
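      For the curious, one sketch of that "ugly" route: have find() process a single directory level at a time by deferring subdirectories in a preprocess callback, so that parallel workers could pull directories from a shared queue. Shown single-process here, with a plain array standing in for the queue:

          use strict;
          use warnings;
          use File::Find;

          my @work = @ARGV;    # would be a shared queue in the parallel version

          while (defined(my $dir = shift @work)) {
              find({
                  wanted     => sub { print "$File::Find::name\n" if -f },
                  preprocess => sub {
                      my @entries = grep { !/^\.\.?\z/ } @_;
                      # defer subdirectories instead of recursing into them
                      push @work, map("$File::Find::dir/$_",
                                      grep { -d && !-l } @entries);
                      return grep { !-d } @entries;  # only non-dirs reach wanted()
                  },
              }, $dir);
          }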

