aaron.schweitzer has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
I need some help with this problem.
I am working on a Perl script that uses File::Find. The script spawns multiple threads, each running find() on a different mountpoint/directory to locate files owned by a specific uid; the idea is to reduce the time taken to search the whole system.
However, when I spawn only one thread (i.e. $thr1), it finds all the files under that directory. When I enable both threads, thread 1 stops partway through.
Example: when only $thr1 was enabled and $thr2 was commented out:
tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
3948
When only $thr2 was enabled and $thr1 was commented out:
tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
1436
When both were enabled (below code):
tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
1753
Each thread updates a shared global array. The findFiles function was generated with the find2perl command.
use File::Find;
use threads;
use threads::shared;
my $user :shared = $cargs{u};
my $uidNum :shared = getpwnam($user);
my $user_file :shared = "/var/tmp/test_$user.file.list";
my @array_to_write :shared;
my $thr1 = threads->create(\&find, \&findFiles, "/var/tmp");
my $thr2 = threads->create(\&find, \&findFiles, "/tmp");
$thr1->join();
$thr2->join();
print "Saving the details to $user_file\n";
saveFile($user_file,@array_to_write);
exit;
sub findFiles {
    my ($dev,$ino,$mode,$nlink,$uid,$gid,$file,$fstype);
    $file = $File::Find::name;
    lock(@array_to_write);
    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($file)) &&
        !($File::Find::prune |= ($dev != $File::Find::topdev)) &&
        ($uid == $uidNum) &&
        push(@array_to_write, "$file:$uidNum:$gid:$mode:$fstype");
}

sub saveFile {
    my ($file, @data) = @_;
    open(F, ">", $file) or die("Error: Failed to open $file for write: $!\n");
    print F (join("\n", @data) . "\n");
    close(F);
}
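For reference, a thread-safe variant can avoid File::Find entirely: each thread walks its directory with plain opendir/readdir (which touch no shared package state) and returns its own result list, so the parent merges after join() and no shared array or locking is needed. This is only a sketch of mine, not the OP's code; walk(), the placeholder uid, and the hard-coded start directories are all illustrative.

```perl
use strict;
use warnings;
use threads;
use File::Spec::Functions qw(catfile);

# Plain recursive walker: no File::Find, so no shared package state.
# Returns the paths under $dir owned by $uid, without crossing onto
# another filesystem (the $topdev check mirrors File::Find::topdev).
sub walk {
    my ($dir, $uid, $topdev) = @_;
    my @found;
    my ($dev) = lstat($dir);
    return () unless defined $dev;
    $topdev //= $dev;
    return () if $dev != $topdev;          # stay on one filesystem
    opendir(my $dh, $dir) or return ();
    for my $entry (readdir $dh) {
        next if $entry eq '.' || $entry eq '..';
        my $path = catfile($dir, $entry);
        my ($pdev, $pino, $pmode, $pnlink, $puid) = lstat($path);
        next unless defined $pdev;
        push @found, $path if $puid == $uid;
        # -d _ reuses the lstat() result, so symlinks are not followed
        push @found, walk($path, $uid, $topdev) if -d _;
    }
    closedir $dh;
    return @found;
}

# One thread per starting directory; each thread returns its own list
# and the parent merges after join() -- no shared array, no locks.
my $uidNum  = 1000;                        # placeholder uid for the sketch
my @threads = map { threads->create(\&walk, $_, $uidNum) } '/var/tmp', '/tmp';
my @all     = map { $_->join } @threads;
```

Note this sketch only sidesteps File::Find; it does not make File::Find itself safe to call from multiple threads.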
======================================================================
Update:
Moved the parallel execution of find into multiple processes instead of threads; each process handles one of the mount points available on the system.
foreach my $mnt (@mounts) {
    $script = "file_collections.pl -u $user -m $mnt";
    my $pid = fork;
    if (not defined $pid) {
        warn "Could not fork child for $mnt\n";
        next;
    }
    if ($pid) {
        $forks++;
        Log("Parent process PID ($$) starting child PID $pid for $mnt.\nNum of forked child processes: $forks\n");
    }
    else {
        Log("child PID ($$) --> executing '$script'\n");
        exec("$scr_loc/$script") or die "Failed to exec $script: $!\n";
    }
}
for (my $i = 0; $i < $forks; $i++) {
    my $pid = wait();
    Log("$0 : $pid Completed\n");
}
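A small addition to the reaping loop (my sketch, assuming the same Log() helper as above): after wait() returns, $? carries the child's wait status, so the parent can report non-zero exits rather than just completion.

```perl
# Reap every forked child; $? holds the raw wait status after wait(),
# and the high byte ($? >> 8) is the child's exit code.
for (1 .. $forks) {
    my $pid = wait();
    last if $pid == -1;            # no more children to reap
    my $code = $? >> 8;
    Log("$0 : child $pid completed (exit status $code)\n");
}
```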
file_collections.pl (the child script):
Log ("Start Time \t\t: $datestring \n\n");
Log ("Finding the files for $user with uid $uidNum on $mount\n");
$path=$mount;
$path=~s/\//_/g;
$user_file = "$log_dir/$user$path.list";
$user_log_file = "$log_dir/$user$path.log";
find(\&findFiles, "$mount");
$datestring = getDateTime();
Log ("Total number of files parsed on $mount = $count\n");
Log ("Saving the details to $user_file\n\n");
Log ("End Time \t\t: $datestring\n");
saveFile($user_file,@array_to_write);
saveFile($user_log_file,@log);
sub findFiles {
    my ($dev,$ino,$mode,$nlink,$uid,$gid,$file,$fstype);
    $file   = $File::Find::name;
    $fstype = findFstype($file);
    $count++;
    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($file)) &&
        !($File::Find::prune |= ($dev != $File::Find::topdev)) &&
        ($uid == $uidNum) &&
        push(@array_to_write, "$file:$uidNum:$gid:$mode:$fstype");
}
Re: Perl threads - parallel run not working
by BrowserUk (Patriarch) on Sep 15, 2015 at 19:41 UTC
# Should ideally be my() not our() but local() currently
# refuses to operate on lexicals
our %SLnkSeen;
our ($wanted_callback, $avoid_nlink, $bydepth, $no_chdir, $follow,
$follow_skip, $full_check, $untaint, $untaint_skip, $untaint_pat,
$pre_process, $post_process, $dangling_symlinks);
Surely File::Find would be unsafe due to shared OS interaction (perhaps a shared stat buffer (for "-f _" etc.) or a shared directory read buffer - I'm speculating here rather than looking at the perl source), rather than because of whether it uses lexical or package variables (both of which are private to each thread)?
Dave.
Surely File::Find would be unsafe due to shared OS interaction (perhaps a shared stat buffer (for "-f _" etc.) or a shared directory read buffer
You are probably right; you usually are.
I tried File::Find from threads once, about 7 or 8 years ago, and found that it was not thread-safe. I took a brief scan inside, found it to be incredibly complex, and decided that, given its reliance upon global variables and other similar stuff, it wasn't worth the effort to look any further. I've never used or looked at it since.
Re: Perl threads - parallel run not working
by salva (Canon) on Sep 16, 2015 at 06:41 UTC
When browsing the file system, the bottleneck is usually not the CPU but IO, especially if you have mechanical disks. Using threads may actually degrade performance, as it increases disk thrashing.
So using threads only makes sense if you have several physical drives attached to your system and you use one thread for each of them.
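That point can be made concrete with a small sketch (mine, not part of the thread): group mount points by the device ID that stat() reports, then dedicate one worker to each device. @mounts is a placeholder list, and note that separate partitions on one physical drive still get distinct device IDs, so this is only an approximation.

```perl
use strict;
use warnings;

# Group mount points by device ID so one worker can be assigned per
# device rather than per mount point (illustrative only).
my @mounts = ('/var/tmp', '/tmp', '/home');   # placeholder list
my %by_device;
for my $mnt (@mounts) {
    my ($dev) = stat($mnt);      # first field of stat() is the device ID
    next unless defined $dev;
    push @{ $by_device{$dev} }, $mnt;
}
# Each key is one device; spawn one thread/process per key and have it
# walk its mount points sequentially to avoid seek contention.
for my $dev (sort keys %by_device) {
    print "device $dev: @{ $by_device{$dev} }\n";
}
```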
Re: Perl threads - parallel run not working
by marioroy (Prior) on Sep 16, 2015 at 08:50 UTC
Some time ago, Yary and I exchanged emails about walking directories in parallel. The following is our collaborative effort, modified for the OP's task.
use strict;
use warnings;
# usage: ./collect_files.pl <user> /var/tmp /tmp
# output is saved under /var/tmp/test_<user>.file.list
package Number;
sub new {
my ($class, $self) = (shift, shift || 0);
bless \$self, $class;
}
sub decr { lock $_[0]; --${ $_[0] } }
sub incr { lock $_[0]; ++${ $_[0] } }
sub get { lock $_[0]; ${ $_[0] } }
package Main;
use threads;
use threads::shared;
use Thread::Queue;
use File::Spec::Functions qw(catfile);
use MCE;
# Get user info and directories to start with
my $user = shift;
my $uidNum = getpwnam($user);
my @start_dirs = @ARGV
or die "Please specify one or more starting directories, stopped";
-d or die "No such directory: $_, stopped" for @start_dirs;
# Shared queue and counter
my $shared_work = Thread::Queue->new( @start_dirs );
my $free_count = shared_clone( Number->new(0) );
# Open output file
open my $user_file, ">", "/var/tmp/test_$user.file.list"
or die "cannot open file: $!";
sub traverse {
$free_count->decr;
my ( $dev, $ino, $mode, $nlink, $uid, $gid );
my @work = $_;
while ( $_ = shift @work, defined ) {
my ( $dir, $path, @paths ) = ( $_ );
opendir DH, $dir or next;
for ( readdir DH ) {
next if $_ eq '.' || $_ eq '..';
if ( -d ( $path = catfile($dir,$_) ) ) {
( @work < 15 )
? push @work, $path
: $shared_work->enqueue($path);
next;
}
push @paths, $path;
}
for my $file ( @paths ) {
if ( -f $file ) {
( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file );
MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" )
if $uid == $uidNum;
}
}
}
# Done with our work, let everyone know we're free
$shared_work->enqueue( (undef) x MCE->max_workers )
   if $free_count->incr == MCE->max_workers && !$shared_work->pending;
}
my $mce = MCE->new(
max_workers => 8,
user_begin => sub { $free_count->incr },
user_func => sub { traverse() while ($_ = $shared_work->dequeue, defined) },
);
$mce->run;
close $user_file;
Kind regards, Mario
Updated September 5, 2016: Modified code to run with MCE 1.8 / MCE::Shared 1.8.
The following runs with Perl not built with threads support.
use strict;
use warnings;
# usage: ./collect_files.pl <user> /var/tmp /tmp
# output is saved under /var/tmp/test_<user>.file.list
use File::Spec::Functions qw(catfile);
use MCE;
use MCE::Queue;
use MCE::Shared;
# Get user info and directories to start with
my $user = shift;
my $uidNum = getpwnam($user);
my @start_dirs = @ARGV
or die "Please specify one or more starting directories, stopped";
-d or die "No such directory: $_, stopped" for @start_dirs;
# Shared queue and counter
my $shared_work = MCE::Queue->new( fast => 1, queue => \@start_dirs );
my $free_count = MCE::Shared->scalar( 0 );
# Open output file
open my $user_file, ">", "/var/tmp/test_$user.file.list"
or die "cannot open file: $!";
sub traverse {
$free_count->decr;
my ( $dev, $ino, $mode, $nlink, $uid, $gid );
my @work = $_;
while ( $_ = shift @work, defined ) {
my ( $dir, $path, @paths ) = ( $_ );
opendir DH, $dir or next;
for ( readdir DH ) {
next if $_ eq '.' || $_ eq '..';
if ( -d ( $path = catfile($dir,$_) ) ) {
( @work < 15 )
? push @work, $path
: $shared_work->enqueue($path);
next;
}
push @paths, $path;
}
for my $file ( @paths ) {
if ( -f $file ) {
( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file );
MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" )
if $uid == $uidNum;
}
}
}
# Done with our work, let everyone know we're free
$shared_work->enqueue( (undef) x MCE->max_workers )
   if $free_count->incr == MCE->max_workers && !$shared_work->pending;
}
my $mce = MCE->new(
max_workers => 8,
user_begin => sub { $free_count->incr },
user_func => sub { traverse() while ($_ = $shared_work->dequeue, defined) },
);
$mce->run;
close $user_file;
Kind regards, Mario
Re: Perl threads - parallel run not working
by Anonymous Monk on Sep 15, 2015 at 23:33 UTC
A reply falls below the community's threshold of quality.