PerlMonks  

Perl threads - parallel run not working

by aaron.schweitzer (Novice)
on Sep 15, 2015 at 19:19 UTC ( [id://1142124] )

aaron.schweitzer has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need some help with this problem. I am working on Perl code that uses File::Find. The idea is to spawn multiple threads, each running find() on a different mount point/directory, to collect files owned by a specific uid. I want multiple threads so that I can reduce the time taken to search the whole system.
The problem: when I spawn only one thread (i.e. $thr1), it finds all the files in that directory. But when I enable both threads, thread 1 is cut short partway through.

For example, when only $thr1 was enabled ($thr2 commented out):

    tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
    3948

When only $thr2 was enabled ($thr1 commented out):

    tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
    1436

When both were enabled (code below):

    tstsrv:/home/aaron # ./collect_files_threaded.pl -u aaron | wc -l
    1753

Each thread updates a global variable. The findFiles function was generated by the find2perl command.
    use File::Find;
    use threads;
    use threads::shared;

    my $user           :shared = $cargs{u};
    my $uidNum         :shared = getpwnam($user);
    my $user_file      :shared = "/var/tmp/test_$user.file.list";
    my @array_to_write :shared;

    my $thr1 = threads->create(\&find, \&findFiles, "/var/tmp");
    my $thr2 = threads->create(\&find, \&findFiles, "/tmp");
    $thr1->join();
    $thr2->join();

    print "Saving the details to $user_file\n";
    saveFile($user_file, @array_to_write);
    exit;

    sub findFiles {
        my ($dev,$ino,$mode,$nlink,$uid,$gid,$file,$fstype);
        $file = $File::Find::name;
        lock(@array_to_write);
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($file)) &&
        !($File::Find::prune |= ($dev != $File::Find::topdev)) &&
        ($uid == "$uidNum") &&
        push(@array_to_write, "$file:$uidNum:$gid:$mode:$fstype");
    }

    sub saveFile {
        my ($file, @data) = @_;
        open(F, ">$file") or die("Error: Failed to open $file for write: $!\n");
        print F (join("\n", @data) . "\n");
        close(F);
    }

======================================================================

Update: Moved the parallel execution of find into multiple processes instead of threads. Each process takes care of a mount point available in the system.

    foreach my $mnt (@mounts) {
        $script = "file_collections.pl -u $user -m $mnt";
        my $pid = fork;
        if (not defined $pid) {
            die 'Could not fork child';
        }
        if ($pid) {
            $forks++;
            Log("Parent process PID ($$) starting child pid: $pid for $mnt.\nNum of forked child processes: $forks\n");
        }
        else {
            Log("child PID ($$) --> executing '$script'\n");
            exec("$scr_loc/$script");
            exit;
        }
    }

    for (my $i = 0; $i < $forks; $i++) {
        my $pid = wait();
        Log("$0 : $pid Completed\n");
    }

file_collections.pl:

    Log("Start Time \t\t: $datestring \n\n");
    Log("Finding the files for $user with uid $uidNum on $mount\n");

    $path = $mount;
    $path =~ s/\//_/g;
    $user_file     = "$log_dir/$user$path.list";
    $user_log_file = "$log_dir/$user$path.log";

    find(\&findFiles, "$mount");

    $datestring = getDateTime();
    Log("Total number of files parsed on $mount = $count\n");
    Log("Saving the details to $user_file\n\n");
    Log("End Time \t\t: $datestring\n");

    saveFile($user_file, @array_to_write);
    saveFile($user_log_file, @log);

    sub findFiles {
        my ($dev,$ino,$mode,$nlink,$uid,$gid,$file,$fstype);
        $file   = $File::Find::name;
        $fstype = findFstype("$file");
        $count++;
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($file)) &&
        !($File::Find::prune |= ($dev != $File::Find::topdev)) &&
        ($uid == "$uidNum") &&
        push(@array_to_write, "$file:$uidNum:$gid:$mode:$fstype");
    }
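The parent script above forks one child per mount with no ceiling on how many run at once. If the number of mounts is large, one way to cap concurrency is the CPAN module Parallel::ForkManager; this is a hedged sketch, not the poster's code, and scan_mount and the mount list are placeholders:

```perl
use strict;
use warnings;
use Parallel::ForkManager;   # CPAN module; caps concurrent children

my @mounts = ('/var/tmp', '/tmp');           # placeholder mount list
my $pm     = Parallel::ForkManager->new(4);  # at most 4 children at once

for my $mnt (@mounts) {
    $pm->start and next;     # parent: record the child, move on
    scan_mount($mnt);        # child: do the per-mount work
    $pm->finish;             # child exits; parent is notified
}
$pm->wait_all_children;      # block until every child is done

sub scan_mount { my ($m) = @_; print "would scan $m\n" }   # stub
```

The start/finish pair replaces the hand-rolled fork bookkeeping and the wait() loop, and new(4) throttles the number of simultaneous scans.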

Replies are listed 'Best First'.
Re: Perl threads - parallel run not working
by BrowserUk (Patriarch) on Sep 15, 2015 at 19:41 UTC

    File::Find isn't threadsafe.

    From the source code:

    # Should ideally be my() not our() but local() currently
    # refuses to operate on lexicals
    our %SLnkSeen;
    our ($wanted_callback, $avoid_nlink, $bydepth, $no_chdir, $follow,
         $follow_skip, $full_check, $untaint, $untaint_skip, $untaint_pat,
         $pre_process, $post_process, $dangling_symlinks);

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Surely File::Find would be unsafe due to shared OS interaction (perhaps a shared stat buffer (for "-f _" etc.) or a shared directory read buffer - I'm speculating here rather than looking at the perl source), rather than because it uses lexical or package vars (both of which are private to each thread)?

      Dave.

        Surely File::Find would be unsafe due to shared OS interaction (perhaps a shared stat buffer (for "-f _" etc.) or a shared directory read buffer

        You are probably right; you usually are.

        I tried File::Find from threads once, about 7 or 8 years ago, and found it not to be thread-safe. I took a brief scan inside, found it to be incredibly complex, and decided that, given its reliance upon global variables and other similar stuff, it wasn't worth the effort to look any further. I've never used or looked at it since.


Re: Perl threads - parallel run not working
by salva (Canon) on Sep 16, 2015 at 06:41 UTC
    When browsing the file system, the bottleneck is usually not the CPU but IO, especially if you have mechanical disks. Using threads may actually degrade performance, as it increases disk thrashing.

    So using threads only makes sense if you have several physical drives attached to your system and you use one thread for each of them.
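    One way to sketch this one-worker-per-drive idea: group the mount points by the device field of stat() and fork one child per device, so each disk is scanned sequentially. This is a hedged illustration, not the poster's code; note that stat's dev field identifies a filesystem, which may not map one-to-one onto physical drives (LVM, RAID, etc.), and scan_mount is a placeholder stub.

```perl
use strict;
use warnings;

my @mounts = ('/var/tmp', '/tmp', '/home');   # example mount points

# Group mount points by device number so one worker handles each disk.
my %by_dev;
for my $mnt (@mounts) {
    my $dev = (stat $mnt)[0];                 # device field of stat()
    push @{ $by_dev{$dev} }, $mnt if defined $dev;
}

# One child per device; mounts on the same device are scanned serially.
my @pids;
for my $dev (keys %by_dev) {
    my $pid = fork // die "fork: $!";
    if ($pid == 0) {
        scan_mount($_) for @{ $by_dev{$dev} };   # sequential per disk
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;

sub scan_mount { my ($m) = @_; print "would scan $m\n" }   # stub
```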

Re: Perl threads - parallel run not working
by marioroy (Prior) on Sep 16, 2015 at 08:50 UTC

    Some time ago, Yary and I exchanged emails for walking directories in parallel. The following is our collaborative effort modified for the OP task.

    use strict;
    use warnings;

    # usage: ./collect_files.pl <user> /var/tmp /tmp
    # output is saved under /var/tmp/test_<user>.file.list

    package Number;

    sub new  { my ($class, $self) = (shift, shift || 0); bless \$self, $class; }
    sub decr { lock $_[0]; --${ $_[0] } }
    sub incr { lock $_[0]; ++${ $_[0] } }
    sub get  { lock $_[0]; ${ $_[0] } }

    package Main;

    use threads;
    use threads::shared;
    use Thread::Queue;
    use File::Spec::Functions qw(catfile);
    use MCE;

    # Get user info and directories to start with
    my $user   = shift;
    my $uidNum = getpwnam($user);

    my @start_dirs = @ARGV
        or die "Please specify one or more starting directories, stopped";

    -d or die "No such directory: $_, stopped" for @start_dirs;

    # Shared queue and counter
    my $shared_work = Thread::Queue->new( @start_dirs );
    my $free_count  = shared_clone( Number->new(0) );

    # Open output file
    open my $user_file, ">", "/var/tmp/test_$user.file.list"
        or die "cannot open file: $!";

    sub traverse {
        $free_count->decr;

        my ( $dev, $ino, $mode, $nlink, $uid, $gid );
        my @work = $_;

        while ( $_ = shift @work, defined ) {
            my ( $dir, $path, @paths ) = ( $_ );
            opendir DH, $dir or next;
            for ( readdir DH ) {
                next if $_ eq '.' || $_ eq '..';
                if ( -d ( $path = catfile($dir, $_) ) ) {
                    ( @work < 15 )
                        ? push @work, $path
                        : $shared_work->enqueue($path);
                    next;
                }
                push @paths, $path;
            }
            for my $file ( @paths ) {
                if ( -f $file ) {
                    ( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file );
                    MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" )
                        if $uid == $uidNum;
                }
            }
        }

        # Done with our work, let everyone know we're free
        $shared_work->enqueue( (undef) x MCE->max_workers )
            if $free_count->incr == MCE->max_workers && !$shared_work->pending;
    }

    my $mce = MCE->new(
        max_workers => 8,
        user_begin  => sub { $free_count->incr },
        user_func   => sub { traverse() while ($_ = $shared_work->dequeue, defined) },
    );

    $mce->run;

    close $user_file;

    Kind regards, Mario

      Updated September 5, 2016: Modified code to run with MCE 1.8 / MCE::Shared 1.8.

      The following runs with Perl not built with threads support.

      use strict;
      use warnings;

      # usage: ./collect_files.pl <user> /var/tmp /tmp
      # output is saved under /var/tmp/test_<user>.file.list

      use File::Spec::Functions qw(catfile);
      use MCE;
      use MCE::Queue;
      use MCE::Shared;

      # Get user info and directories to start with
      my $user   = shift;
      my $uidNum = getpwnam($user);

      my @start_dirs = @ARGV
          or die "Please specify one or more starting directories, stopped";

      -d or die "No such directory: $_, stopped" for @start_dirs;

      # Shared queue and counter
      my $shared_work = MCE::Queue->new( fast => 1, queue => \@start_dirs );
      my $free_count  = MCE::Shared->scalar( 0 );

      # Open output file
      open my $user_file, ">", "/var/tmp/test_$user.file.list"
          or die "cannot open file: $!";

      sub traverse {
          $free_count->decr;

          my ( $dev, $ino, $mode, $nlink, $uid, $gid );
          my @work = $_;

          while ( $_ = shift @work, defined ) {
              my ( $dir, $path, @paths ) = ( $_ );
              opendir DH, $dir or next;
              for ( readdir DH ) {
                  next if $_ eq '.' || $_ eq '..';
                  if ( -d ( $path = catfile($dir, $_) ) ) {
                      ( @work < 15 )
                          ? push @work, $path
                          : $shared_work->enqueue($path);
                      next;
                  }
                  push @paths, $path;
              }
              for my $file ( @paths ) {
                  if ( -f $file ) {
                      ( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file );
                      MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" )
                          if $uid == $uidNum;
                  }
              }
          }

          # Done with our work, let everyone know we're free
          $shared_work->enqueue( (undef) x MCE->max_workers )
              if $free_count->incr == MCE->max_workers && !$shared_work->pending;
      }

      my $mce = MCE->new(
          max_workers => 8,
          user_begin  => sub { $free_count->incr },
          user_func   => sub { traverse() while ($_ = $shared_work->dequeue, defined) },
      );

      $mce->run;

      close $user_file;

      Kind regards, Mario

      Processing large directory trees in parallel is an interesting problem, as "a directory with a huge number of files" could be split up differently than "a huge number of branches" depending on one's approach. The method above should work with either (and is even more general than GNU parallel).

      File::Find can be made to work in parallel, but it is ugly. Cleaner to start new as we did here. I keep meaning to blog about this alternate solution (have post half-written) and turn it into a CPAN module...
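      The core pattern behind the MCE code above - a shared queue of directories that workers both consume and feed, plus a counter to detect when everyone is done - can be reduced to a hedged threads + Thread::Queue sketch. The seed path, worker count, and termination scheme below are illustrative, not the code from this thread; $pending counts directories that are enqueued but not yet fully processed, and must start at the number of seed directories.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $queue   = Thread::Queue->new('/tmp');   # seed directory (example)
my $pending :shared = 1;                    # one seed dir outstanding
my $workers = 4;

sub worker {
    # dequeue blocks until work or an undef "poison pill" arrives
    while (defined(my $dir = $queue->dequeue)) {
        if (opendir my $dh, $dir) {
            for my $name (readdir $dh) {
                next if $name eq '.' or $name eq '..';
                my $path = "$dir/$name";
                if (-d $path && !-l $path) {
                    { lock $pending; $pending++ }   # one more dir owed
                    $queue->enqueue($path);         # new branch to walk
                }
                else {
                    print "$path\n";   # leaf: process (may interleave)
                }
            }
        }
        # This directory is finished; if it was the last, wake everyone.
        lock $pending;
        $queue->enqueue((undef) x $workers) if --$pending == 0;
    }
}

my @thr = map { threads->create(\&worker) } 1 .. $workers;
$_->join for @thr;
```

      The hard part, as in the MCE version's free counter, is telling "queue momentarily empty" apart from "no work can ever appear again"; the shared $pending counter resolves that by hitting zero only when every discovered directory has been fully read.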

Re: Perl threads - parallel run not working
by Anonymous Monk on Sep 15, 2015 at 23:33 UTC

Node Type: perlquestion [id://1142124]
Approved by Albannach