PerlMonks  

Re: use threads for dir tree walking really hurts

by Corion (Patriarch)
on Aug 31, 2016 at 13:29 UTC ( [id://1170882]=note )


in reply to use threads for dir tree walking really hurts

    use Devel::Pointer;
    ...
    my $obj  = deref( $addr );
    my $root = shift( @{ $obj->{dirToFetch} } );

Why are you doing that?

Perl is not C and you don't need to step outside the Perl datatypes to handle data access from multiple threads within Perl.

The following should be the equivalent of what you do, except far saner and not needing Devel::Pointer:

    ...
    sub _walk {
        my $obj  = shift;
        my $root = shift( @{ $obj->{dirToFetch} } );
        $obj->_fetchDir( $root );
    }
    ...
    threads->create( '_walk', $self )->join;

Note that you do not even start running multiple threads in the above because you spawn a separate thread but don't continue until it has finished its work. Most likely, a better approach is to store all threads and then wait for them to finish:

    ...
    push my @running, threads->create( '_walk', $self );
    ...
    while( @running ) {
        my $next = shift @running;
        $next->join;
    };
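To make the difference concrete, here is a minimal, self-contained sketch of the "start everything, then join everything" pattern. The `_work` sub is a hypothetical stand-in for the real per-directory work, not code from the original post:

```perl
use strict;
use warnings;
use threads;

# Hypothetical stand-in for the real per-directory work.
sub _work {
    my ($n) = @_;
    return $n * $n;
}

# Start all threads first ...
my @running = map { threads->create( \&_work, $_ ) } 1 .. 4;

# ... and only then wait for each of them, in creation order.
my @results = map { $_->join } @running;

print "@results\n";
```

Because the threads are joined in the order they were created, the results come back in that order regardless of which thread finishes first.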

Personally, I recommend using Thread::Queue and a worker pool to handle a workload, because starting a Perl thread is relatively resource intensive. I'm not sure that using multiple threads will bring you much benefit, as I think your operation is largely limited by the network or the HD (or filesystem) performance.

Thinking more about it, I guess that a somewhat better approach is to have all directories to crawl stored in a Thread::Queue and to have threads fetch from that whenever they need to crawl a new directory. For output, I would use another Thread::Queue, just for simplicity (roughly adapted from here):

    #! perl -slw
    use strict;
    use threads;
    use Thread::Queue;

    my $directories = Thread::Queue->new();
    my $files       = Thread::Queue->new();

    use vars '$NUM_CPUS';
    $NUM_CPUS ||= 4;

    sub _walk {
        while( defined( my $dir = $directories->dequeue ) ) {
            my @entries = ...;
            for my $e (@entries) {
                if( -d $e ) {
                    # depth-first search
                    $directories->insert(0, $e);
                } else {
                    # It would be much faster to enqueue all files in bulk
                    # instead of enqueueing them one by one, but first get
                    # it working before you make it fast
                    $files->enqueue( $e );
                };
            };
        };
    }

    $directories->enqueue( @ARGV );
    for ( 1..$NUM_CPUS ) {
        threads->new( \&_walk )->detach;
    };

    print while defined( $_ = $files->dequeue );
    print 'Done';
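As posted, the sketch never reaches 'Done': once the tree is exhausted, the workers and the main thread all block in `dequeue`. One possible way to let it finish is to count the directories that are still outstanding and close both queues when that count reaches zero. The following is only a sketch under assumptions: `list_files` and `$pending` are names invented here, `end` needs Thread::Queue 3.01 or later, and symlinked directories are skipped to avoid loops.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# list_files() and $pending are hypothetical names for this sketch.
sub list_files {
    my ( $num_workers, @roots ) = @_;
    return unless @roots;

    my $directories = Thread::Queue->new(@roots);
    my $files       = Thread::Queue->new();

    # Number of directories that are queued or currently being read.
    my $pending :shared = scalar @roots;

    my $walk = sub {
        while ( defined( my $dir = $directories->dequeue ) ) {
            if ( opendir my $dh, $dir ) {
                for my $name ( readdir $dh ) {
                    next if $name eq '.' or $name eq '..';
                    my $path = "$dir/$name";
                    if ( -d $path and not -l $path ) {
                        {
                            lock($pending);
                            $pending++;
                        }
                        $directories->enqueue($path);
                    }
                    else {
                        $files->enqueue($path);
                    }
                }
                closedir $dh;
            }

            # The worker that finishes the last directory closes both queues,
            # which makes every blocked dequeue() return undef.
            lock($pending);
            if ( --$pending == 0 ) {
                $directories->end;
                $files->end;
            }
        }
    };

    my @workers = map { threads->create($walk) } 1 .. $num_workers;

    # Consume results while the workers are still running.
    my @found;
    while ( defined( my $file = $files->dequeue ) ) {
        push @found, $file;
    }
    $_->join for @workers;
    return @found;
}

print "$_\n" for list_files( 4, @ARGV );
```

A worker increments the counter for each subdirectory before it decrements the counter for its own directory, so the count can only reach zero once nothing is left to enqueue. The order of the returned files depends on thread scheduling.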

Replies are listed 'Best First'.
Re^2: use threads for dir tree walking really hurts
by exilepanda (Friar) on Sep 01, 2016 at 13:55 UTC
    Thank you very much for your vivid elaboration which is very inspiring. =D
    Why are you doing that?
    Because when an object falls into a thread's scope, the object is cloned, and the clone is not the one I want. And since threads don't share objects or complex data structures (and I don't want to share them one by one), this trick does share the object perfectly... until it doesn't.

    Actually, I can do the job simply with @dirToFetch : shared, but it has the same issue as Thread::Queue: I have to leave it at a nested package scope, and creating it inside an object becomes another mess to share between threads. I am trying to make this a module, so I want to avoid the data getting mixed up if another script calls this module from threads.

    That said, I've updated the code in my OP; it now works and is as fast as dir /s/b. I create as many threads as there are entries in @dirToFetch.

      Why are you doing that?
      Because when an object falls into a thread's scope, the object is cloned, and the clone is not the one I want. And since threads don't share objects or complex data structures (and I don't want to share them one by one), this trick does share the object perfectly... until it doesn't.

      Yes - due to Perl's reference counting, accessing variables in another thread's memory always means that your thread will also be writing at least to the refcount field of that variable. If the refcount happens to reach zero in a thread other than the one where the piece of memory was originally allocated, the memory will be freed in the wrong thread context, which is not fun.

      I'm not aware of a way to make Perl skip its refcounting for variables, and I'm also not convinced that this could work except in the most trivial cases.
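For sharing a nested structure without marking every element by hand, `shared_clone` from threads::shared may be worth a look. The sketch below assumes a plain hash with a `dirToFetch` array, loosely modelled on the object in the original post; the `_take` sub and the sample entries are made up for illustration:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical object layout, loosely modelled on the dirToFetch field above.
my $obj = shared_clone( { dirToFetch => [ 'a', 'b', 'c' ] } );

sub _take {
    my ($o) = @_;
    my $queue = $o->{dirToFetch};
    lock(@$queue);    # serialise access to the shared array
    return shift @$queue;
}

my @threads = map { threads->create( \&_take, $obj ) } 1 .. 2;
my @taken   = sort map { $_->join } @threads;

print "taken: @taken\n";
print "left:  @{ $obj->{dirToFetch} }\n";
```

Both threads shift from the same array, so the parent sees the shortened list afterwards. Which thread receives which entry is not deterministic, hence the `sort`.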

      Another idea to reduce the conceptual load of the approach might be to simply shell out to cmd /c "dir /b /s $directory", but then you need to be aware of the codepage that cmd.exe uses for its output. Ideally you have set the codepage to Unicode / 65001:

      chcp 65001

      ... but then, you still have to live with the fun of Perl and the OS treating the octets for filenames differently unless you properly decode and encode them.
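As a rough illustration of the decode/encode step, assuming the octets arrive as UTF-8 (codepage 65001), the core Encode module can be used like this. The sample path is invented for the example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from `dir /b /s` under codepage 65001.
# The path itself is a made-up example.
my $octets = "C:\\tmp\\caf\xC3\xA9.txt";

# Decode into Perl characters before doing any string processing.
my $name = decode( 'UTF-8', $octets );

# Encode back to octets before writing the name to a UTF-8 output stream.
my $back = encode( 'UTF-8', $name );

printf "%d octets, %d characters\n", length($octets), length($name);
```

The two-octet sequence for "é" becomes a single character after decoding, which is why lengths and pattern matches differ between the raw and the decoded string.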
