Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Re: Function to sweep a file tree

by vr (Curate)
on Jun 17, 2020 at 06:13 UTC ( #11118172=note: print w/replies, xml ) Need Help??

in reply to Function to sweep a file tree

If speed is important, prefer readdir over glob (3d run results):

use strict; use warnings; use feature 'say'; use Data::Dump 'dd'; use Time::HiRes 'time'; use File::Glob ':bsd_glob'; use Fcntl ':mode'; use File::stat; use Win32::LongPath; STDOUT-> autoflush; my $dir = 'c:/program files'; { print "testing glob... "; my $t = time; my @to_do = ( $dir ); my @result; while ( my $item = shift @to_do ) { my $stat = stat $item; next unless $stat; my $mode = $stat-> mode; if ( $mode & S_IFREG ) { push @result, [ $item, $stat-> size ]; } elsif ( $mode & S_IFDIR ) { unshift @to_do, grep { !m{ /\.{1,2}$ }x } bsd_glob( "$item/{.,}*" ) } } printf "%d files, %.03f s\n", scalar( @result ), time - $t; } { print "testing readdir... "; my $t = time; my @to_do = ( $dir ); my @result; while ( my $item = shift @to_do ) { my $stat = statL $item; next unless $stat; if ( $stat-> { mode } & S_IFREG ) { push @result, [ $item, $stat-> { size }]; } elsif ( $stat-> { mode } & S_IFDIR ) { my $d = Win32::LongPath-> new; $d-> opendirL( $item ) or next; unshift @to_do, map { "$item/$_" } grep { !m{ ^\.{1,2}$ }x } $d-> readdirL; } } printf "%d files, %.03f s\n", scalar( @result ), time - $t; } __END__ testing glob... 18670 files, 3.492 s testing readdir... 18670 files, 1.863 s

Sorry I've re-written your code completely, it was for investigation only. (One minor complaint may be that grep {} glob(), glob() looks like (grep {} glob()), glob() was intended, but this complaint is irrelevant to results). Also irrelevant (to speed) and maybe distracting are details which have happened in final script (which is not too DRY to begin with): bsd_glob, File::stat, no file tests as such, and, also, use of Win32::LongPath itself. The latter is slightly slower than opendir/readdir, and if trees are grown in controlled environment, not really necessary.

I suspect the explanation is glob performs stat on produced items (as File::Find does, if I'm not mistaken), it can't be so much slower because of strings manipulation only. BTW, I observe similar difference on Linux.

There's a cheat in that same number of files was neatly reported above -- but lists may not be the same, your result may have differing numbers, I get differing numbers for e.g. c:\users. I didn't investigate if it's access rights issues, or some specially treated magic directories on Windows, or links, etc. By that time I already discarded all error logging :). Maybe not important if trees are grown in data files land.

Replies are listed 'Best First'.
Re^2: Function to sweep a file tree
by bojinlund (Prior) on Jun 21, 2020 at 06:21 UTC

    Thanks vr ! It has really helped me a lot!

    I've re-written your code completely ...

    OK, Good!

    1) I rewrote my FS_sweep based on your proposal

    my $d = Win32::LongPath-> new; $d-> readdirL;
    Doing stress tests using C:, I got a number of problems. The script sometimes works. But often it stucks, loops and was difficult to kill. Had to use the ctrl-alt-del/activity handler to stop it. I think that the memory is overwrite by  $d-> readdirL;. The largest directory I have found returned by  $d-> readdirL; has 38252 entries. MS File Explorer says: 38250 objekt

    2) I rewrote my FS_sweep using while ( my $name = $dir->readdirL() )

    Below follows a script which can be used to test this approach.

    This works much better. But there are still problems. The script is sometimes stuck (no cpu time is used) or looping (cpu time is used, but nothing is happening). When logging the found file pathes to a file, the frequency of the problem seem to be lower. There is probably some type of timing problem in  readdirL. By accessing C:/Users the problem is rather frequent. Sweeping less complicated file structure as C: seem to be OK!

    File path like <C:/Users/bo/Application Data/> are sometimes returned by readdirL()!?

    Here are some results

    dir: C: #dirs: 79923 #files: 305816 #nodes: 385739 1/s: 5749 dir: D: #dirs: 7776 #files: 115255 #nodes: 123031 1/s: 14499 dir: Q: #dirs: 907 #files: 16095 #nodes: 17002 1/s: 12374 dir: C: #dirs: 67183 #depth: 13 #files: 259099 #nodes: 326282 1/s: +5558 (skipping 'C:/Users')

    The in C:/Windows found number of files are 193 less than shown by the MS File Explorer and for directories 36 less.

    In the documentation of Win32::LongPath there is in one of the examples

    # recurse if dir if (($file ne '.') && (($stat->{attribs} & (FILE_ATTRIBUTE_DIRECTORY | FILE_ATTRIBUTE_REPARSE_POINT)) == FILE_ATTRIBUTE_DIRECTORY)) { search_tree ($name); next; }
    What does $stat->{mode} & S_IFDIR correspond to?

    Here is my test script:

    use strict; use warnings; use 5.010; use Path::Tiny qw( path ); use Data::Dump qw(dump dd ddx); use Win32::LongPath; use File::stat; use Fcntl ':mode'; use Benchmark qw(:all); binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my @dir_skip = ( '$RECYCLE.BIN', 'System Volume Information', 'Config.Msi', 'C:/Use +rs' #, 'C:/AMD', 'C:/hp' ); my $dir_skip = join '|', map { quotemeta } @dir_skip; my $dir_skip_regexp = qr {$dir_skip$}; sub do_dir { my $dir_path = shift; my $sub_ref = shift; # callback my $dir = Win32::LongPath->new; unless ( $dir->opendirL($dir_path) ) { warn "!! unable to open $dir_path ($^E)"; return; } my @dir_name; while ( my $name = $dir->readdirL() ) { if ( $name =~ m{ ^[.]{1,2}$ }x ) { next; } my $path = "$dir_path/$name"; my $stat = lstatL($path); if ( !defined $stat ) { next if $^E =~ /tkomst nekad/; warn "!! SKIP $^E <$path>"; next; } if ( $stat->{mode} & S_IFREG ) { # normal file $sub_ref->( $path, $stat ); # call callback } elsif ( $stat->{mode} & S_IFDIR ) { # dir push @dir_name, $name; } else { warn "!! ? $name"; } } return \@dir_name; } { my @to_do; my $max_depth; sub td_to_txt { dump @to_do; } sub td_clear { @to_do = (); $max_depth = 0; } sub td_down { push @to_do, []; my $depth = @to_do; $max_depth = $depth if $depth > $max_depth; } sub max_depth { return $max_depth } sub td_add { my $name = shift; @to_do = [] unless (@to_do); push @{ $to_do[-1] }, $name; } sub td_add_aref { my $dir_aref = shift; push @{ $to_do[-1] }, @$dir_aref; } sub td_path_next { return join '/', map { $_->[0] } @to_do; } sub td_remove_dir { my $aref = $to_do[-1]; my $removed = shift @$aref; # remove dir return if @$aref; # more dirs while ( $aref = $to_do[-1] ) { if ( !@$aref ) { $removed = pop @to_do; # remove level next; } $removed = shift @{ $to_do[-1] }; # remove dir return if @$aref; # more dirs } } } sub FS_sweep { my $dir_path = shift; my $sub_ref = shift; td_clear; td_down; td_add($dir_path); my $dir_cnt = 0; my $t0 = Benchmark->new; while ( my $dir_path = td_path_next ) { if ( $dir_path =~ m{$dir_skip_regexp} ) { warn "SKIPING DIR $dir_path"; td_remove_dir; next; } $dir_cnt++; my $dir_name_aref = do_dir( $dir_path, $sub_ref ); my $sub_dir_nof = @$dir_name_aref; if ( $sub_dir_nof > 1000 ) { warn "!! MANY SUBDIR $sub_dir_nof in $dir_path"; } if (@$dir_name_aref) { # subdir td_down; td_add_aref($dir_name_aref); } else { warn '!! ! defined $dir_name_aref' if !defined $dir_name_a +ref; td_remove_dir; } } my $td = timediff( Benchmark->new, $t0 ); return $dir_path, $dir_cnt, max_depth, $td; } my @output; my $file_cnt = 0; sub file_log { my $file_path = shift; $file_cnt++; warn "!# $file_cnt $file_path\n" if not $file_cnt % 10000; } sub summary { my $dir_path = shift; my $dir_cnt = shift; my $max_depth = shift; my $td = shift; my $node_cnt = $dir_cnt + $file_cnt; my $node_per_second = $td->cpu_p > 0 ? $node_cnt / $td->cpu_p : -1 +; my $txt = sprintf "\n\n!! FS_sweep summary dir: %s\n #dirs: %d #depth: %d #files: %d # +nodes: %d 1/s: %d\n", $dir_path, $dir_cnt, $max_depth, $file_cnt, $node_cnt, $node_per +_second; $file_cnt = 0; return $txt; } my $ls_log = 1; # activate listing of files in 'ls_log.txt' my $log_fh; $log_fh = path('ls_log.txt')->openw_utf8 if $ls_log; sub FS_file_big { my $file_path = shift; my $stat_hash_ref = shift; file_log($file_path); say {$log_fh} $file_path if $ls_log; my $size = $stat_hash_ref->{size}; push @output, "BIG $file_path size: $size\n" if $size > 100000000; } sub output { if (@output) { say "Output:"; say map { "$_\n" } grep { defined } @output[ 0 .. 100, 1000 .. 1010, 2000 .. 20 +10 ]; say "END Output\n"; @output = (); } STDOUT->flush; } say summary( FS_sweep( 'C:/Windows', \&FS_file_big ) ); output; say summary( FS_sweep( 'C:', \&FS_file_big ) ); output; foreach my $dev (qw{ }) { # add C D ... warn "!! START $dev: =======================================\n"; say summary( FS_sweep( "$dev:", \&FS_file_big ) ); output; }

      Hi, glad to hear your script was improved. To your questions:

      (1) Re: S_IFDIR. Easy to check: as I see there are 2 entries/directories in my C:/Users tree with FILE_ATTRIBUTE_REPARSE_POINT bit set; for both S_IFDIR is also set. So, to strictly follow Win32::LongPath documentation as to what to treat as "directory" i.e. exclude such entries, -- use fragment you quoted from documentation. Stress tests are good, but my impression was your tree to monitor should be known beforehand, whether it can or cannot contain reparse points. And, I meant the same thing (about "controlled environment"), when I said built-in readdir might be enough vs. readdirL, if a tree is known to be grown by native speakers/users of single (system) Windows code page. For versatility, sure, Win32::LongPath should be preferred.

      (2) Re: <C:/Users/bo/Application Data/> - I don't understand (was this a question/problem?)

      About freezes: that's not good. I can repeatedly sweep C:/Users and C:/Windows without issues, both with my and your scripts. At first, I suspected iterative readdirL (furthermore interspersed with lstatL calls) might be the reason, but, no, your script runs OK here. Still, to debug, maybe try to switch from iterative use to list context call. BTW, I don't observe any noticeable speed difference.

      Lastly, about different total results of (various) Perl techniques and/or what Explorer reports: if you really want to pursue to the core, there's dichotomy with extensive logging, but of course you know the method.

        About freezes:

        I have made a script with
        foreach my $done ( 1 .. 100 ) { FS_sweep( 'C:', \&FS_file_big ); ...

        Running this script several times in Emacs, using the compile command, all resulted in a freeze. The the number of calls to FS_sweep before freeze was between 0 and 91.

        Running the script in the command shell resulted in 0 to 14 succeeded calls to FS_sweep. In most cases none.

        <C:/Users/bo/Application Data/> (was this a question/problem?)

        use strict; use warnings; use 5.010; use Data::Dump qw(dump dd ddx); use Win32::LongPath; use Fcntl ':mode'; sub do_readdirL_arr { my $dir_path = shift; my $dir = Win32::LongPath->new; unless ( $dir->opendirL($dir_path) ) { warn "!! unable to open $dir_path ($^E)"; return; } my @name = $dir->readdirL(); return \@name; } sub do_readdirL_while { my $dir_path = shift; my $dir = Win32::LongPath->new; unless ( $dir->opendirL($dir_path) ) { warn "!! unable to open $dir_path ($^E)"; return; } my @name; while ( my $name = $dir->readdirL() ) { push @name, $name; } return \@name; } my @path = ( 'C:/Documents and Settings', 'C:/ProgramData/Application +Data', 'C:/ProgramData/Desktop', 'C:/ProgramData/Start Menu', 'C:/ProgramData/Templates', #'C:/Users', 'C:/Windows/appcompat/Programs', 'C:/Windows/System32/Com/dmp +', #'C:/Windows/System32/spool', 'C:/Windows/System32/Tasks', 'C:/Windows/System32/Tasks_Migrated', 'C:/Windows/SysWOW64/Com/dmp +', 'C:/Windows/SysWOW64/Tasks', 'C:/Windows/Temp', ); foreach my $path (@path) { say "$path: array: ", dump( do_readdirL_arr($path) ), ' while: ', dump do_readdirL_while($path); } __DATA__ C:/Documents and Settings: array: ["g"] while: ["~"] C:/ProgramData/Application Data: array: ["u"] while: ["{"] C:/ProgramData/Desktop: array: ["H"] while: ["|"] C:/ProgramData/Start Menu: array: ["~"] while: ["m"] C:/ProgramData/Templates: array: ["K"] while: ["e"] C:/Windows/appcompat/Programs: array: ["u"] while: ["Q"] C:/Windows/System32/Com/dmp: array: ["m"] while: ["l"] C:/Windows/System32/Tasks: array: ["\\"] while: ["\x7F"] C:/Windows/System32/Tasks_Migrated: array: ["\22"] while: ["\f"] C:/Windows/SysWOW64/Com/dmp: array: ["p"] while: ["u"] C:/Windows/SysWOW64/Tasks: array: ["["] while: ["p"] C:/Windows/Temp: array: ["\x{105}"] while: ["\x{110}"]

        __DATA__ contains an example of output. It changes for every run.

        I have no access to the directories using the MS File explorer.

        When I in FS_sweep skip those directories, 'C:/Users' and C:/Windows/System32/spool' there are no freezes.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11118172]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (4)
As of 2021-09-18 18:05 GMT
Find Nodes?
    Voting Booth?

    No recent polls found