PerlMonks
Get useful info about a directory tree

by graff (Chancellor)
on Jan 22, 2011 at 18:17 UTC ( #883695=CUFP )

In response to this SoPW node, I decided to post this tool I created a while ago, when I wanted to track statistics on directory trees. For one or more paths, it lists all the sub-directories in the path, and for each of those, it shows how many sub-dirs, symbolic links and data files it contains, along with the total KB count for the data files. Lots of options for doing things different ways, focusing on date ranges, etc.

This does not use File::Find or related modules, because I found those to be too slow on really big trees. But I also had to be careful about using a compiled "find" utility: although "find" is a lot faster than File::Find, I found some cases (too many files in one directory) where it can fail miserably.

Meanwhile a simple, recursive "opendir/readdir" process works reasonably and consistently well in all cases (and is also a lot faster than File::Find), so I use that by default. (But I allow using "find" as an option, since it tends to be about 10% faster when it works, which is almost all the time.)
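To make the comparison concrete, here is a minimal, hypothetical sketch of the recursive opendir/readdir approach — a stripped-down illustration, not the full script (the sub name scan_dir and the two-column output are made up for this example):

```perl
#!/usr/bin/perl
# Minimal sketch of a recursive opendir/readdir walk (simplified,
# hypothetical version of the approach used by the full script).
use strict;
use warnings;

sub scan_dir {
    my ( $dir ) = @_;
    opendir( my $dh, $dir ) or return;   # quietly skip unreadable dirs
    my ( $files, $bytes, @subdirs ) = ( 0, 0 );
    while ( my $entry = readdir $dh ) {
        next if $entry =~ /^\.\.?$/;     # skip . and ..
        my $path = "$dir/$entry";
        next if -l $path;                # don't follow symlinks
        if    ( -d _ ) { push @subdirs, $path }       # reuse the lstat buffer
        elsif ( -f _ ) { $files++; $bytes += -s _ }
    }
    closedir $dh;
    printf "%6d %8d %s\n", $files, int( $bytes / 1024 ), $dir;
    scan_dir( $_ ) for @subdirs;         # recurse after closing the handle
}

scan_dir( shift @ARGV ) if @ARGV;        # e.g. perl scan-sketch.pl /var/log
```

Note that the recursion happens after closedir, so only the list of subdirectory names (not an open handle per level) is held in memory while descending — that's where the constant memory footprint comes from.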

#!/usr/bin/perl
use strict;
use POSIX;
use Time::Local;
use Getopt::Long;
use Pod::Usage;

my $start_time = strftime( "%Y%m%d_%H%M", localtime( $^T ));
my $ageSkip = undef;
my @paths;
my %opt;
my $cmd_ok = GetOptions( \%opt, qw/l:s a=i b=i s=s t d f p m H i u/ );
pod2usage(1) unless ( $cmd_ok );
pod2usage( -exitstatus => 0, -verbose => 2 ) if ( $opt{m} );

# If user wants files in a specified age range, we convert the limit
# date(s) to "script start time minus limit-date, in days" to make the
# limits comparable to the value returned by "-M"
# "a(fter)" = newer than YYYYMMDD, "b(efore)" = older than YYYYMMDD

for my $o ( qw/a b/ ) {
    if ( exists( $opt{$o} )) {
        ( $opt{$o} =~ /^(\d{4})(\d{2})(\d{2})$/ )
            or pod2usage( -message => "Bad date spec for -$o\n",
                          -exitstatus => 2, -verbose => 1 );
        $opt{$o} = ( $^T - timelocal( 0, 0, 0, $3, $2-1, $1 )) / ( 3600 * 24 );
    }
}
if ( $opt{s} and not ( -d $opt{s} and -w $opt{s} )) {
    die "Bad value for '-s' ($opt{s}): must be a directory with write access\n";
}

# $ageSkip will be a reference to a boolean subroutine that applies
# the appropriate test, based on the limit date(s) given; the sub
# will be called with the value of -M for a given file being tested.

if ( $opt{a} and $opt{b} ) {
    if ( $opt{a} < $opt{b} ) {     # tally files that are "after a" and "before b"
        $ageSkip = sub { return 1 if ($_[0] > $opt{a} and $_[0] < $opt{b}) };
    }
    elsif ( $opt{a} > $opt{b} ) {  # tally files that are "before b" or "after a"
        $ageSkip = sub { return 1 if ($_[0] > $opt{a} or $_[0] < $opt{b}) };
    }
    else {
        pod2usage( -message => "Setting -a and -b options to the same date?? No. Try again.\n",
                   -exitstatus => 2, -verbose => 1 );
        die;
    }
}
elsif ( $opt{a} ) {  # tally files that are "after a"
    $ageSkip = sub { return 1 if ($_[0] > $opt{a}) };
}
elsif ( $opt{b} ) {  # tally files that are "before b"
    $ageSkip = sub { return 1 if ($_[0] < $opt{b}) };
}

if ( exists( $opt{l} )) {
    $opt{l} ||= '-';
    open( L, $opt{l} )
        or pod2usage( -message => "Unable to open path list file $opt{l}: $!\n",
                      -exitstatus => 2, -verbose => 1 );
    while ( <L> ) {
        chomp;
        next if ( -l and not $opt{H} );
        push @paths, $_ if ( -d _ );
    }
    close L;
    pod2usage( -message => "No usable paths found in input list $opt{l}\n",
               -exitstatus => 2, -verbose => 1 ) unless @paths;
}
else {
    push @ARGV, "." if ( @ARGV == 0 );
    for ( @ARGV ) {
        if ( -l and not $opt{H} ) {
            warn "Skipping symlink $_ -- use '-H' to follow symlink args\n";
            next;
        }
        if ( not -d ) {
            warn "Skipping $_ -- not a directory\n";
        }
        else {
            push @paths, $_;
        }
    }
}

$|++;  # turn off stdout buffering

my $starttime = my $global_starttime = time;
warn sprintf( "data-dir-scan of %d paths started at %s\n",
              scalar @paths, $start_time ) if ( $opt{t} );

my $extra_col_fmt = ( $opt{i} ) ? ' %6d %8d' : '';

# the next two variables are used as globals in 'tabulate()' sub
my %inode_seen;
my $outfmt = "%6d %4d %5d %6d %8d$extra_col_fmt %s%s\n";

for my $path ( @paths ) {
    $path =~ s:/$::;  # remove trailing slash, if any
    if ( $opt{s} ) {  # save each path scan in a separate output file
        ( my $outname = $path ) =~ s{/+}{%}g;
        open( STDOUT, ">$opt{s}/$outname.scan.$start_time" )
            or die "Unable to save scan results in $opt{s}/$outname.scan.$start_time: $!";
    }
    if ( $opt{f} ) {
        my @cmd = ( 'find', $path, '-type', 'd', '-print0' );
        splice( @cmd, 1, 0, '-H' ) if ( $opt{H} );
        open( my $find, "-|", @cmd ) or die "Unable to launch find: $!\n";
        local $/ = chr(0);
        while ( <$find> ) {
            chomp;
            tabulate( $_ );
        }
        close $find;
    }
    else {
        tabulate( $path );
    }
    if ( $opt{t} ) {
        my $elapsed = time - $starttime;
        $starttime += $elapsed;
        my ( $hrs, $min, $sec ) = ( 0, int( $elapsed / 60 ), $elapsed % 60 );
        if ( $min > 60 ) {
            $hrs = int( $min / 60 );
            $min %= 60;
        }
        warn sprintf( " %.2d:%.2d:%.2d elapsed in scan of %s\n",
                      $hrs, $min, $sec, $path );
    }
}
if ( $opt{t} ) {
    my $elapsed = time - $global_starttime;
    my ( $hrs, $min, $sec ) = ( 0, int( $elapsed / 60 ), $elapsed % 60 );
    if ( $min > 60 ) {
        $hrs = int( $min / 60 );
        $min %= 60;
    }
    warn sprintf( " finished %d paths at %s -- %.2d:%.2d:%.2d elapsed\n\n",
                  scalar @paths, strftime( "%Y%m%d_%H%M", localtime()),
                  $hrs, $min, $sec );
}

sub tabulate {
    my ( $dirname ) = @_;
    my $ecount = my $lcount = my $dcount = my $fcount = my $bcount =
        my $hlfcount = my $hlbcount = 0;
    my $dh;
    my $dirdate = '';
    my @dstat;
    if ( $opt{d} | $opt{u} ) {
        @dstat = stat $dirname;
        $dirdate .= strftime( "%F_%H:%M:%S ", ( localtime( $dstat[9] ))) if ( $opt{d} );
        if ( $opt{u} ) {
            my $userid = getpwuid( $dstat[4] ) || $dstat[4];
            $userid =~ s/^\d{9,}/NotKnown/;
            $dirdate .= sprintf( "u:=%-8s ", $userid );
        }
    }
    if ( ! opendir( $dh, $dirname )) {
        if ( $opt{p} ) {
            my $ncols = ( $opt{i} ) ? 7 : 5;
            printf( "%6s %4s %5s %6s %8s$extra_col_fmt %s%s\n",
                    ('-') x $ncols, $dirdate, $dirname );
        }
        return;
    }
    while ( my $file = readdir( $dh )) {
        next if ( $file =~ /^\.{1,2}$/ );
        $ecount++;
        if ( -l "$dirname/$file" ) {
            $lcount++;
        }
        elsif ( -d _ ) {
            tabulate( "$dirname/$file" ) unless ( $opt{f} );
            $dcount++;
        }
        elsif ( -f _ ) {
            next if ( defined $ageSkip and $ageSkip->( -M _ ));
            if ( $dirdate =~ /(\d{4}-\d\d-\d\d_\S+ )(\S+ +)?/ ) {
                my ( $dtime, $duser ) = ( $1, $2 );
                my $ftime = strftime( "%F_%H:%M:%S ", (localtime((stat _)[9])));
                $dirdate = $ftime . $duser if ( $dtime lt $ftime );
            }
            if ( $opt{i} ) {
                my ( $inode, $nlinks ) = ( stat _ )[1,3];
                if ( $nlinks > 1 ) {
                    if ( exists( $inode_seen{$inode} )) {
                        $hlfcount++;
                        $hlbcount += ( -s _ );
                        next;
                    }
                    else {
                        $inode_seen{$inode} = undef;
                    }
                }
            }
            $fcount++;
            $bcount += ( -s _ );
        }
    }
    closedir $dh;
    return if ( defined $ageSkip and $fcount == 0 );
    my @outnums = ( $ecount, $dcount, $lcount, $fcount, int($bcount/1024) );
    push( @outnums, $hlfcount, int($hlbcount/1024) ) if ( $opt{i} );
    printf( $outfmt, @outnums, $dirdate, $dirname );
}

=head1 NAME

data-dir-scan -- report directories, file counts, disk usage

=head1 SYNOPSIS

data-dir-scan.perl [-a|-b YYYYMMDD] [-d] [-f] [-p] [-t] [-u] [-H] [-i]
                   [-s [outpath]] [-l [path.list] | path ... ]

 -a : only scan files after date YYYYMMDD
 -b : only scan files before date YYYYMMDD
 -d : report modification date of each directory
 -f : use unix "find" (default: use recursive opendir/readdir)
 -l : read list of paths to scan (from path.list file or STDIN)
 -p : report directories that yield 'permission denied'
 -s : save each path scan in a separate output file (path.scan)
 -t : report on how long it takes to finish the scan on STDERR
 -u : report owner of each directory
 -H : if 'path' arg is a symlink, scan symlink target
 -i : keep track of data files having multiple hard links
 -m : print man page

=head1 DESCRIPTION

If run with no command-line arguments, it scans the current directory
"." and produces a report with one line of data for this directory and
every subdirectory below it.

If you provide the name(s) of one or more directories, it will scan and
report on each of these (and all their subdirectories) in turn. Or, you
can create a simple text file that contains a list of paths to scan
(one path per line), and give the name of that list file with the "-l"
option. The "-l" option can be used without a file name, in which case
the list of one or more paths to scan will be read from STDIN (one path
per line).

Strings that you provide as path names to scan (via command-line args
or '-l' input) are checked first to see if they exist as directories,
and are ignored/discarded when they do not. (In this case, warnings are
reported for command-line args, but not for '-l' inputs.) By default,
path names that are found to be symlinks are ignored, but as with
'find' and other common utilities, the '-H' option will cause symlinks
to be followed if their targets are found to be directories. If the
supplied path strings do not yield any directories, the script will
exit with an appropriate error message and the usage summary.

By default, the results are all printed to stdout. With the "-s"
option, the results for each path being scanned will be saved in a
file named "path.scan" (where "path" is replaced with the actual path
name, but with slash characters changed to percent-signs "%"). If "-a"
and/or "-b" options have also been used to limit the date range of the
listing, these option values are included as part of each output file
name.

For each path to be scanned, we first get a list of all
subdirectories. (This always excludes symlinks that point to other
directories; this program does not have an equivalent of the '-L'
option in 'find' and other utilities.) Then for each of these, we do a
tally of the directory contents and output a one-line summary, with
the following six (or so) fixed-width, space-separated fields (where N
denotes a numeric value, and S denotes a string value):

=over 4

=item N1. total number of entries (all types) in the directory

=item N2. number of immediate subdirectories

=item N3. number of symlinks

=item N4. number of plain data files

=item N5. total kilobyte count of plain data files

=item N6. (optional, if '-i' is used) number of hard-linked files

=item N7. (optional, if '-i' is used) total KB of hard-linked files

=item S1. (optional, if '-d' is used) modification date of path

=item S2. (optional, if '-u' is used) owner of path

=item S3. name of path

=back

When the "-i" option is used, we check each data file to see how many
hard links it has. For every data file having more than one hard link,
we keep track of its inode number; the file's byte count will be
summed into the KB count for the first path found to contain the file.
If that inode shows up again in the scan (in the same directory or a
later one), we increment the hard-link count and hard-link KB (columns
N6 and N7, instead of N4 and N5) for the directory containing the
latter occurrence(s) of the inode.

When the "-d" option is used, the date reported for the directory will
be either the modification date of the directory itself, or the most
recent modification date of any data file within the directory,
whichever is newer.

When the "-u" option is used, the owner's user name is shown with the
prefix "u:=" (e.g. "u:=root"), making it easier to grep for specific
owners. When the owner's name is not known, the numeric user-ID is
shown instead; implausible user-ID numbers are shown as "NotKnown".

Note that the first column represents the sum of the second, third and
fourth columns (or, if the '-i' option is used, the sum of those plus
the 6th column), and all these columns count only the immediate
content of the given path name. To get a full-depth summary of all
subdirs, all symlinks, all data files, all KB subsumed (and/or all
hard links, if '-i' is used) under a given path, you need to grep for
all lines in the output that contain that path, and sum the respective
columns of numbers.

The "-a YYYYMMDD" and/or "-b YYYYMMDD" options can be used to limit
the report so that only data files with modification times within a
given date range will be tallied -- i.e. "after" or "before" a given
date. In this case, a directory will not be listed unless it contains
data files within the range. The first column will still show the
total number of entries in the directory (regardless of age), but the
number and KB count of data files (columns 4 and 5) will reflect only
those files in the chosen range.

If you supply dates for both "-a" and "-b", the following rules apply:

=over 4

=item *

if "-a" (after) is an earlier date than "-b" (before), the tally will
count files that are BOTH newer than "-a" AND older than "-b" -- i.e.
within the time span between -a and -b.

=item *

if "-a" is a later date than "-b", the tally will count files that are
EITHER newer than "-a" OR older than "-b" -- i.e. two discrete ranges
that exclude the time span between -a and -b. (This might not be
needed very often, but at least there is a certain logical consistency
about it.)

=item *

if "-a" is identical to "-b", this is considered an error.

=back

In all cases, the dates you supply are interpreted with hours, minutes
and seconds set to zero.

When the "-t" option is used, there will be a line printed to STDERR
for each path that you specify for scanning, to report how long it
took to scan that path. Depending on circumstances, STDERR might also
get warning messages from the "find" utility (see '-f' option below),
reporting things like "permission denied", etc.

The '-f' option invokes the unix "find" utility to locate
subdirectories under the given target path(s). By default, we use a
"pure perl" approach (recursive 'opendir/readdir') to descend through
a directory tree depth-first. While this tends to be a little slower
(by 10% or so), it runs with a relatively constant memory footprint,
regardless of the number of files (or total file name length) in any
single directory, unlike some implementations of "/usr/bin/find".
Also, "find" will report "permission denied" when trying to descend
into paths where the user is not allowed to go, whereas the default
method quietly skips these paths.

By default, if the user does not have permission to search a given
directory (i.e. 'read' and 'execute' permissions are not granted), the
directory will not appear at all in the output table. Using the '-p'
option will cause such directories to be listed, with hyphens instead
of numbers in the first five columns.

=head1 BUGS

The current handling of the '-i' option (to track usage of hard links)
is not smart enough to keep sets of inode numbers separate according
to the physical disk volumes being scanned. If '-i' is used on a run
where paths on two or more distinct physical volumes are being
scanned, and if data files exist on the different volumes with 2 or
more hard links, and if any of their respective inode numbers on the
different volumes happen to match, the output results will be
misleading. Of course, if a given run only scans paths on a single
physical volume, this problem can't arise.

=head1 AUTHOR

David Graff <graff (at) ldc (dot) upenn (dot) edu>

=cut
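For readers curious about the inode bookkeeping behind '-i' described in the POD, it boils down to a few stat calls and a seen-hash. Here is a small, self-contained sketch (the sub name classify and its return values are hypothetical, made up for illustration; they are not part of the script):

```perl
# Hypothetical sketch of the '-i' hard-link bookkeeping: a file's bytes
# are charged to the first directory where its inode is seen; repeats
# go into the separate hard-link columns.
use strict;
use warnings;

my %inode_seen;

# Returns 'plain' for ordinary single-link files, 'first' the first
# time a multiply-linked inode appears, and 'dup' on later sightings.
sub classify {
    my ( $path ) = @_;
    my ( $inode, $nlinks ) = ( stat $path )[1, 3];
    return 'plain' if $nlinks < 2;
    return 'dup'   if exists $inode_seen{$inode};
    $inode_seen{$inode} = undef;   # remember it; the value is irrelevant
    return 'first';
}
```

As the BUGS section notes, keying only on the inode number is safe within one filesystem; a multi-volume scan would need (device, inode) pairs, i.e. fields 0 and 1 of stat.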

(updated to include "readmore" tags)

(update: usage synopsis in POD now matches the one in the code -- yeah I know, I should use Pod::Usage; ...)

UPDATED on 2011-03-02: Okay, added Pod::Usage (love it). Also added '-p' option to include "permission denied" directories in output table (but with no tally numbers, of course). Also cleaned up / fixed the logic and man.page descriptions for '-l', '-s' and '-t' options.

YET ANOTHER UPDATE on 2012-03-21: Having just seen AnonyMonk's recent reply, I'm happy to report that I had already added just the feature being asked about (reporting owners), as well as some other stuff (checking for hard-links, allowing the top-level path to be a symbolic link) -- check the "-u", "-i" and "-H" options.

Re: Get useful info about a directory tree
by merlyn (Sage) on Jan 24, 2011 at 22:40 UTC
    Meanwhile a simple, recursive "opendir/readdir" process works reasonably and consistently well in all cases (and is also a lot faster than File::Find),
    Since that's precisely what File::Find is doing, I'm curious as to how you were using it. Do you have some code from your failed File::Find experiments that we can benchmark?

    -- Randal L. Schwartz, Perl hacker


      Thanks for asking... Yes, I'd be grateful if someone could check my work on this: On a "mature" darwin laptop (2.16 GHz macbook, osx 10.6.6, perl 5.10.0, File::Find 1.12), with a tree containing 800 or so subdirectories at various depths, I get output like this (29485 is the count of data files):
      Benchmark: timing 30 iterations of Shell find pipe, _ Opendir recursion, __ File::Find module...
      29485
        Shell find pipe:      7 wallclock secs ( 0.56 usr  0.40 sys +  1.03 cusr  3.87 csys =  5.86 CPU) @  5.12/s (n=30)
      29485
        _ Opendir recursion:  12 wallclock secs ( 3.79 usr +  5.63 sys =  9.42 CPU) @  3.18/s (n=30)
      29485
        __ File::Find module: 20 wallclock secs ( 8.17 usr +  7.32 sys = 15.49 CPU) @  1.94/s (n=30)
      If you remove the parts about writing file names to temp files, the times all go down, but the relative proportions stay about the same. (I was writing temp files just to be sure I got the same output with each method. It took a few tries to handle the symbolic links consistently.)

      I presume File::Find has a noticeable amount of overhead relative to a minimal recursion of opendir/readdir, but I haven't looked at the source code to check on that.
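For anyone who wants to reproduce this kind of measurement, here is a skeleton using the core Benchmark module. It is a guess at the setup, not the code actually benchmarked above: the two counting subs (by_opendir, by_file_find) are simplified stand-ins that just count plain files, skipping symlinks, the same way in both methods.

```perl
#!/usr/bin/perl
# Hypothetical skeleton for benchmarking opendir/readdir recursion
# against File::Find over the same tree.
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Find;

# Count plain files via minimal recursive opendir/readdir.
sub by_opendir {
    my ( $dir ) = @_;
    my $n = 0;
    opendir( my $dh, $dir ) or return 0;
    for my $e ( readdir $dh ) {
        next if $e =~ /^\.\.?$/;
        my $p = "$dir/$e";
        next if -l $p;                       # skip symlinks
        if    ( -d _ ) { $n += by_opendir($p) }
        elsif ( -f _ ) { $n++ }
    }
    closedir $dh;
    return $n;
}

# Same count via File::Find (which also skips symlinks by default).
sub by_file_find {
    my ( $root ) = @_;
    my $n = 0;
    find( sub { $n++ if !-l $_ and -f _ }, $root );
    return $n;
}

# Pass a directory on the command line to run the comparison.
if ( my $root = shift @ARGV ) {
    timethese( 30, {
        'Opendir recursion' => sub { by_opendir($root) },
        'File::Find module' => sub { by_file_find($root) },
    });
}
```

Both subs should return identical counts on a tree with no permission surprises, which is the sanity check the 29485 figures above are providing.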

        This script is really nice and useful. Is it possible to include the owner of the files/directories?
