I decided to post this tool, which I created a while ago when I wanted to track statistics on directory trees. For one or more paths, it lists every sub-directory under the path, and for each of those it shows how many sub-dirs, symbolic links and data files it contains, along with the total KB count for the data files. There are lots of options for doing things in different ways, focusing on date ranges, etc.
This does not use File::Find or related modules, because I found those to be too slow for doing really big trees. But I also had to be careful about using a compiled "find" utility: Although "find" is a lot faster (than File::Find), I found some cases (too many files in one directory) where it can fail miserably.
Meanwhile a simple, recursive "opendir/readdir" process works reasonably and consistently well in all cases (and is also a lot faster than File::Find), so I use that by default. (But I allow using "find" as an option, since it tends to be about 10% faster when it works, which is almost all the time.)
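To make the "recursive opendir/readdir" idea concrete before the full script, here is a stripped-down sketch of the traversal. It is only an illustration: the sub name scan_tree and the row layout are made up for this example, and it leaves out all of the options, byte counts and date handling that the real script has.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script below): walk a tree with
# opendir/readdir and collect one [ path, subdirs, symlinks, files ] row
# per directory.
sub scan_tree {
    my ( $dir, $rows ) = @_;
    $rows ||= [];
    opendir( my $dh, $dir ) or return $rows;    # unreadable dirs are skipped
    my ( $dcount, $lcount, $fcount ) = ( 0, 0, 0 );
    while ( defined( my $entry = readdir( $dh ))) {
        next if ( $entry eq '.' or $entry eq '..' );
        my $path = "$dir/$entry";
        if ( -l $path ) {           # -l does an lstat, so symlinks are not followed
            $lcount++;
        }
        elsif ( -d _ ) {            # "_" reuses the lstat result from above
            $dcount++;
            scan_tree( $path, $rows );    # descend depth-first
        }
        elsif ( -f _ ) {
            $fcount++;
        }
    }
    closedir $dh;
    push @$rows, [ $dir, $dcount, $lcount, $fcount ];
    return $rows;
}

# Report on each directory named on the command line (nothing happens without args).
for my $top ( @ARGV ) {
    printf( "%4d %4d %5d %s\n", @{$_}[1..3], $_->[0] ) for @{ scan_tree( $top ) };
}
```

Only one directory handle is open per level of depth, which is why this approach is not bothered by a single directory holding a huge number of files.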
#!/usr/bin/perl
use strict;
use POSIX;
use Time::Local;
use Getopt::Long;
use Pod::Usage;
my $start_time = strftime( "%Y%m%d_%H%M", localtime( $^T ));
my $ageSkip = undef;
my @paths;
my %opt;
my $cmd_ok = GetOptions( \%opt, qw/l:s a=i b=i s=s t d f p m H i u/ );
pod2usage(1) unless ( $cmd_ok );
pod2usage( -exitstatus => 0, -verbose => 2 ) if ( $opt{m} );
# If user wants files in a specified age range, we convert the limit
# date(s) to "script start time minus limit-date, in days" to make the
# limits comparable to the value returned by "-M"
# "a(fter)" = newer than YYYYMMDD, "b(efore)" = older than YYYYMMDD
for my $o ( qw/a b/ ) {
    if ( exists( $opt{$o} )) {
        ( $opt{$o} =~ /^(\d{4})(\d{2})(\d{2})$/ )
            or pod2usage( -message => "Bad date spec for -$o\n",
                          -exitstatus => 2, -verbose => 1 );
        $opt{$o} = ( $^T - timelocal( 0, 0, 0, $3, $2-1, $1 )) / ( 3600 * 24 );
    }
}
if ( $opt{s} and not ( -d $opt{s} and -w $opt{s} )) {
    die "Bad value for '-s' ($opt{s}): must be a directory with write access\n";
}
# $ageSkip will be a reference to a boolean subroutine that applies
# the appropriate test, based on the limit date(s) given; the sub
# will be called with the value of -M for a given file being tested.
# NB: by this point -a and -b hold ages in days, so a SMALLER value
# means a LATER date.
if ( $opt{a} and $opt{b} ) {
    if ( $opt{a} < $opt{b} ) {  # -a is later than -b: tally "after a" OR "before b"
        $ageSkip = sub { return 1 if ($_[0] > $opt{a} and $_[0] < $opt{b}) };
    }
    elsif ( $opt{a} > $opt{b} ) {  # -a is earlier than -b: tally "after a" AND "before b"
        $ageSkip = sub { return 1 if ($_[0] > $opt{a} or $_[0] < $opt{b}) };
    }
    else {  # pod2usage exits here, so nothing more is needed
        pod2usage( -message => "Setting -a and -b options to the same date?? No. Try again.\n",
                   -exitstatus => 2, -verbose => 1 );
    }
}
elsif ( $opt{a} ) {  # tally files that are "after a"
    $ageSkip = sub { return 1 if ($_[0] > $opt{a}) };
}
elsif ( $opt{b} ) {  # tally files that are "before b"
    $ageSkip = sub { return 1 if ($_[0] < $opt{b}) };
}
if ( exists( $opt{l} )) {
    $opt{l} ||= '-';
    my $list_fh;
    if ( $opt{l} eq '-' ) {  # no file name given: read the path list from STDIN
        $list_fh = \*STDIN;
    }
    else {
        open( $list_fh, '<', $opt{l} )
            or pod2usage( -message => "Unable to open path list file $opt{l}: $!\n",
                          -exitstatus => 2, -verbose => 1 );
    }
    while ( <$list_fh> ) {
        chomp;
        next if ( -l $_ and not $opt{H} );
        # use a fresh stat here (not "_", which holds the lstat from -l),
        # so that -H really does follow a symlink to its target directory
        push @paths, $_ if ( -d $_ );
    }
    close $list_fh;
    pod2usage( -message => "No usable paths found in input list $opt{l}\n",
               -exitstatus => 2, -verbose => 1 ) unless @paths;
}
else {
    push @ARGV, "." if ( @ARGV == 0 );
    for ( @ARGV ) {
        if ( -l $_ and not $opt{H} ) {
            warn "Skipping symlink $_ -- use '-H' to follow symlink args\n";
            next;
        }
        if ( not -d $_ ) {
            warn "Skipping $_ -- not a directory\n";
        }
        else {
            push @paths, $_;
        }
    }
}
$|++; # turn off stdout buffering
my $starttime = my $global_starttime = time;
warn sprintf( "data-dir-scan of %d paths started at %s\n",
scalar @paths, $start_time ) if ( $opt{t} );
my $extra_col_fmt = ( $opt{i} ) ? ' %6d %8d' : '';
# the next two variables are used as globals in 'tabulate()' sub
my %inode_seen;
my $outfmt = "%6d %4d %5d %6d %8d$extra_col_fmt %s%s\n";
for my $path ( @paths ) {
    $path =~ s{(?<=.)/+\z}{};  # remove trailing slash(es), but leave a bare "/" alone
    if ( $opt{s} ) {  # save each path scan in a separate output file
        ( my $outname = $path ) =~ s{/+}{%}g;
        open( STDOUT, '>', "$opt{s}/$outname.scan.$start_time" ) or
            die "Unable to save scan results in $opt{s}/$outname.scan.$start_time: $!";
    }
    if ( $opt{f} ) {
        my @cmd = ( 'find', $path, '-type', 'd', '-print0' );
        splice( @cmd, 1, 0, '-H' ) if ( $opt{H} );
        open( my $find, "-|", @cmd ) or die "Unable to launch find: $!\n";
        local $/ = chr(0);  # find -print0 output is null-terminated
        while ( <$find> ) {
            chomp;
            tabulate( $_ );
        }
        close $find;
    }
    else {
        tabulate( $path );
    }
    if ( $opt{t} ) {
        my $elapsed = time - $starttime;
        $starttime += $elapsed;
        my ( $hrs, $min, $sec ) = ( 0, int( $elapsed / 60 ), $elapsed % 60 );
        if ( $min >= 60 ) {  # ">=", so that exactly 60 minutes rolls over to 01:00
            $hrs = int( $min / 60 );
            $min %= 60;
        }
        warn sprintf( " %.2d:%.2d:%.2d elapsed in scan of %s\n",
                      $hrs, $min, $sec, $path );
    }
}
if ( $opt{t} ) {
    my $elapsed = time - $global_starttime;
    my ( $hrs, $min, $sec ) = ( 0, int( $elapsed / 60 ), $elapsed % 60 );
    if ( $min >= 60 ) {  # ">=", so that exactly 60 minutes rolls over to 01:00
        $hrs = int( $min / 60 );
        $min %= 60;
    }
    warn sprintf( " finished %d paths at %s -- %.2d:%.2d:%.2d elapsed\n\n",
                  scalar @paths, strftime( "%Y%m%d_%H%M", localtime()),
                  $hrs, $min, $sec );
}
sub tabulate
{
    my ( $dirname ) = @_;
    my ( $ecount, $lcount, $dcount, $fcount, $bcount, $hlfcount, $hlbcount ) = ( 0 ) x 7;
    my $dh;
    my $dirdate = '';
    my @dstat;
    if ( $opt{d} or $opt{u} ) {
        @dstat = stat $dirname;
        $dirdate .= strftime( "%F_%H:%M:%S ", ( localtime( $dstat[9] ))) if ( $opt{d} );
        if ( $opt{u} ) {
            my $userid = getpwuid( $dstat[4] ) || $dstat[4];
            $userid =~ s/^\d{9,}/NotKnown/;
            $dirdate .= sprintf( "u:=%-8s ", $userid );
        }
    }
    if ( ! opendir( $dh, $dirname )) {
        if ( $opt{p} ) {
            # same layout as $outfmt, but with string fields, so that the
            # hyphens also show up in the two extra '-i' columns
            ( my $dashfmt = $outfmt ) =~ s/(\d)d/${1}s/g;
            my $ncols = ( $opt{i} ) ? 7 : 5;
            printf( $dashfmt, ('-') x $ncols, $dirdate, $dirname );
        }
        return;
    }
    while ( defined( my $file = readdir( $dh ))) {
        next if ( $file eq '.' or $file eq '..' );
        $ecount++;
        if ( -l "$dirname/$file" ) {
            $lcount++;
        }
        elsif ( -d _ ) {
            tabulate( "$dirname/$file" ) unless ( $opt{f} );
            $dcount++;
        }
        elsif ( -f _ ) {
            next if ( defined $ageSkip and $ageSkip->( -M _ ));
            if ( $dirdate =~ /(\d{4}-\d\d-\d\d_\S+ )(\S+ +)?/ ) {
                my ( $dtime, $duser ) = ( $1, defined( $2 ) ? $2 : '' );
                my $ftime = strftime( "%F_%H:%M:%S ", ( localtime(( stat _ )[9] )));
                $dirdate = $ftime . $duser if ( $dtime lt $ftime );
            }
            if ( $opt{i} ) {
                my ( $inode, $nlinks ) = ( stat _ )[1,3];
                if ( $nlinks > 1 ) {
                    if ( exists( $inode_seen{$inode} )) {
                        $hlfcount++;
                        $hlbcount += ( -s _ );
                        next;
                    }
                    else {
                        $inode_seen{$inode} = undef;
                    }
                }
            }
            $fcount++;
            $bcount += ( -s _ );
        }
    }
    closedir $dh;
    # "return" rather than "next": a "next" here would jump out of the sub
    # into the caller's loop, skipping the caller's $dcount++ (or '-t' report)
    return if ( defined $ageSkip and $fcount == 0 );
    my @outnums = ( $ecount, $dcount, $lcount, $fcount, int( $bcount/1024 ) );
    push( @outnums, $hlfcount, int( $hlbcount/1024 ) ) if ( $opt{i} );
    printf( $outfmt, @outnums, $dirdate, $dirname );
}
=head1 NAME
data-dir-scan -- report directories, file counts, disk usage
=head1 SYNOPSIS
data-dir-scan.perl [-a|-b YYYYMMDD] [-d] [-f] [-p] [-t] [-u] [-H] [-i] [-m]
[-s outdir] [-l [path.list] | path ... ]
-a : only scan files after date YYYYMMDD
-b : only scan files before date YYYYMMDD
-d : report modification date of each directory
-f : use unix "find" (default: use recursive opendir/readdir)
-l : read list of paths to scan (from path.list file or STDIN)
-p : report directories that yield 'permission denied'
-s : save each path scan in a separate output file in the given directory
-t : report on how long it takes to finish the scan on STDERR
-u : report owner of each directory
-H : if 'path' arg is a symlink, scan symlink target
-i : keep track of data files having multiple hard links
-m : print man page
=head1 DESCRIPTION
If run with no command-line arguments, it scans the current directory
"." and produces a report with one line of data for this directory and
every subdirectory below it.
If you provide the name(s) of one or more directories, it will scan
and report on each of these (and all their subdirectories) in turn.
Or, you can create a simple text file that contains a list of paths to
scan (one path per line), and give the name of that list file with the
"-l" option. The "-l" option can be used without a file name, in
which case the list of one or more paths to scan will be read from
STDIN (one path per line).
Strings that you provide as path names to scan (via command-line args
or '-l' input) are checked first to see if they exist as directories,
and are ignored/discarded when they do not. (In this case, warnings
are reported for command-line args, but not for '-l' inputs.) By
default, path names that are found to be symlinks are ignored, but as
with 'find' and other common utilities, the '-H' option will cause
symlinks to be followed if their targets are found to be directories.
If the supplied path strings do not yield any directories, the script
will exit with an appropriate error message and the usage summary.
By default, the results are all printed to stdout. With the "-s"
option (which takes the name of a writable directory), the results for
each path being scanned will be saved in that directory, in a file
named "path.scan.YYYYMMDD_HHMM" (where "path" is replaced with the
actual path name, but with slash characters changed to percent-signs
"%", and the suffix is the start time of the run).
For each path to be scanned, we first get a list of all
subdirectories. (This always excludes symlinks that point to other
directories; this program does not have an equivalent of the '-L'
option in 'find' and other utilities.) Then for each of these, we do
a tally of the directory contents and output a one-line summary, with
the following six (or so) fixed-width, space-separated fields (where N
denotes a numeric value, and S denotes a string value):
=over 4
=item N1. total number of entries (all types) in the directory
=item N2. number of immediate subdirectories
=item N3. number of symlinks
=item N4. number of plain data files
=item N5. total kilobyte count of plain data files
=item N6. (optional, if '-i' is used) number of hard-linked files
=item N7. (optional, if '-i' is used) total KB of hard-linked files
=item S1. (optional, if '-d' is used) modification date of path
=item S2. (optional, if '-u' is used) owner of path
=item S3. name of path
=back
When the "-i" option is used, we check each data file to see how many
hard links it has. For every data file having more than one hard
link, we keep track of its inode number; the file's byte count will be
summed into the KB count for the first path found to contain the file.
If that inode shows up again in the scan (in the same directory or a
later one), we increment the hard-link count and hard-link KB (columns
N6 and N7, instead of N4 and N5) for the directory containing the
latter occurrence(s) of the inode.
When the "-d" option is used, the date reported for the directory will
be either the modification date of the directory itself, or the most
recent modification date of any data file within the directory,
whichever is newer.
When the "-u" option is used, the owner's user name is shown with the
prefix "u:=" (e.g. "u:=root"), making it easier to grep for specific
owners. When the owner's name is not known, the numeric user-ID is
shown instead; implausible user-ID numbers are shown as "NotKnown".
Note that the first column represents the sum of the second, third and
fourth columns (or, if the '-i' option is used, the sum of those plus
the 6th column), and all these columns count only the immediate
content of the given path name. To get a full-depth summary of all
subdirs, all symlinks, all data files, all KB subsumed (and/or all
hard links, if -i is used) under a given path, you need to grep for
all lines in the output that contain that path, and sum the respective
columns of numbers.
The "-a YYYYMMDD" and/or "-b YYYYMMDD" options can be used to limit
the report so that only data files with modification times within a
given date range will be tallied -- i.e. "after" or "before" a given
date. In this case, a directory will not be listed unless it contains
data files within the range. The first column will still show the
total number of entries in the directory (regardless of age), but the
number and KB count of data files (columns 4 and 5) will reflect only
those files in the chosen range.
If you supply dates for both "-a" and "-b", the following rules apply:
=over 4
=item * if "-a" (after) is an earlier date than "-b" (before), the
tally will count files that are BOTH newer than "-a" AND older than
"-b" -- i.e. within the time span between -a and -b.
=item * if "-a" is a later date than "-b", the tally will count files
that are EITHER newer than "-a" OR older than "-b" -- i.e. two
discrete ranges that exclude the time span between -a and -b. (This
might not be needed very often, but at least there is a certain
logical consistency about it.)
=item * if "-a" is identical to "-b", this is considered an error.
=back
In all cases, the dates you supply are interpreted with hours, minutes
and seconds set to zero.
When the "-t" option is used, there will be a line printed to STDERR
for each path that you specify for scanning, to report how long it
took to scan that path. Depending on circumstances, STDERR might also
get warning messages from the "find" utility (see '-f' option below),
reporting things like "permission denied", etc.
The '-f' option invokes the unix "find" utility to locate
subdirectories under the given target path(s). By default, we use a
"pure perl" approach (recursive 'opendir/readdir') to descend through
a directory tree depth-first. While this tends to be a little slower
(by 10% or so), it runs with a relatively constant memory footprint,
regardless of the number of files (or total file name length) in any
single directory, unlike some implementations of "/usr/bin/find".
Also, "find" will report "permission denied" when trying to descend
into paths where the user is not allowed to go, whereas the default
method quietly skips these paths.
By default, if the user does not have permission to search a given
directory (i.e. 'read' and 'execute' permissions are not granted),
the directory will not appear at all in the output table. Using the
'-p' option will cause such directories to be listed, with hyphens
instead of numbers in the first five columns.
=head1 BUGS
The current handling of the '-i' option (to track usage of hard links)
is not smart enough to keep sets of inode numbers separate according
to the physical disk volumes being scanned. If '-i' is used on a run
where paths on two or more distinct physical volumes are being
scanned, and if data files exist on the different volumes with 2 or
more hard links, and if any of their respective inode numbers on the
different volumes happen to match, the output results will be
misleading. Of course, if a given run only scans paths on a single
physical volume, this problem can't arise.
=head1 AUTHOR
David Graff <graff (at) ldc (dot) upenn (dot) edu>
=cut
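The DESCRIPTION above says that a full-depth total for a path means picking out every output line for that path and summing the columns. Here is a rough sketch of that step. It assumes the default output layout (five numeric columns followed by the path, i.e. no -d, -u or -i), and the sub name subtree_totals is made up for this example. It matches on "path" or "path/..." instead of a plain substring grep, so that "./a" does not also pick up "./ab".

```perl
use strict;
use warnings;

# Hypothetical helper: sum columns 2-5 of the report over every line whose
# path is $top itself or something below it.
# Assumes default output lines of the form "N1 N2 N3 N4 N5 path".
sub subtree_totals {
    my ( $top, @lines ) = @_;
    my %sum = ( dirs => 0, links => 0, files => 0, kb => 0 );
    for my $line ( @lines ) {
        chomp( my $copy = $line );
        # limit of 6 keeps any spaces inside the path intact
        my ( undef, $dirs, $links, $files, $kb, $path ) = split ' ', $copy, 6;
        next unless defined $path;
        next unless ( $kb =~ /^\d+$/ );    # skip '-p' lines, which hold hyphens
        next unless ( $path eq $top or index( $path, "$top/" ) == 0 );
        $sum{dirs}  += $dirs;
        $sum{links} += $links;
        $sum{files} += $files;
        $sum{kb}    += $kb;
    }
    return \%sum;
}

# Usage: pipe a scan report in on STDIN and name the path to total up.
if ( @ARGV ) {
    my $top = shift @ARGV;
    my $t = subtree_totals( $top, <STDIN> );
    printf( "%d subdirs, %d symlinks, %d files, %d KB under %s\n",
            @{$t}{qw/dirs links files kb/}, $top );
}
```

Note that the KB figure is a sum of per-directory values that were each rounded down, so it can come out slightly lower than a byte-accurate total.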
(update: usage synopsis in POD now matches the one in the code -- yeah I know, I should use Pod::Usage;...)
UPDATED on 2011-03-02: Okay, added Pod::Usage (love it). Also added '-p' option to include "permission denied" directories in output table (but with no tally numbers, of course). Also cleaned up / fixed the logic and man page descriptions for '-l', '-s' and '-t' options.
YET ANOTHER UPDATE on 2012-03-21: Having just seen AnonyMonk's recent reply, I'm happy to report that I had already added just the feature being asked about (reporting owners), as well as some other stuff (checking for hard-links, allowing the top-level path to be a symbolic link) -- check the "-u", "-i" and "-H" options.
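Regarding the '-i' limitation listed under BUGS (inode numbers are only unique within one volume): one possible way around it, which the script above does NOT do, would be to key the "seen" table on device number plus inode number, both of which come back from stat/lstat. A minimal sketch, with a made-up sub name:

```perl
use strict;
use warnings;

my %seen;    # "dev:inode" => count, for multiply-linked files already tallied

# Hypothetical helper: returns 1 if $path is a second (or later) hard link to
# a file that has already been seen, 0 otherwise.
sub is_repeat_link {
    my ( $path ) = @_;
    my ( $dev, $inode, $nlinks ) = ( lstat $path )[0,1,3];
    return 0 unless ( defined $nlinks and $nlinks > 1 );
    # post-increment: false the first time this dev/inode pair shows up
    return $seen{"$dev:$inode"}++ ? 1 : 0;
}
```

Two files with the same inode number on different volumes then get different keys, so they would no longer be mistaken for hard links to each other.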