Find the n biggest files

by LinuxMatt (Initiate)
on Feb 09, 2012 at 20:36 UTC

LinuxMatt has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have written this little script to find the biggest file, by size, recursively. I would like to modify it to find the 'n' biggest files, where 'n' would be a command-line argument (1 by default). How could I achieve that efficiently? With a 2-dimensional array? Thank you.
#!/usr/bin/perl -w
use strict;
use Cwd;
use File::Find;
no warnings 'File::Find';

my ($max, $name, $n) = (0, 0, 0);

sub scanfile {
    return unless -f;              # only files
    return unless -r;              # readable
    return if -l;                  # not symlinks
    $n++;
    print "." unless ($n % 400);   # "progress bar"
    my $sz = -s;
    if ($max < $sz) {              # save biggest
        $max  = $sz;
        $name = $File::Find::name;
    }
}

print "Scanning...";
$| = 1;                            # flush output
find(\&scanfile, cwd);             # start in current directory, recursively
printf("\nBiggest: %d kb %s\n", $max / 1024, $name);

Replies are listed 'Best First'.
Re: Find the n biggest files
by Eliya (Vicar) on Feb 09, 2012 at 21:35 UTC
    How could I achieve that efficiently? With a 2-dimensional array?

    Yes, if you don't want to store the entire list of files for final sorting, you can incrementally update an array (or, more precisely, an AoA) that just keeps the n biggest files. That is, for every file (in your scanfile routine), you'd do

    push @biggest, [$name, $size];
    @biggest = sort { $a->[1] <=> $b->[1] } @biggest;
    shift @biggest if @biggest > $n;   # remove smallest

    You'll be doing quite a lot of sorting, but if n is reasonably small, you'll only ever be sorting small lists, so this shouldn't be a performance problem. Also, there are ways to optimize this to avoid unnecessary sorting, but I'll leave this as an exercise...
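    A sketch of one such optimization (an illustration, assuming @biggest is kept sorted ascending by size as above): only push and re-sort when the new file can actually make the cut, i.e. when the list isn't full yet or the new size beats the current smallest kept entry:

    if (@biggest < $n || $size > $biggest[0][1]) {
        push @biggest, [$name, $size];
        @biggest = sort { $a->[1] <=> $b->[1] } @biggest;
        shift @biggest if @biggest > $n;   # drop the new smallest
    }

    Files smaller than everything already kept are then skipped without any sorting at all.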

Re: Find the n biggest files
by oko1 (Deacon) on Feb 09, 2012 at 20:55 UTC

    A number of years ago, I modified the 'largest20' script created by Randal Schwartz for my purposes. It's easy enough to tweak it to do what you want:

    #!/usr/bin/perl -w
    use File::Find;

    die "$0 <count> [dir]\n" unless @ARGV >= 1;

    my %size;
    my $count  = $ARGV[0];
    my $search = $ARGV[1] || $ENV{PWD};

    find(sub { $size{$File::Find::name} = -s if -f; }, $search);

    my @sorted = sort { $size{$b} <=> $size{$a} } keys %size;
    splice @sorted, $count if @sorted > $count;
    printf "%10d %s\n", $size{$_}, $_ for @sorted;
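    To match the OP's "1 by default" requirement, the argument handling could be tweaked along these lines (a hypothetical variant, not the script as posted):

    my $count  = shift(@ARGV) || 1;          # 'n' defaults to 1
    my $search = shift(@ARGV) || $ENV{PWD};  # directory defaults to cwd

    with the `die` line dropped, so the script also runs with no arguments at all.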
    -- 
    I hate storms, but calms undermine my spirits.
     -- Bernard Moitessier, "The Long Way"
Re: Find the n biggest files
by Anonymous Monk on Feb 09, 2012 at 21:35 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use File::Find::Rule;
    use Number::Bytes::Human qw(format_bytes);

    my $nBiggest = 20;
    my $progress = 1;

    Main( @ARGV );
    exit( 0 );

    sub Main {
        my @dirs = @_;
        push @dirs, '.' unless @dirs;   # cwd
        my $rule = File::Find::Rule->file->readable;
        $rule->exec(\&progress) if $progress;
        my @files = $rule->in( @dirs );
        @files = sort { $$b[0] <=> $$a[0] } map { [ -s $_, $_ ] } @files;
        @files = @files[ 0 .. $nBiggest - 1 ] if @files > $nBiggest;
        print "\rBiggest $nBiggest\n";
        for my $file ( @files ){
            printf " %10s %s\n", format_bytes( $$file[0] ), $$file[1];
        }
    }

    BEGIN {
        my $n = 0;
        my @s = qw[ * - \ | / ];
        sub progress { local $| = 1; print "\r", $s[ $n++ % 4 ]; 1; }
    }
    __END__
    Biggest 20
          282K blort.1.0.1.0.zip
          104K blort.615.dll
          104K blib/arch/auto/win32/blort/blort.dll
          103K blort.cpp
          100K blort.316.pll
          100K blort.522.dll
           54K win32-blort.tar.gz
           49K x86/win32-blort.tar.gz
           45K main.cpp
           22K list.cpp
           21K blort.pm
           21K blib/lib/win32/blort.pm
           20K fort.h
           15K blib/html/lib/site/blort/blort.html
           13K plfort.h
           13K plfort.cpp
           12K makefile
          5.9K fort.cpp
          4.6K rambo.diff
          4.1K list.h
          3.2K flap.cpp
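    For reference, Number::Bytes::Human's format_bytes() is what produces the "282K"-style sizes above; it renders a raw byte count with base-1024 suffixes, e.g. (assuming the module's default formatting):

    use Number::Bytes::Human qw(format_bytes);
    print format_bytes(1_048_576);   # prints "1.0M"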
      A derived version that avoids holding the full file list in memory:
      use strict;
      use warnings;
      use File::Find::Rule;

      my ($n, @dirs) = @ARGV;
      $n ||= 20;
      @dirs = '.' unless @dirs;

      # keep only the n biggest entries of a { filename => size } hash
      sub top {
          my $file_sizes = shift;
          my @big = grep { defined }
                    (sort { $file_sizes->{$b} <=> $file_sizes->{$a} }
                     keys %$file_sizes)[0 .. $n - 1];
          my %big = map { $_ => $file_sizes->{$_} } @big;
          return \%big;
      }

      my $rule = File::Find::Rule->file->readable;
      my $file_sizes = {};

      $rule->start(@dirs);
      while (defined (my $image = $rule->match)) {
          my $size = (stat $image)[7];
          $file_sizes->{$image} = $size if defined $size;
          # prune back to n entries whenever the buffer grows too large
          $file_sizes = top($file_sizes) if keys %$file_sizes > ($n + 10000);
      }
      $file_sizes = top($file_sizes);

      print "Biggest $n\n";
      for my $file ( sort { $file_sizes->{$b} <=> $file_sizes->{$a} } keys %$file_sizes ) {
          printf " %8dK %s\n", $file_sizes->{$file} / 1024, $file;
      }
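      Taking that one step further, a sketch (an illustration, not part of the reply as posted) that combines the same File::Find::Rule iterator with the incremental bounded list from Eliya's reply above, so memory stays at n entries throughout:

      use strict;
      use warnings;
      use File::Find::Rule;

      my ($n, @dirs) = @ARGV;
      $n ||= 20;
      @dirs = '.' unless @dirs;

      my @top;   # [size, name] pairs, kept sorted ascending by size
      my $rule = File::Find::Rule->file->readable;
      $rule->start(@dirs);
      while (defined (my $file = $rule->match)) {
          my $size = -s $file;
          next unless defined $size;
          next if @top == $n && $size <= $top[0][0];    # can't make the cut
          my $i = 0;
          $i++ while $i < @top && $top[$i][0] < $size;  # find insertion point
          splice @top, $i, 0, [$size, $file];
          shift @top if @top > $n;                      # drop the smallest
      }

      print "Biggest $n\n";
      printf " %8dK %s\n", $$_[0] / 1024, $$_[1] for reverse @top;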
