Find the n biggest files

by LinuxMatt (Initiate)
on Feb 09, 2012 at 20:36 UTC

LinuxMatt has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have written this little script to find the biggest file, by size, recursively. I would like to modify it to find the 'n' biggest files, where 'n' would be a command-line argument (1 by default). How could I achieve that efficiently? With a 2-dimensional array? Thank you.
#!/usr/bin/perl -w
use strict;
use Cwd;
use File::Find;
no warnings 'File::Find';

my ($max, $name, $n) = (0, 0, 0);

sub scanfile {
    return unless -f;              # only files
    return unless -r;              # readable
    return if -l;                  # not symlinks
    $n++;
    print "." unless ($n % 400);   # "progress bar"
    my $sz = -s;
    if ($max < $sz) {              # save biggest
        $max  = $sz;
        $name = $File::Find::name;
    }
}

print "Scanning...";
$| = 1;                            # flush output
find(\&scanfile, cwd);             # start in current directory, recursively
printf("\nBiggest: %d kb %s\n", $max / 1024, $name);

Replies are listed 'Best First'.
Re: Find the n biggest files
by Eliya (Vicar) on Feb 09, 2012 at 21:35 UTC
    How could I achieve that efficiently? With a 2-dimensional array?

    Yes, if you don't want to store the entire list of files for final sorting, you can incrementally update an array (or, more precisely, an AoA) that just keeps the n biggest files. That is, for every file (in your scanfile routine), you'd do

    push @biggest, [$name, $size];
    @biggest = sort { $a->[1] <=> $b->[1] } @biggest;
    shift @biggest if @biggest > $n;   # remove smallest

    You'll be doing quite a lot of sorting, but if n is reasonably small, you'll only ever be sorting small lists, so this shouldn't be a performance problem. Also, there are ways to optimize this to avoid unnecessary sorting, but I'll leave this as an exercise...
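    A sketch of one such optimization (an illustration, assuming @biggest is kept sorted ascending by size as above): only push and re-sort when the new file can actually make the cut, i.e. when the list isn't full yet or the new size beats the current smallest kept entry:

    if (@biggest < $n || $size > $biggest[0][1]) {
        push @biggest, [$name, $size];
        @biggest = sort { $a->[1] <=> $b->[1] } @biggest;
        shift @biggest if @biggest > $n;   # drop the new smallest
    }

    Files smaller than everything already kept are then skipped without any sorting at all.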

Re: Find the n biggest files
by oko1 (Deacon) on Feb 09, 2012 at 20:55 UTC

    A number of years ago, I modified the 'largest20' script created by Randal Schwartz for my purposes. It's easy enough to tweak it to do what you want:

    #!/usr/bin/perl -w
    use File::Find;

    die "$0 <count> [dir]\n" unless @ARGV >= 1;

    my %size;
    my $count  = $ARGV[0];
    my $search = $ARGV[1] || $ENV{PWD};

    find(sub { $size{$File::Find::name} = -s if -f; }, $search);

    my @sorted = sort { $size{$b} <=> $size{$a} } keys %size;
    splice @sorted, $count if @sorted > $count;
    printf "%10d %s\n", $size{$_}, $_ for @sorted;
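    To match the OP's "1 by default" requirement, the argument handling could be tweaked along these lines (a hypothetical variant, not the script as posted):

    my $count  = shift(@ARGV) || 1;          # 'n' defaults to 1
    my $search = shift(@ARGV) || $ENV{PWD};  # directory defaults to cwd

    with the `die` line dropped, so the script also runs with no arguments at all.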
    -- 
    I hate storms, but calms undermine my spirits.
     -- Bernard Moitessier, "The Long Way"
Re: Find the n biggest files
by Anonymous Monk on Feb 09, 2012 at 21:35 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use File::Find::Rule;
    use Number::Bytes::Human qw(format_bytes);

    my $nBiggest = 20;
    my $progress = 1;

    Main( @ARGV );
    exit( 0 );

    sub Main {
        my @dirs = @_;
        push @dirs, '.' unless @dirs;   # cwd
        my $rule = File::Find::Rule->file->readable;
        $rule->exec(\&progress) if $progress;
        my @files = $rule->in( @dirs );
        @files = sort { $$b[0] <=> $$a[0] } map { [ -s $_, $_ ] } @files;
        @files = @files[ 0 .. $nBiggest - 1 ] if @files > $nBiggest;
        print "\rBiggest $nBiggest\n";
        for my $file ( @files ){
            printf " %10s %s\n", format_bytes( $$file[0] ), $$file[1];
        }
    }

    BEGIN {
        my $n = 0;
        my @s = qw[ * - \ | / ];
        sub progress { local $| = 1; print "\r", $s[ $n++ % 4 ]; 1; }
    }
    __END__
    Biggest 20
          282K blort.1.0.1.0.zip
          104K blort.615.dll
          104K blib/arch/auto/win32/blort/blort.dll
          103K blort.cpp
          100K blort.316.pll
          100K blort.522.dll
           54K win32-blort.tar.gz
           49K x86/win32-blort.tar.gz
           45K main.cpp
           22K list.cpp
           21K blort.pm
           21K blib/lib/win32/blort.pm
           20K fort.h
           15K blib/html/lib/site/blort/blort.html
           13K plfort.h
           13K plfort.cpp
           12K makefile
          5.9K fort.cpp
          4.6K rambo.diff
          4.1K list.h
          3.2K flap.cpp
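    For reference, Number::Bytes::Human's format_bytes() is what produces the "282K"-style sizes above; it renders a raw byte count with base-1024 suffixes, e.g. (assuming the module's default formatting):

    use Number::Bytes::Human qw(format_bytes);
    print format_bytes(1_048_576);   # prints "1.0M"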
      A derived version that avoids holding the full file list in memory:
      use strict;
      use warnings;
      use File::Find::Rule;

      my ($n, @dirs) = @ARGV;
      $n ||= 20;
      @dirs = '.' unless @dirs;

      # keep only the n biggest entries of a { filename => size } hash
      sub top {
          my $file_sizes = shift;
          my @big = grep { defined }
                    (sort { $file_sizes->{$b} <=> $file_sizes->{$a} }
                     keys %$file_sizes)[0 .. $n - 1];
          my %big = map { $_ => $file_sizes->{$_} } @big;
          return \%big;
      }

      my $rule = File::Find::Rule->file->readable;
      my $file_sizes = {};

      $rule->start(@dirs);
      while (defined (my $image = $rule->match)) {
          my $size = (stat $image)[7];
          $file_sizes->{$image} = $size if defined $size;
          # prune back to n entries whenever the buffer grows too large
          $file_sizes = top($file_sizes) if keys %$file_sizes > ($n + 10000);
      }
      $file_sizes = top($file_sizes);

      print "Biggest $n\n";
      for my $file ( sort { $file_sizes->{$b} <=> $file_sizes->{$a} } keys %$file_sizes ) {
          printf " %8dK %s\n", $file_sizes->{$file} / 1024, $file;
      }
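      Taking that one step further, a sketch (an illustration, not part of the reply as posted) that combines the same File::Find::Rule iterator with the incremental bounded list from Eliya's reply above, so memory stays at n entries throughout:

      use strict;
      use warnings;
      use File::Find::Rule;

      my ($n, @dirs) = @ARGV;
      $n ||= 20;
      @dirs = '.' unless @dirs;

      my @top;   # [size, name] pairs, kept sorted ascending by size
      my $rule = File::Find::Rule->file->readable;
      $rule->start(@dirs);
      while (defined (my $file = $rule->match)) {
          my $size = -s $file;
          next unless defined $size;
          next if @top == $n && $size <= $top[0][0];    # can't make the cut
          my $i = 0;
          $i++ while $i < @top && $top[$i][0] < $size;  # find insertion point
          splice @top, $i, 0, [$size, $file];
          shift @top if @top > $n;                      # drop the smallest
      }

      print "Biggest $n\n";
      printf " %8dK %s\n", $$_[0] / 1024, $$_[1] for reverse @top;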
