nnigam1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello great monks. I am trying to sort a directory by file size. My plan is to create a script that compares two files, irrespective of their names, to find duplicate files within one directory or across two different directories. I am using md5sum for the comparison, but I do not want to waste time generating checksums for files of different sizes, hence the sort. In my script, I am using the command

@sDir = sort { -s $a <=> -s $b } (readdir D1);

This works for the first two entries, . and .., but after that I get the warning "Use of uninitialized value in numeric comparison (<=>) at line 5". I thought that I had this working, but now I am not sure; maybe my test folder had all numerical file names. Please help, oh wise and wonderful monks.

Replies are listed 'Best First'.
Re: Sort directory by file size
by choroba (Archbishop) on May 18, 2016 at 16:05 UTC
    That's probably the common trap of readdir: it returns the file names, not file paths.


    my @sDir = sort { -s "$dir/$a" <=> -s "$dir/$b" } readdir $D1;

    If the number of files is high, asking for each file's size several times might slow the program significantly. A Schwartzian transform should help:

    my @sDir = map  { $_->[0] }
               sort { $a->[1] <=> $b->[1] }
               map  { [ $_, -s "$dir/$_" ] }
               readdir $D1;

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thanks, o Wise Monks. I will try the suggestions. Here is my whole script, so you can see how I am using the command; I should have put it here earlier.
      More than just writing a script to find duplicates, I want to refresh my Perl skills, which I have lost touch with over the last few years.

      use strict;
      use IO::File;
      use Digest::MD5 qw(md5);

      my ($aLen, $i, $j, $tFile, $sFile);
      my (@sDir, @tDir);
      my ($cFile, $ex1, $ex2);
      my ($cFile2, $chk1, $chk2);
      my ($fs1, $fs2);
      my ($par1, $par2, $par3, $par4) = ($ARGV[0], $ARGV[1], $ARGV[2], $ARGV[3]);

      # Expects a slash at the end if directory
      # par1 is the directory to Check
      # par2 is directory to check against
      # Exact dups are placed in ncn_cmp.bat to delete from first folder
      # Differences in ncn_diff.txt as either missing or different

      $par1 = $par1 || ".\\";
      $ex1  = $par2 || ".err";
      $ex2  = $par3 || ".fmx";
      $par4 = $par4 || "NCN";
      $chk1 = "Apple";

      open (OUT, ">ncn_cmp.bat");
      open (DF,  ">ncn_diff.txt");
      open (SM,  ">ncn_same.txt");

      $tFile = "XXX";
      if (-d $par1) {
          opendir D1, $par1;
          #@tDir = sort (readdir D1);
          #@sDir = sort {-s $a <=> -s $b } @tDir;
          @sDir = sort {-s $a <=> -s $b } (readdir D1);
          $aLen = @sDir;
          for ($j = 0; $j < $aLen; $j++) {
              next if !(-f $par1 . "\\" . $sDir[$j]);
              next if $sDir[$j] eq "ncn_cmp.bat";
              next if $sDir[$j] eq "ncn_diff.txt";
              next if $sDir[$j] eq "ncn_same.txt";
              if ($par1 =~ s/\\$//g) {
                  $sFile = $par1 . $sDir[$j];
              }
              else {
                  $sFile = $par1 . "\\" . $sDir[$j];
              }
              #$sFile = $par1 . "\\" . $sDir[$j];
              if ($tFile eq "XXX") {
                  $tFile = $sFile;
                  next;
              }
              $fs1 = -s $sFile;
              $fs2 = -s $tFile;
              if ($fs1 eq $fs2) {
                  open(TST, "<", $tFile);
                  $chk2 = md5(<TST>);
                  close(TST);
                  open(TST, "<", $sFile);
                  $chk1 = md5(<TST>);
                  close(TST);
              }
              else {
                  # print $sFile . " size " . $fs1 . "\n";
                  # print $tFile . " size " . $fs2 . "\n";
                  $chk2 = "DIF";
              }
              if ($chk1 eq $chk2) {
                  print OUT "del \"" . $sFile . "\"\n";
                  print SM "echo N | comp " . $tFile . " " . $sFile . "\n";
              }
              else {
                  if ($chk2 eq "NCN") {
                      print DF $tFile . " Not Found\n";
                  }
                  else {
                      print DF $tFile . " and " . $sFile . " different\n";
                  }
              }
              $chk1 = "ABC";
              $chk2 = "DEF";
              $tFile = $sFile;
          }
      }
      print OUT "del ncn_diff.txt\n";
      print OUT "del ncn_same.txt\n";
      print OUT "del ncn_cmp.bat\n";
      close(OUT);
      close(DF);
      close(SM);
        (1) When you want to post a chunk of code (or data) at the Monastery, start by typing these two lines into the composition box:

        <code>
        </code>

        Then paste your code (or data) into the space between those two tags; you won't need to muck with anything else in order to get the code (or data) to show up correctly when posted. (Don't forget to put your paragraphs of explanation outside the code tags.)

        (2) Since you want to use file size to determine when to do md5 checksums, I think it would make more sense to build a hash of arrays keyed by byte count: for each distinct byte count, the hash key is the size and the hash value is an array holding the files of that size. Then loop over the hash and do md5s for each set of two or more files with a given size. You don't really need to do any sorting; just keep track of the different sizes. Here's how I would do it (on a unix/linux system):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Digest::MD5;

        die "Usage: $0 dir1 dir2\n"
            unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] );

        my %fsize;
        for my $dir ( @ARGV ) {
            opendir DIR, $dir or die "$dir: $!\n";
            while ( my $fn = readdir DIR ) {
                next unless -f "$dir/$fn";
                push @{$fsize{ -s "$dir/$fn" }}, "$dir/$fn";
            }
        }

        my %fmd5;
        my $digest = Digest::MD5->new;
        for my $bc ( keys %fsize ) {
            next if scalar @{$fsize{$bc}} == 1;
            for my $fn ( @{$fsize{$bc}} ) {
                if ( open( my $fh, "<", $fn )) {
                    $digest->addfile( $fh );
                    push @{$fmd5{ $digest->b64digest }}, $fn;
                }
            }
        }

        for my $md ( keys %fmd5 ) {
            print join( " == ", @{$fmd5{$md}} ) . "\n"
                if ( scalar @{$fmd5{$md}} > 1 );
        }
        (That just lists sets of files that have identical content; you can tweak it to do other things, as you see fit.)
      Thank you o Wise Ones.

      This worked perfectly.

Re: Sort directory by file size
by toolic (Bishop) on May 18, 2016 at 16:00 UTC
    This sorts all files in the current directory by size:
    use warnings;
    use strict;
    use File::Slurp qw(read_dir);

    my @files = sort { -s $a <=> -s $b } grep { -f } read_dir('./');

    It filters out directories (even . and ..). I'm not sure why you are getting that warning.
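    If File::Slurp isn't installed, the same sort can be done with core opendir/readdir. A minimal sketch (note that plain readdir returns bare names, so the directory must be prepended before the -f and -s tests; `$dir` here is just an example):

    ```perl
    use strict;
    use warnings;

    # Core-only equivalent: prepend the directory so -f and -s
    # test real paths instead of bare file names.
    my $dir = './';
    opendir my $dh, $dir or die "$dir: $!";
    my @files = sort { -s "$dir/$a" <=> -s "$dir/$b" }
                grep { -f "$dir/$_" }          # skips . and .. too
                readdir $dh;
    closedir $dh;
    ```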

Re: Sort directory by file size
by Athanasius (Archbishop) on May 18, 2016 at 16:10 UTC

    Hello nnigam1, and welcome to the Monastery!

    I think the problem is that the -s file test returns undef when the file is empty (i.e., when it has a zero size). One way to fix this is to test for undef and change it to zero using the // (logical defined-or) operator (see perlop#Logical-Defined-Or):

    #! perl
    use strict;
    use warnings;

    opendir D1, ...;

    print "$_\n"
        for map  { sprintf qq[%s: %d], $_, -s $_ // 0 }
            sort { (-s $a // 0) <=> (-s $b // 0) }
            grep { ! -d }
            readdir D1;

    Update: Marshall below is correct: -s returns undef only when the file does not exist. My problem was exactly as identified by choroba above: by failing to prepend the directory to the filename, I was calling -s on non-existent files. D’oh! :-(
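    A quick sketch showing both points (the filename is a made-up placeholder that should not exist):

    ```perl
    use strict;
    use warnings;

    # -s on a file that does not exist returns undef ...
    my $missing = -s 'no_such_file_here_xyz';
    print defined $missing ? "defined\n" : "undef\n";   # prints "undef"

    # ... which // can default to zero, avoiding the
    # "uninitialized value" warning inside the sort.
    my $size = (-s 'no_such_file_here_xyz') // 0;
    print "$size\n";                                    # prints "0"
    ```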

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hi Athanasius! I tested the -s operator on my Windows system with a null file (zero length). I do get a numeric 0 for a null file. Here is the test code. I did verify that 'zero' is indeed of 0 length with the file manager. I think -s returns undef if the file does not exist (which is different from a file that exists but is empty).
      #!usr/bin/perl
      use warnings;
      use strict;

      `copy NUL zero`;      # create empty file; cp /dev/null zero on unix?

      my $size = -s 'zero';
      print $size;          # does print "0"

      __END__
      0
      Update: FYI for Windows users who don't use the command line much... NUL is a reserved name in the Windows file system for "the bit bucket". Unix folks are familiar with this concept, but sometimes Windows users aren't. someprogram > NUL does not create a file called "NUL"; it just throws away STDOUT, which goes nowhere, i.e., into the "bit bucket". There is no file called "NUL". To make an empty file, I copied the "bit bucket" to a file.
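      If you'd rather create the empty test file without shelling out, a portable sketch (`zero` is just an example name): opening a file for writing and closing it immediately creates, or truncates to, a zero-length file on both Windows and unix.

      ```perl
      use strict;
      use warnings;

      # Open for writing and close immediately: the file now
      # exists with zero length, no shell command needed.
      open my $fh, '>', 'zero' or die "zero: $!";
      close $fh;

      print -s 'zero', "\n";   # prints 0 - exists, but empty
      unlink 'zero';           # clean up
      ```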