To avoid the duplicated reporting, the inner foreach loop needs to be moved out so that it runs after the first (outer) foreach loop has finished. This requires restructuring the %md5 hash so that each digest also records the common file size:
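The reshaped %md5 hash maps each digest to a small record holding the size and the list of paths. A minimal sketch of that shape and how the reporting loop walks it (the digest key is shortened for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Illustrative shape of the restructured %md5 hash: each digest maps
# to a hash holding the common file size and the matching paths.
my %md5 = (
    'd41d8cd9...' => {
        size  => 52,
        files => [ ".\\fox1.txt\n", ".\\fox2.txt\n" ],
    },
);

for my $hash (keys %md5) {
    next unless @{ $md5{$hash}{files} } > 1;   # only report true duplicates
    print for @{ $md5{$hash}{files} };
    print "File size $md5{$hash}{size}\n";
}
```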
#! perl -w
use strict;
use File::Find;
use Digest::MD5;

$| = 1;

my %files;      # file size in bytes => list of paths of that size
my %md5;        # MD5 digest => { size, files }
my $wasted = 0;

my $path = $ARGV[0];
print "Searching for duplicate files in $path\n";
find(\&check_file, $path);

local $" = '';

# First pass: compute an MD5 digest for every file whose size is shared
# by at least one other file.
for my $size (sort { $b <=> $a } keys %files)
{
    next unless @{ $files{$size} } > 1;
    for my $file (@{ $files{$size} })
    {
        open(my $fh, '<', $file) or next;
        binmode($fh);
        my $key = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);
        $md5{$key}{size} = $size;
        push @{ $md5{$key}{files} }, $file . "\n";
    }
}

# Second pass: report each digest shared by more than one file.
for my $hash (keys %md5)
{
    next unless @{ $md5{$hash}{files} } > 1;
    print "\n@{ $md5{$hash}{files} }";
    my $s = $md5{$hash}{size};
    print "File size $s\n";
    $wasted += $s * (@{ $md5{$hash}{files} } - 1);
}

$wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "\n$wasted bytes in duplicated files\n";

sub check_file
{
    # Convert forward slashes to backslashes for Windows-style paths.
    (my $fn = $File::Find::name) =~ tr#/#\\#;
    -f && push @{ $files{ (stat(_))[7] } }, $fn;
}
Example output:
17:56 >perl 1201_SoPW.pl .
Searching for duplicate files in .
.\fox1.txt
.\fox2.txt
File size 52
.\party1.txt
.\party2.txt
.\party3.txt
File size 65
182 bytes in duplicated files
17:56 >
Notes:
- This also fixes the wasted bytes total.
- There is no need for a while loop when re-formatting $wasted: the absence of a /g modifier on the original substitution regex shows that a single comma insertion was intended.
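For totals under 1,000,000 a single anchored substitution is enough, since at most one comma is needed; a quick check (the sample values are arbitrary):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# A single substitution inserts one comma before the last three digits.
my $wasted = 182;
$wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;   # no match: only three digits
print "$wasted\n";                         # prints "182"

my $big = 45678;
$big =~ s/^([-+]?\d+)(\d{3})/$1,$2/;       # matches: 45 + 678
print "$big\n";                            # prints "45,678"
```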
Hope that helps,