Re: Duplicate File Finder script reporting multiples

To avoid the duplication, move the inner foreach loop so that it runs after the first (outer) foreach loop has finished. This requires a change to the %md5 hash, which now stores the size and the list of files for each digest:

#! perl -w
use strict;
use File::Find;
use Digest::MD5;

local $|   = 1;
my $path   = $ARGV[0];

print "Searching for duplicate files in $path\n";
find(\&check_file, $path);

local $"   = '';
my %files;
my %md5;
my $wasted = 0;
my $size   = 0;

for my $size (sort {$b <=> $a} keys %files)
{
    next unless   @{ $files{$size} } > 1;

    for my $file (@{ $files{$size} })
    {
        open(my $fh, '<', $file) or next;   # skip files that cannot be read
        binmode($fh);
        my $key = Digest::MD5->new->addfile($fh)->hexdigest;
        $md5{$key}{size} = $size;
        push @{ $md5{$key}{files} }, $file . "\n";
    }
}

for my $hash (keys %md5)
{
    next unless @{ $md5{$hash}{files} } > 1;
    print "\n@{$md5{$hash}{files}}";
    my $s = $md5{$hash}{size};
    print "File size $s\n";
    $wasted += $s * (@{ $md5{$hash}{files} } - 1);
}

$wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "\n$wasted bytes in duplicated files\n";

sub check_file
{
    (my $fn = $File::Find::name) =~ tr#/#\\#;
    -f && push @{ $files{ (stat(_))[7] } }, $fn;
}

Example output:

17:56 >perl 1201_SoPW.pl .
Searching for duplicate files in .

.\fox1.txt
.\fox2.txt
File size 52

.\party1.txt
.\party2.txt
.\party3.txt
File size 65

182 bytes in duplicated files

17:56 >
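
For anyone who wants to see the regrouping in isolation, here is a minimal sketch that builds the same kind of %md5 structure from in-memory strings instead of files. The names group_by_digest and wasted_bytes and the sample contents are my own for illustration; they are not part of the script above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Group names by the MD5 digest of their content.
# Returns { digest => { size => N, files => [ names ] } }.
sub group_by_digest {
    my ($content) = @_;               # hash ref: name => content
    my %md5;
    for my $name (sort keys %$content) {
        my $data = $content->{$name};
        my $key  = md5_hex($data);
        $md5{$key}{size} = length $data;
        push @{ $md5{$key}{files} }, $name;
    }
    return \%md5;
}

# Sum size * (copies - 1) over every digest that has more than one file.
sub wasted_bytes {
    my ($groups) = @_;
    my $wasted = 0;
    for my $group (values %$groups) {
        my $copies = @{ $group->{files} };
        $wasted += $group->{size} * ($copies - 1) if $copies > 1;
    }
    return $wasted;
}

# Hypothetical stand-ins for file contents (not taken from the thread).
my %content = (
    'fox1.txt' => "quick brown fox\n",
    'fox2.txt' => "quick brown fox\n",
    'dog.txt'  => "lazy yellow dog\n",   # same length, different digest
);

my $groups = group_by_digest(\%content);
for my $key (sort keys %$groups) {
    my @files = @{ $groups->{$key}{files} };
    next unless @files > 1;            # report each digest once, after all hashing
    print "@files: $groups->{$key}{size} bytes each\n";
}
print wasted_bytes($groups), " bytes in duplicated files\n";
```

Because reporting happens only after every item has been hashed, each group of identical contents is printed exactly once, however many copies it has.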

Notes:

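This also fixes the wasted bytes total: each group of N identical files now counts size * (N - 1) bytes as wasted, so the example above gives 52 * 1 + 65 * 2 = 182.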
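A single pass of the substitution on $wasted inserts only one comma, which is enough for totals below 1,000,000. Larger totals need the substitution repeated in a while loop; a /g modifier would not help, because the regex is anchored at the start of the string.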
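
To check how the comma substitution behaves on totals of different lengths, here is a small sketch comparing one pass with a repeated pass. The helper names commify_once and commify_loop are mine, not from the script; the regex is the one used on $wasted above.

```perl
use strict;
use warnings;

# One pass of the substitution used in the script.
sub commify_once {
    my ($n) = @_;
    $n =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $n;
}

# Repeat the same substitution until it no longer matches.
sub commify_loop {
    my ($n) = @_;
    1 while $n =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $n;
}

for my $n (182, 52000, 1234567) {
    printf "%-8s once: %-10s loop: %s\n", $n, commify_once($n), commify_loop($n);
}
```

The two helpers agree for values up to six digits. From seven digits on, the single pass places only the last comma, while the loop keeps going until every group of three is separated.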

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Duplicate File Finder script reporting multiples by GotToBTru (Prior) on Mar 30, 2015 at 14:32 UTC
I wondered 1) why the use of %files in check_file did not cause an uninitialized variable error, and 2) why the variable declaration a few lines later did not wipe out the results of the find(). I ran the program in the debugger and inspected the contents of %files right after the find() call, and I got "empty array". I inspected it again right after the variable declaration, and it was fully populated. I'm guessing this is some sort of symbol table magic. Can someone explain? Dum Spiro Spero
Re^3: Duplicate File Finder script reporting multiples by choroba (Cardinal) on Mar 30, 2015 at 14:39 UTC
Ad 1) The subroutine check_file is declared in the scope where %files is declared, so the variable is visible to it. If you dereference an undefined value that's not read-only, it autovivifies. Ad 2) The declaration by itself doesn't change the value. An assignment would: try changing it to `my %files = ();` لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
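
A minimal sketch of both points, assuming the same layout as the script (the sub is called before the `my` statements have run, but is defined textually after them). The names %kept, %reset and fill are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

fill();             # called before either statement below has executed

my %kept;           # bare declaration: no assignment takes place here
my %reset = ();     # declaration plus assignment of an empty list

printf "kept:  %d key(s)\n", scalar keys %kept;
printf "reset: %d key(s)\n", scalar keys %reset;

sub fill {
    # Both hashes are visible here because the sub is compiled after
    # their declarations; the array references autovivify on push.
    push @{ $kept{a} },  1;
    push @{ $reset{a} }, 1;
}
```

This relies on file-scope behaviour that is easy to trip over, so in real code it is probably clearer to declare %files before calling find().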
Re^2: Duplicate File Finder script reporting multiples by Anonymous Monk on Mar 30, 2015 at 09:42 UTC
Thank you very much, Sir. I'll give it a go when I'm back in the office.

