PerlMonks  

Re: Duplicate File Finder script reporting multiples

by Athanasius (Archbishop)
on Mar 30, 2015 at 08:00 UTC


in reply to Duplicate File Finder script reporting multiples

To avoid the duplication, the inner foreach loop needs to be moved to after the first (outer) foreach loop. This requires a change to the %md5 hash:

#! perl -w

use strict;
use File::Find;
use Digest::MD5;

local $| = 1;

my $path = $ARGV[0];

print "Searching for duplicate files in $path\n";

find(\&check_file, $path);

local $" = '';

my %files;
my %md5;
my $wasted = 0;
my $size   = 0;

for my $size (sort {$b <=> $a} keys %files)
{
    next unless @{ $files{$size} } > 1;

    for my $file (@{ $files{$size} })
    {
        open(FILE, $file) or next;
        binmode(FILE);
        my $key = Digest::MD5->new->addfile(*FILE)->hexdigest;
        $md5{$key}{size} = $size;
        push @{ $md5{$key}{files} }, $file . "\n";
    }
}

for my $hash (keys %md5)
{
    next unless @{ $md5{$hash}{files} } > 1;

    print "\n@{$md5{$hash}{files}}";
    my $s = $md5{$hash}{size};
    print "File size $s\n";
    $wasted += $s * (@{ $md5{$hash}{files} } - 1);
}

$wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "\n$wasted bytes in duplicated files\n";

sub check_file
{
    (my $fn = $File::Find::name) =~ tr#/#\\#;
    -f && push @{ $files{ (stat(_))[7] } }, $fn;
}

Example output:

17:56 >perl 1201_SoPW.pl .
Searching for duplicate files in .

.\fox1.txt
.\fox2.txt
File size 52

.\party1.txt
.\party2.txt
.\party3.txt
File size 65

182 bytes in duplicated files

17:56 >
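
The digest step in the inner loop can also be pulled out into a small helper. What follows is only a sketch under stated assumptions: file_md5 is a hypothetical name that does not appear in the script, and it swaps the bareword FILE handle for a lexical one so that each handle is closed before the next file is opened. It returns undef for a file that cannot be opened, which mirrors the "or next" in the script.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (not part of the original script): return the
# hex MD5 digest of a file, or undef if the file cannot be opened.
sub file_md5
{
    my ($file) = @_;

    open(my $fh, '<', $file) or return;    # bare return gives undef in scalar context
    binmode($fh);                           # read raw bytes, as the script does
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);

    return $digest;
}
```

Inside the loop, the open/binmode/addfile lines could then become something like: defined(my $key = file_md5($file)) or next;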

Notes:

  • This also fixes the wasted bytes total.
  • A while loop is only needed for re-formatting $wasted if the total can reach 1,000,000 or more. Without a /g modifier, each pass of the substitution inserts a single comma, so one pass covers totals up to 999,999.
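
As a point of reference for the last note: one pass of that substitution inserts a single comma, so the repeated form only starts to matter once the total reaches seven digits. A minimal sketch comparing the two, where commify_once and commify_all are hypothetical names introduced here purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: a single pass, as in the script above.
sub commify_once
{
    my ($n) = @_;
    $n =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $n;
}

# Hypothetical helper: repeat the substitution until it stops matching,
# inserting one comma per pass from right to left.
sub commify_all
{
    my ($n) = @_;
    1 while $n =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
    return $n;
}
```

For the 182 bytes in the example run neither form changes anything, since the pattern needs at least four digits to match.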

Hope that helps,

Athanasius <°(((>< contra mundum
Iustus alius egestas vitae, eros Piratica,

Replies are listed 'Best First'.
Re^2: Duplicate File Finder script reporting multiples
by GotToBTru (Prior) on Mar 30, 2015 at 14:32 UTC

    I wondered 1) why the use of %files in check_file did not cause an uninitialized variable error, and 2) why the variable declaration a few lines later did not wipe out the results of the find(). I ran the program in the debugger and tried to inspect the contents of %files after the function call, and I got "empty array". I inspected it again right after the variable declaration, and it was fully populated. I'm guessing this is some sort of symbol table magic. Can someone explain?

    Dum Spiro Spero
      Ad 1) The subroutine check_file is defined in the scope where %files is declared, so it refers to that variable. If you dereference an undefined value that's not read-only, it autovivifies.

      Ad 2) The declaration doesn't change the value. The assignment would; try changing it to

      my %files = ();
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
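
Both points can be reproduced in a few lines. This is a minimal sketch rather than the original script: %seen, %reset, fill_seen and fill_reset are hypothetical names, and the behaviour of a bare "my" that is reached after the variable has already been used is what this thread observed, not something to rely on as a documented guarantee.

```perl
use strict;
use warnings;

fill_seen();        # runs before the 'my %seen;' statement is reached
my %seen;           # plain declaration: the entry added by fill_seen() survives
my $seen_count = keys %seen;

fill_reset();       # runs before the 'my %reset = ();' statement is reached
my %reset = ();     # the list assignment empties the hash
my $reset_count = keys %reset;

print "seen: $seen_count, reset: $reset_count\n";

# Both subs compile under strict because they appear after the
# declarations in the file, even though they are called before those
# statements execute. Dereferencing the not-yet-existing $seen{a} as an
# array autovivifies it.
sub fill_seen  { push @{ $seen{a}  }, 'x' }
sub fill_reset { push @{ $reset{a} }, 'x' }
```

Declaring the hashes before the first call that uses them avoids depending on this ordering at all.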
Re^2: Duplicate File Finder script reporting multiples
by Anonymous Monk on Mar 30, 2015 at 09:42 UTC

    Thank you very much, Sir. I'll give it a go when I'm back in the office.
