http://www.perlmonks.org?node_id=85207


in reply to Find duplicate files.

Interesting. I just went through a similar problem of combining four computers' worth of archives. In some cases I had near-duplicates due to slight doc changes and the like, so I wanted a bit more information before deleting anything. This script just gathers the data; I had a second program do the deletes. (About 9,000 files.)

I couldn't go by dates due to bad file management.

Note that the open statement uses the three-argument version, since I had some badly named files such as ' ha' (with a leading space). I wish I could remember which monk pointed out that bit of documentation to me.
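
A rough illustration of the difference (the filename ' ha' here is just a stand-in for the badly named files, and the warn messages are made up for the example):

    # Hypothetical filename with a leading space, like the ones I hit.
    my $file = ' ha';

    # Two-argument open: Perl strips leading/trailing whitespace and treats
    # characters like <, > and | as mode indicators, so it looks for 'ha'
    # and misses the real file.
    open FILE, "$file" or warn "two-arg open missed it: $!";

    # Three-argument open: the mode is explicit and the filename is taken
    # literally, leading space and all.
    open FILE, "<", "$file" or warn "three-arg open failed: $!";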

#!/usr/bin/perl
# allstat.pl
use warnings;
use strict;
use File::Find;
use File::Basename;
use Digest::MD5;

my %hash;
my @temp;

while (my $dir = shift @ARGV) {
    die "Give me a directory to search\n" unless (-d "$dir");
    File::Find::find(\&wanted, "$dir");
}
exit;

sub wanted {
    return unless (-f $_);
    my $md5;
    my $base = File::Basename::basename($File::Find::name, "");
    my $size = -s "$base";
    if ($size >= 10000000) {    # They slowed down the check enough that I skip them
        if ($size >= 99999999) { $size = 99999999; }
        $md5 = 'a' x 32;        # At this point I'll just hand check, less than a dozen files
    }
    else {
        $md5 = md5file("$base");
    }
    if ($File::Find::name =~ /\t/) {    # Just in case, this screws up our tab delimited file
        warn "'$File::Find::name' has tabs in it\n";
    }
    printf("%32s\t%8d\t%s\t%s\n", $md5, $size, $File::Find::name, $base);
}

sub md5file {
    my ($file) = @_;
    unless (open FILE, "<", "$file") {
        warn "Can't open '$file': $!";
        return -1;    # Note we don't want to die just because of one file.
    }
    binmode(FILE);
    my $chksum = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close(FILE);
    return $chksum;
}
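
I'd run it against the directories and save the listing so the delete pass could work from it, something like this (the paths and listing.txt name are just examples):

    perl allstat.pl /archive/box1 /archive/box2 > listing.txt

The second program I mentioned isn't shown here, but a minimal sketch of the kind of thing it did -- read the tab-delimited listing back in, group paths by checksum, and report anything that shows up more than once -- might look like this (dupes.pl is a made-up name, not my actual delete script):

    #!/usr/bin/perl
    # dupes.pl - group the allstat.pl listing by checksum
    use warnings;
    use strict;

    my %by_md5;
    while (my $line = <>) {
        chomp $line;
        my ($md5, $size, $path, $base) = split /\t/, $line;
        # Trim the leading padding added by the %32s / %8d printf formats.
        s/^\s+// for ($md5, $size);
        push @{ $by_md5{$md5} }, $path;
    }

    for my $md5 (sort keys %by_md5) {
        my @paths = @{ $by_md5{$md5} };
        next unless @paths > 1;    # Only report checksums seen more than once.
        # Remember the big files all got the 'a' x 32 placeholder,
        # so those groups still need hand checking.
        print "$md5\n";
        print "    $_\n" for @paths;
    }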