Find duplicate files with exact same files noted

by Lady_Aleena (Deacon)
on Aug 17, 2010 at 06:18 UTC

When I wrote this script, I didn't think to look for a module that did the same thing. However, I am pretty happy with the way this one turned out.

This script will look through a directory and list every set of files that share the same filename and byte size, noting which of those files also have identical contents. A few user modifications are needed to set an optional base directory or to exclude certain files.

#!/usr/bin/perl
use strict;
use warnings;
use File::Compare;
use File::Find;

#If you want to set a base_directory, you can do so here.
my $base_directory;

print "What directory? ";
my $directory = <>;
chomp $directory;

my @file_list;
sub files_wanted {
  my $text = $File::Find::name;
  if ( -f ) {
    push @file_list, $text;
  }
}

#If you set a base directory above, you will need to change $directory to $base_directory.$directory.
find(\&files_wanted, $directory);

#This section creates a hash of arrays of files, with the hash keys being filename.ext and the file size in
#parentheses. The raw file name is the entire path including the filename.
my %files;
for my $raw_file (@file_list) {
  my @file_parts = split(/\//, $raw_file);
  my $file = pop @file_parts;
  my $file_size = -s $raw_file;
  push @{$files{"$file ($file_size bytes)"}}, $raw_file;
}

#This section searches the hash for any file with 2 or more files which share the same filename.ext and size.
#After that, it compares all of the files with those attributes to determine if they share the same contents.
#It will print the list of files with the same filename and size and will tell you which ones share the same
#contents.
for my $file (sort keys %files) {
  if (@{$files{$file}} > 1) {
    my $amount = @{$files{$file}};
    print "$file\t\t$amount\n";
    for my $location1 (@{$files{$file}}) {
      print "\t$location1\n";
      for my $location2 (@{$files{$file}}) {
        unless ($location1 eq $location2) {
          if (compare($location1, $location2) == 0) {
            print "\t\tExact copy: $location2\n";
          }
        }
      }
    }
    print "\n";
  }
}

Update: This is my 100th write-up.

Have a cookie and a very nice day!
Lady Aleena

Re: Find duplicate files with exact same files noted
by BioLion (Curate) on Aug 17, 2010 at 08:03 UTC

    Minor suggestion:

    Rather than prompting users for a directory (prone to typos etc.), you could take it from the @ARGV command-line input (if they have autocomplete) or use a file-select GUI like Tk::FileSelect, Tk::chooseDirectory, or Tk::Getopt. You could even do both and only throw up the selection dialog if the user fails to give any command-line input, as in the sketch below.
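
    A minimal sketch of that fallback idea, assuming Perl/Tk is installed (chooseDirectory is the stock Tk directory chooser; the rest is illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Tk;

    # Take the directory from the command line if one was given...
    my $directory = shift @ARGV;

    # ...otherwise fall back to a Tk directory-selection dialog.
    unless (defined $directory) {
        my $mw = MainWindow->new;
        $mw->withdraw;    # hide the empty main window
        $directory = $mw->chooseDirectory(-title => 'What directory?');
        $mw->destroy;
    }

    defined $directory and -d $directory
        or die "No directory chosen\n";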

    Just a something something...
Re: Find duplicate files with exact same files noted
by wwe (Friar) on Aug 17, 2010 at 13:22 UTC
    Maybe you want to use File::Spec to split file names, as it should be more portable and reliable. From the docs:
    ($volume,$directories,$file) = File::Spec->splitpath( $path );
    ($volume,$directories,$file) = File::Spec->splitpath( $path, $no_file );

      So, using File::Spec, the following would need to be altered.

      my %files;
      for my $raw_file (@file_list) {
        my @file_parts = split(/\//, $raw_file);
        my $file = pop @file_parts;
        my $file_size = -s $raw_file;
        push @{$files{"$file ($file_size bytes)"}}, $raw_file;
      }

      The following is the alteration.

      my %files;
      for my $raw_file (@file_list) {
        my ($volume,$directories,$file) = File::Spec->splitpath($raw_file);
        my $file_size = -s $raw_file;
        push @{$files{"$file ($file_size bytes)"}}, $raw_file;
      }

      So, no more splitting and popping for the file name. Thanks for the tip!

      Have a cookie and a very nice day!
      Lady Aleena
Re: Find duplicate files with exact same files noted
by jwkrahn (Monsignor) on Aug 17, 2010 at 18:29 UTC
    my @file_list;
    sub files_wanted {
      my $text = $File::Find::name;
      if ( -f ) {
        push @file_list, $text;
      }
    }

    #If you set a base directory above, you will need to change $directory to $base_directory.$directory.
    find(\&files_wanted, $directory);

    #This section creates a hash of arrays of files, with the hash keys being filename.ext and the file size in
    #parentheses. The raw file name is the entire path including the filename.
    my %files;
    for my $raw_file (@file_list) {
      my @file_parts = split(/\//, $raw_file);
      my $file = pop @file_parts;
      my $file_size = -s $raw_file;
      push @{$files{"$file ($file_size bytes)"}}, $raw_file;
    }

    Why traverse the directory tree twice (and stat each file twice) when you only have to traverse it once:

    my %files;
    find sub {
        if ( -f ) {
            push @{ $files{ "$_ (" . ( -s _ ) . " bytes)" } }, $File::Find::name;
        }
    }, $directory;

      Wow! I didn't realize that I was traversing the tree twice until you said something. Maybe that is why it took a little while to run. I didn't use your exact suggestion, but I did merge the two pieces into one.

      This ...

      my @file_list;
      sub files_wanted {
        my $text = $File::Find::name;
        if ( -f ) {
          push @file_list, $text;
        }
      }
      find(\&files_wanted, $directory);

      my %files;
      for my $raw_file (@file_list) {
        my @file_parts = split(/\//, $raw_file);
        my $file = pop @file_parts;
        my $file_size = -s $raw_file;
        push @{$files{"$file ($file_size bytes)"}}, $raw_file;
      }

      .. is now this ...

      my %files;
      sub files_wanted {
        my $raw_file = $File::Find::name;
        if ( -f ) {
          my ($volume,$directories,$file) = File::Spec->splitpath($raw_file); #update from a prior suggestion.
          my $file_size = -s $raw_file;
          push @{$files{"$file ($file_size bytes)"}}, $raw_file;
        }
      }
      find(\&files_wanted, $directory);

      The script now runs a little faster since removing the double traversal of the directory tree. Thanks for showing me what I was really doing!

      Have a cookie and a very nice day!
      Lady Aleena
        my %files;
        sub files_wanted {
          my $raw_file = $File::Find::name;
          if ( -f ) {
            my ($volume,$directories,$file) = File::Spec->splitpath($raw_file); #update from a prior suggestion.
            my $file_size = -s $raw_file;
            push @{$files{"$file ($file_size bytes)"}}, $raw_file;
          }
        }

        While you are in the "wanted" subroutine that File::Find::find runs, the full path is in the $File::Find::name variable and the file name only is in the $_ variable, so there is no need to use File::Spec->splitpath() to do something that File::Find::find has already done for you. Also, you are still calling stat on the same file twice when it would be more efficient to do it only once.
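
        A minimal sketch of how both points might be combined, assuming the same %files structure as above (the bare "_" filehandle reuses the stat buffer that -f just filled, so each file is only stat'ed once):

        my %files;
        sub files_wanted {
            return unless -f;        # stats $_ once; result is cached in the _ filehandle
            my $file_size = -s _;    # reuse the cached stat buffer, no second stat
            push @{$files{"$_ ($file_size bytes)"}}, $File::Find::name;    # $_ is the bare file name here
        }
        find(\&files_wanted, $directory);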

Re: Find duplicate files with exact same files noted
by jwkrahn (Monsignor) on Aug 17, 2010 at 22:47 UTC
    for my $location1 (@{$files{$file}}) {
      print "\t$location1\n";
      for my $location2 (@{$files{$file}}) {
        unless ($location1 eq $location2) {

    Why are you looping over the full path twice and comparing them? The only way you could get duplicate entries is if your file system is severely broken or if you have the exact same file in both variables.
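
    If each pair only needs to be compared once, a minimal sketch using array indices (assuming File::Compare's compare() and the @{$files{$file}} list from the original script) would be:

    my @locations = @{$files{$file}};
    for my $i (0 .. $#locations) {
        print "\t$locations[$i]\n";
        for my $j ($i + 1 .. $#locations) {    # only pairs not yet seen
            if (compare($locations[$i], $locations[$j]) == 0) {
                print "\t\tExact copy: $locations[$j]\n";
            }
        }
    }

    This halves the number of compare() calls, at the cost of reporting each match only under the first file of the pair.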

      Here is some sample output to show you what is going on.

      In the following example, there are two files with the exact same file name and size in two different directories, but they do not have the same contents.

      season_01.txt (234 bytes)		2
      	C:/Documents and Settings/ME/My Documents/fantasy/files/data/Movies/Episode_lists/Leverage/season_01.txt
      	C:/Documents and Settings/ME/My Documents/fantasy/files/data/Movies/Episode_lists/Tremors/Tremors_The_Series/season_01.txt

      In this next example, there are two files with the exact same file name and size in two different directories, and they do have the same contents.

      logo.gif (4060 bytes)		2
      	C:/Documents and Settings/ME/My Documents/My Pictures/other/WebLayout/graphics/logo.gif
      		Exact copy: C:/Documents and Settings/ME/My Documents/My Pictures/other/WebLayout/logos/logo.gif
      	C:/Documents and Settings/ME/My Documents/My Pictures/other/WebLayout/logos/logo.gif
      		Exact copy: C:/Documents and Settings/ME/My Documents/My Pictures/other/WebLayout/graphics/logo.gif

      Here there are six files with the same file name and size in different directories. Some do have the same contents, but some do not have the same contents.

      desktop.ini (345 bytes)		6
      	C:/Documents and Settings/ME/My Documents/My Music/Depeche Mode/101 Disc 1/desktop.ini
      		Exact copy: C:/Documents and Settings/ME/My Documents/My Music/Depeche Mode/101 Disc 2/desktop.ini
      	C:/Documents and Settings/ME/My Documents/My Music/Depeche Mode/101 Disc 2/desktop.ini
      		Exact copy: C:/Documents and Settings/ME/My Documents/My Music/Depeche Mode/101 Disc 1/desktop.ini
      	C:/Documents and Settings/ME/My Documents/My Music/Duran Duran/Greatest/desktop.ini
      	C:/Documents and Settings/ME/My Documents/My Music/Prince & the Revolution/Purple Rain/desktop.ini
      	C:/Documents and Settings/ME/My Documents/My Music/Queen/Live Killers Disc 1/desktop.ini
      		Exact copy: C:/Documents and Settings/ME/My Documents/My Music/Queen/Live Killers Disc 2/desktop.ini
      	C:/Documents and Settings/ME/My Documents/My Music/Queen/Live Killers Disc 2/desktop.ini
      		Exact copy: C:/Documents and Settings/ME/My Documents/My Music/Queen/Live Killers Disc 1/desktop.ini

      This output is exactly what I wanted. I don't have many examples left, since I already went through and deleted most of the duplicate files. I just have to make a few more decisions for the rest of them.

      Addition: I had about half a dozen copies of the same two image files all over My Documents.

      Have a cookie and a very nice day!
      Lady Aleena
Re: Find duplicate files with exact same files noted
by Tux (Monsignor) on Aug 18, 2010 at 06:22 UTC

    Nice. Instead of File::Compare, I use Digest::MD5 or similar, and I don't care about the file name that much, as I also want to find duplicate binary files with different names, such as MP3 or JPG files. To find duplicate files:

    $ cp tshirt.jpg duplicate.image
    $ dups.pl
    I've MD5'd 191 files to 191 checksums
    ./3.jpg	./image00111.jpg
    ./4.jpg	./image00222.jpg
    ./duplicate.image	./tshirt.jpg
    $ dups.pl -q image
    ./duplicate.image
    ./image00111.jpg
    ./image00222.jpg
    ./image00222surf.jpg
    ./image00333.jpg
    ./image00554.jpg
    ./image00665.jpg
    ./image00776.jpg
    ./image00887.jpg
    ./image009.jpg
    ./image00998.jpg
    ./image010109.jpg
    ./image011.jpg
    ./image0121210.jpg
    $
    #!/pro/bin/perl

    use strict;
    use warnings;

    use Digest::MD5 qw( md5_hex );
    use DB_File;
    use File::Find;
    use Getopt::Long qw(:config bundling nopermute);

    my $opt_q = 0;    # Query the database
    GetOptions (
        "q" => \$opt_q,
        ) or die "usage: dups.pl [-q]\n";

    my %sum;
    tie my %md5, "DB_File", "dups.md5";

    if ($opt_q) {
        my @db = sort keys %md5;
        untie %md5;
        foreach my $pat (@ARGV) {
            print "$_\n" for grep m/$pat/i => @db;
            }
        exit;
        }

    my $nfile = 0;
    find (sub {
        if (-d and -f "$_/dups.md5") {
            tie my %d5, "DB_File", "$_/dups.md5";
            foreach my $f (keys %d5) {
                $md5{"$File::Find::name/$f"} //= $d5{$f};
                }
            untie %d5;
            }
        -f or return;
        (my $f = $File::Find::name) =~ s:^_new/::;
        printf STDERR " %6d %-70.70s\r", ++$nfile, $f;
        if (exists $md5{$f}) {
            push @{$sum{$md5{$f}}}, $f;
            return;
            }
        local $/;
        open my $p, "< $_" or die "$f: $!\n";
        my $sum = md5_hex (<$p>);
        push @{$sum{$md5{$f} = $sum}}, $f;
        }, sort glob "*");
    print STDERR "I've MD5'd $nfile files to ", scalar keys %md5, " checksums\n";

    open STDOUT, "| sort";
    foreach my $r (values %sum) {
        my @p = @$r;
        @p > 1 or next;
        $p[0] =~ m{(?:^|/)\d+/} and @p =
            map  { $_->[0] }
            sort { $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2] or $a->[0] cmp $b->[0] }
            map  { [ $_, (m/(\d+)\b/g), 0, 0, 0 ] }
            @p;
        print join "\t", @p;
        print "\n";
        }
    close STDOUT;

    Enjoy, Have FUN! H.Merijn

      Hi, I wrote a similar script using the MD5 hash for detecting duplicates. It is not cleaned up or optimized, but it does the intended job. It is now in my tool collection.

      #!/usr/bin/perl
      #
      # Find duplicate files in specified directories using md5sum values to
      # identify duplicates.
      #
      # (C) 2009 S.M.Mahesh

      use strict;
      use warnings;
      use File::Find;
      use Digest::MD5;

      my $version = 0.1;
      my %md5sums;
      my $md5 = Digest::MD5->new();

      sub Usage()
      {
          print<<USAGEDOC;
      $0 v$version - FindDuplicate script

      USAGE: $0 <DIR1> [DIR2...DIRn]

      where,
          DIR1..DIRn  Specifies the directories to search

      EXAMPLE:
          $0 /home/user/downloads /home/user/documents
      USAGEDOC
          exit 1;
      }

      sub wanted
      {
          return unless -f $File::Find::name; # Return if it is not a plain file
          return if -l $File::Find::name;     # Return in case this is a symlink

          if (open(FILE, $File::Find::name) ) {
              binmode(FILE);
              my $sum = $md5->addfile(*FILE)->hexdigest();
              close(FILE);

              my $aref = $md5sums{$sum};
              if ( defined $aref ) {
                  push @$aref, $File::Find::name;
              }
              else {
                  my @list = ($File::Find::name);
                  $md5sums{$sum} = \@list;
              }
          }
          else {
              print "ERROR: Could not open '$File::Find::name' for reading\n";
          }

          return;
      }

      Usage() if( $#ARGV < 0 );

      foreach my $dir (@ARGV) {
          print "$dir \n";
          unless ( -d $dir ) {
              print "ERROR: '$dir' is not a valid directory\n";
              next;
          }
          find(\&wanted, $dir);
      }

      print "\n", '-'x25, "\n";
      print "Printing duplicate files (if any)\n";
      print '-'x25, "\n\n";

      foreach my $sum (sort keys %md5sums) {
          my $list = $md5sums{$sum};
          if ($#$list > 0) {
              print "$sum :\n";
              foreach my $file (@$list) {
                  print "\t $file\n";
              }
              print"\n";
          }
      }
      print '-'x25, "\n";

      Mahesh
