PerlMonks
Find Duplicate Files

by salvadors (Pilgrim)
on Jan 04, 2001 at 23:08 UTC

Category: Utility Scripts
Author/Contact Info Tony / tony@tmtm.com
Description: As my original Find Duplicate Files script was so popular, I decided to take the advice of turning it into a module. Here's the initial version of it. I'd appreciate feedback on ways to provide a nicer, more useful interface than just returning a HoL.

Thanks,

Tony

package File::Find::Duplicates;

=head1 NAME

File::Find::Duplicates - Find duplicate files

=head1 SYNOPSIS

  use File::Find::Duplicates;

  my %dupes = find_duplicate_files('/basedir');

  local $" = "\n  ";
  foreach my $filesize (keys %dupes) {
    print "Duplicate files of size $filesize:\n  @{$dupes{$filesize}}\
+n";
  }

=head1 DESCRIPTION

This module provides a way of finding duplicate files on your system.

When passed a base directory (or list of such directories) it returns
a hash, keyed on filesize, of lists of the identical files of that size.
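
For example (the paths here are purely illustrative), the returned hash
might look something like this:

  (
    1024 => [ '/tmp/a/notes.txt', '/tmp/b/notes-copy.txt' ],
    2048 => [ '/home/tony/song.mp3', '/backup/song.mp3' ],
  )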

=head1 TODO

Provide some much more useful interfaces to this.

=head1 AUTHOR

Tony Bowden, tony@blackstar.co.uk

=head1 SEE ALSO

File::Find

=cut

use vars qw($VERSION @ISA @EXPORT %files);

require Exporter;

@ISA     = qw/Exporter/;
@EXPORT  = qw/find_duplicate_files/;
$VERSION = '0.02';

use strict;
use File::Find;
use Digest::MD5;

# File::Find callback: record each plain file under its size (in bytes).
sub check_file {
  -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}

sub find_duplicate_files {
  my %dupes;
  find(\&check_file, shift || ".");
  foreach my $size (sort {$b <=> $a} keys %files) {
    next unless @{$files{$size}} > 1;
    my %md5;
    # Group same-sized files by MD5 digest; only files sharing a digest
    # are true duplicates.
    foreach my $file (@{$files{$size}}) {
      open(FILE, '<', $file) or next;
      binmode(FILE);
      push @{$md5{Digest::MD5->new->addfile(*FILE)->hexdigest}}, $file;
      close(FILE);
    }
    foreach my $hash (keys %md5) {
      push @{$dupes{$size}}, @{$md5{$hash}} 
        if (@{$md5{$hash}} > 1);
    }
  }
  return %dupes;
}

"dissolving ... removing ... there is water at the bottom of the ocean
+";

Re: Find Duplicate Files
by merlyn (Sage) on Jan 04, 2001 at 23:48 UTC
    I don't like the package %files there. You can get rid of it by passing a closure to File::Find...
    sub find_duplicate_files {
      my %dupes;
      my %files;
      find sub {
        -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
      }, shift || ".";
      ....
    }
    Much cleaner. Easier to maintain. Smaller locality of reference. Blah blah blah. {grin}
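
    A fuller sketch of that closure-based version might look like the
    following (this is only a sketch: it assumes the module's existing
    use File::Find and Digest::MD5 lines, and keeps the same size/MD5
    grouping that the .... stands in for):

    sub find_duplicate_files {
      my %dupes;
      my %files;
      find sub { -f && push @{$files{(stat(_))[7]}}, $File::Find::name },
           shift || ".";
      foreach my $size (grep { @{$files{$_}} > 1 } keys %files) {
        my %md5;
        foreach my $file (@{$files{$size}}) {
          open(FILE, '<', $file) or next;
          binmode(FILE);
          push @{$md5{Digest::MD5->new->addfile(*FILE)->hexdigest}}, $file;
          close(FILE);
        }
        push @{$dupes{$size}}, @{$md5{$_}} for grep { @{$md5{$_}} > 1 } keys %md5;
      }
      return %dupes;
    }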

    -- Randal L. Schwartz, Perl hacker

Re: Find Duplicate Files
by kingman (Scribe) on Jul 04, 2002 at 20:56 UTC
    Hi, I wrote a command-line utility using your module that makes it easy to delete duplicate files.
#!/usr/bin/perl -w
use strict;
use File::Find::Duplicates;

$|++; # AutoFlush the Buffer

&usage if $#ARGV eq '-1';

my %dupes = find_duplicate_files(@ARGV);

die "No duplicates found!\n" unless keys %dupes;

print "############ Duplicate File Report & Removal Utility ############\n";

my $i = 1;
foreach my $fsize (keys %dupes) {
  print "#" x 64 . " " . $i++ . "\n";
  print map {
    -l $_
      ? "# push \@delete, '$_'; # symlinked to " . readlink($_) . "\n"
      : "# push \@delete, '$_';\n"
  } @{ $dupes{$fsize} };
  print "\n";
}
print "unlink \@delete;\n";

sub usage {
  (my $script_name = $0) =~ s#.*/##; # $0 = full path to script
  print <<END;
Generates a Report on Duplicate Files.
Usage: $script_name [List of Directories]
END
  exit;
}

### POD ###

=head1 Name

dupes - a command line utility to report on all duplicate files, even if
they have different names. This is good for mp3s and multiple drafts of
documents that may have been backed up in different places.

=head1 Synopsis

dupes [list of directories to search recursively]

=head1 From an empty buffer in Vim

The following commands will fill the buffer with a report of all duplicate
files.

  :%!dupes [list of directories]

B<or>

  !!dupes [list of directories]

The report generated by the above commands is yet another perl script that
can be edited, allowing you to flag certain files for removal.

The following command will run the report and remove all flagged files.

  :%!perl

Nothing is deleted unless you flag the file by uncommenting the line.

If you don't understand how the report works, the following commands should
explain it.

  perldoc -f push
  perldoc -f unlink

=head1 AUTHOR

Kingsley Gordon, E<lt>kingman@ncf.caE<gt>

last modified: Thu Jul 4 15:11:26 EDT 2002

=cut
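
For illustration, here is roughly what the generated report looks like (the
file names are made up); uncommenting a push line flags that copy for
deletion when the report is run back through perl:

  ############ Duplicate File Report & Removal Utility ############
  ################################################################ 1
  # push @delete, '/home/kingman/mp3/track01.mp3';
  # push @delete, '/home/kingman/backup/track01.mp3';

  unlink @delete;
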
      It would be nice if the script deleted the duplicates and then created a hard link from the original in place of each deleted file. That way you don't waste any space and you run no risk of breaking anything.
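
      A minimal sketch of that idea (the helper name is hypothetical, and it
      assumes both paths live on the same filesystem, since hard links cannot
      cross filesystems) might be:

      # Sketch: replace the duplicate with a hard link to the surviving copy.
      sub relink_duplicate {
          my ($original, $dupe) = @_;
          unlink $dupe          or die "Cannot unlink $dupe: $!";
          link $original, $dupe or die "Cannot link $original -> $dupe: $!";
      }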
