Beefy Boxes and Bandwidth Generously Provided by pair Networks Joe
Perl Sensitive Sunglasses
 
PerlMonks

Remove Duplicate Files

by jfroebe (Vicar)
 | Log in | Create a new user | The Monastery Gates | Super Search | 
 | Seekers of Perl Wisdom | Meditations | PerlMonks Discussion | 
 | Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | Tutorials | 
 | Poetry | Recent Threads | Newest Nodes | Donate | What's New | 

on Oct 29, 2004 at 02:38 UTC ( #403580=sourcecode: print w/ replies, xml ) Need Help??

Category: Utility Scripts
Author/Contact Info

Jason L. Froebe
jason@froebe.net

Description:

Searches a list of directories provided on the command line and removes duplicates. It remembers previous runs (compressed delimited file) and is able to remove 'cache' entries that point to nonexistant files.

A summary is printed

Loaded 93031 entries

TOTAL files: 93030
Added files: 0
Deleted files: 0
Files not found: 0

#!/usr/bin/perl

use warnings;
use strict;

use Digest::MD5;
use File::Find;
use PerlIO::gzip;

use vars qw/ $md5_file_ref $filename_md5_ref 
         $file_count $file_add $file_del 
         $file_lost /;

sub find_found;
sub load_md5 ($);
sub save_md5 ($);

sub save_md5 ($) {
 my $FILENAME = shift;

 my $FILE = ();
 open $FILE, ">:gzip", $FILENAME;

 foreach (keys %$md5_file_ref) {
   printf $FILE "%s|||%s\n", $_, $md5_file_ref->{$_};
 }
 close $FILE;
}

sub load_md5 ($) {
 my $FILENAME = shift;

 my $FILE = ();
 my $count = 1;

 open $FILE, "<:gzip", $FILENAME;

 while (<$FILE>) {
   chomp;
   my ($tmp_col1, $tmp_col2) = split '\|\|\|', $_;
   $md5_file_ref->{$tmp_col1} = $tmp_col2;
   $filename_md5_ref->{$tmp_col2} = $tmp_col1;
   $count++;
 }
 close $FILE;
 print "Loaded $count entries\n";
}

sub verify_files {
 foreach my $FILE (keys %$filename_md5_ref) {
   unless (-f $FILE) {
    my $md5 = $filename_md5_ref->{$FILE};
    delete $md5_file_ref->{$md5};
    delete $filename_md5_ref->{$FILE};
    print "   *** Not found: $FILE\n";
    $file_lost++;
   }
 }
}

sub find_found {
 my $FILE = $_;

 my $file_md5 = ();

 if ( -r $FILE && -f $FILE) {
   unless ($filename_md5_ref->{$FILE}) {
     open(FILE, $FILE)
       or return;
     binmode(FILE);
     $file_md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
     close(FILE);
     if ( $md5_file_ref->{$file_md5}) {
       chmod(0666, $FILE);
       if (unlink $FILE) {
         print "\n",
           "     *** DELETING ***\n",
           "   Duplicate file: $FILE\n",
           "     *** DELETING ***\n\n";
         $file_del++;
       } else {
         warn "Unable to delete $FILE\n\n";
       }
      } else {
       print "Added $file_md5 $FILE\n";
       $md5_file_ref->{$file_md5} = $FILE;
       $filename_md5_ref->{$FILE} = $file_md5;
       $file_add++;
     }
  }
  $file_count++;
 }
}

#####################
#####################
#####################

$file_count = 0;
$file_del = 0;
$file_add = 0;
$file_lost = 0;

my $FILE = "dups.csv.gz";

if ( -r $FILE && -f $FILE) {
 load_md5($FILE);
 verify_files;
}

find {
 bydepth         => 1,
 no_chdir        => 1,
 wanted          => \&find_found
} => @ARGV;

print "\nTOTAL files: $file_count\n";
print "  Added files: $file_add\n";
print "  Deleted files: $file_del\n";
print "  Files not found: $file_lost\n\n";
save_md5($FILE);

Comment on Remove Duplicate Files
Download Code
Re: Remove Duplicate Files
by jfroebe (Vicar) on Oct 29, 2004 at 02:47 UTC

    I guess you could run "dups.pl /home/jason /home/jason" in lieu of "rm -rf /home/jason"... ;-)

    Jason L. Froebe

    Team Sybase member

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

[reply]
Re: Remove Duplicate Files
by gaal (Parson) on Oct 29, 2004 at 07:07 UTC
    MD5 collisions are rare, but they can happen. If you want to be really safe, your storage should not just keep track of seen hashes; it should make them the key of a list of files that have those hashes. Then when you detect a seen hash, you should byte-compare the new file with all the existing files on that list.

    This, of course, is slower, adds complexity, and will rarely be useful; but personally, I want code that deletes files automatically to be correct!

[reply]
      Agreed. You can make it a lot more efficient by stat()ing all the files and only bothering to compare the contents of those which are the same size. Another small improvement can come from noting that those with the same device number and inode number are guaranteed to be the same so no need to compare their contents, although this may not be portable to non-Unixy platforms.

      You should also be careful about how you compare symlinks and device files.

[reply]
[d/l]
        Then again, hardlinks are less of a concern for cleanup, because they don't waste disk space.
[reply]
        And further improvement can be made by reading in just the first 1024 bytes or so, and calculate the md5 from that. Only if those match, you do a full comparison.
[reply]
Re: Remove Duplicate Files
by ihb (Deacon) on Oct 30, 2004 at 20:25 UTC
[reply]

Back to Code Catacombs

Login:
Password
remember me
What's my password?
Create A New User

Node Status
node history
Node Type: sourcecode [id://403580]
help
Community Ads
Chatterbox
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users
Others exploiting the Monastery: (10)
GrandFather
atcroft
herveus
Eyck
biohisham
Haarg
matze77
lamprecht
gnosti
im2
As of 2009-11-21 09:02 GMT
Sections
The Monastery Gates
Seekers of Perl Wisdom
Meditations
PerlMonks Discussion
Categorized Q&A
Tutorials
Obfuscated Code
Perl Poetry
Cool Uses for Perl
Perl News
Information
PerlMonks FAQ
Guide to the Monastery
What's New at PerlMonks
Voting/Experience System
Tutorials
Reviews
Library
Perl FAQs
Other Info Sources
Find Nodes
Nodes You Wrote
Super Search
List Nodes By Users
Newest Nodes
Recently Active Threads
Selected Best Nodes
Best Nodes
Worst Nodes
Saints in our Book
Leftovers
The St. Larry Wall Shrine
Offering Plate
Awards
Craft
Snippets Section
Code Catacombs
Quests
Editor Requests
Buy PerlMonks Gear
PerlMonks Merchandise
Planet Perl
Perlsphere
Use Perl
Perl.com
Perl 5 Wiki
Perl Jobs
Perl Mongers
Perl Directory
Perl documentation
CPAN
Random Node
Voting Booth

Future historians will find that the material characteristic of the current era is...

Aluminium
Plastic
Oil
Water
Carbon dioxide
Copper
Iron
Silicon
Salt
Uranium
Hydrogen
Other

Results (729 votes), past polls