Scanning for duplicate files

by Amoe (Friar)
on Sep 12, 2001 at 21:10 UTC

Amoe has asked for the wisdom of the Perl Monks concerning the following question:

Hey. I'm currently making a script to read a directory, store the file size and name, scan for any duplicates, and delete one of each duplicate pair. I came up with this code, which is supposed to print out the sizes of duplicates (very rough code):
#!/usr/bin/perl

use strict;
use warnings;

my $dir = shift();
my @images;

opendir(DIR, $dir) or die("Couldn't open dir $dir: $!");
foreach my $file_found (readdir(DIR)) {
    my %image;
    $image{name} = $file_found;
    $image{size} = (stat("$dir/$file_found"))[7];
    push @images, \%image;
}
closedir(DIR);

my $previous = 'bo mix';
my @duplicates = grep $_ eq $previous
                   && ($_ = %{$_})
                   && ($_ = $_{size})
                   && ($previous = $_), @images;
print join ', ', @duplicates;
That produces no output on a dir that I know contains files with duplicate sizes. Not surprising, considering how absolutely horrible that grep feels. But is there a way to do that grep in a better (and working) way? Because I think the code's okay up to there.
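
(One working way to express that check, assuming the @images array of hashrefs built above, is to count the sizes in a first pass and then grep for repeats; a sketch only:)

# Sketch only: count how often each size occurs, then keep the entries
# whose size appears more than once.
my %size_count;
$size_count{ $_->{size} }++ for @images;
my @duplicates = grep { $size_count{ $_->{size} } > 1 } @images;
print join ', ', map { "$_->{name} ($_->{size})" } @duplicates;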

Replies are listed 'Best First'.
Re: Scanning for duplicate files
by demerphq (Chancellor) on Sep 12, 2001 at 23:08 UTC
    Hi Amoe.

    This isn't a direct answer to your question, but I have had this problem before and wrote the following to sort it out.

    It uses MD5 signatures to find duplicates regardless of the filename, size, or time. It's pretty fast as well.

    Oh, it's a bit primitive, sorry; I wrote it soon after I started learning Perl.

    use warnings;
    use strict;
    use Digest::MD5;
    use File::Find;

    $| = 1;    # Autoflush ON!

    my @list;
    my %dupes;
    my @delete;
    my %digests;
    my $ctx = Digest::MD5->new;

    sub check_file {
        my $file = shift;
        $ctx->reset;
        open FILE, $file or die "Can't open $file!\n";
        binmode FILE;
        $ctx->addfile(*FILE);
        close FILE;
        my $digest = $ctx->hexdigest;
        if (exists($digests{$digest})) {
            print "\t$file is a dupe!\n";
            $dupes{$digest}->{$file} = 1;
            push @delete, $file;
        }
        else {
            $digests{$digest} = $file;
        }
    }

    # CHANGE ME!!!
    my $path = 'D:/Development/Perl/';

    print "I am going to look for duplicates starting at " . $path . "\n";
    find({
        wanted   => sub { if (-f $_) { check_file($_) } else { print "Searching $_\n" } },
        no_chdir => 1,
    }, $path);
    print "There are " . @delete . " duplicate files to delete.\n";
    # Uncomment the line below to lose the duplicates!
    # print "Deleted " . unlink(@delete) . " files!";

    Yves
    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Re: Scanning for duplicate files
by chromatic (Archbishop) on Sep 12, 2001 at 21:46 UTC
    Push each file name onto an anonymous array keyed on size, then look for hash keys with multiple values. Something like:
    my %files;
    foreach my $file_found (readdir(DIR)) {
        # next if $file_found =~ /^\.{1,2}\z/;
        my $size = (stat("$dir/$file_found"))[7];
        $files{$size} ||= [];
        push @{ $files{$size} }, $file_found;
    }

    foreach my $same_size (values %files) {
        next if @$same_size == 1;
        print join(', ', @$same_size);
    }
    Untested. Are you feeling lucky?
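    Since the goal is to delete one of each duplicate pair, the same %files hash of arrays could feed an unlink, keeping the first name seen for each size. A sketch only, assuming $dir is still in scope and that matching sizes are really a safe enough test:

    foreach my $same_size (values %files) {
        next if @$same_size == 1;
        # keep the first file of this size, remove the rest
        my ($keep, @extras) = @$same_size;
        unlink map { "$dir/$_" } @extras
            or warn "Couldn't delete some of: @extras ($!)";
    }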
Re: Scanning for duplicate files
by kjherron (Pilgrim) on Sep 12, 2001 at 21:57 UTC
    If you're just looking for files having the same size, this example may be sufficient.
    ...
    my %image;
    opendir(DIR, $dir) or die("Couldn't open dir $dir: $!");
    foreach my $file (readdir(DIR)) {
        my $size = -s "$dir/$file";    # Simpler size code
        if (exists($image{$size})) {
            handle_duplicate($image{$size}, $file);
        } else {
            $image{$size} = $file;
        }
    }
    If you want to do something more sophisticated, you may find it easier to build a list of potential duplicates and then postprocess each list in a second loop. The following code will build a hash of arrays, each array containing the filenames that had the same size. It then calls a checking function on each list containing more than one file name:
    ...
    foreach my $file (readdir(DIR)) {
        my $size = -s "$dir/$file";
        push @{ $image{$size} }, $file;
    }
    closedir(DIR);

    foreach my $list (values %image) {
        handle_duplicates(@{$list}) if (@{$list} > 1);
    }
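    handle_duplicates is left open above; a minimal, hypothetical version that simply reports the clash could look like this:

    sub handle_duplicates {
        # Hypothetical helper: just report files that share a size.
        my @files = @_;
        print "Same size: ", join(', ', @files), "\n";
    }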
Re: Scanning for duplicate files
by Zaxo (Archbishop) on Sep 12, 2001 at 23:52 UTC

    I think size is too loose a selector for duplicate files. Coincidence is more likely than you may think for a collection of files with a common format, stereotyped content, or small size. Since you want to unlink dupes, it would be advisable to play it safe.

    An MD5 digest is a better indicator. Here is one way to use it:

    my %cksums;
    # key on the digest alone (md5sum also prints the filename, so strip it off)
    push @{ $cksums{ (split ' ', `md5sum "$_"`)[0] } }, $_ for glob("$dir/*");
    # for each digest seen more than once, unlink everything after the first file
    unlink(splice @{ $cksums{$_} }, 1) or die $! for grep { @{ $cksums{$_} } > 1 } keys %cksums;
    This is fairly idiomatic. The first two statements construct a hash of arrays; the arrays contain the filenames, indexed by checksum. For each digest that shows up more than once, we unlink the list of extra files pruned off by splice.
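    The same hash-of-arrays idea can also be written with the Digest::MD5 module instead of shelling out to md5sum. A sketch, assuming $dir holds the directory to scan and that only plain files matter:

    use Digest::MD5;

    my %by_digest;
    for my $file (glob("$dir/*")) {
        next unless -f $file;
        open my $fh, '<', $file or die "Can't open $file: $!";
        binmode $fh;
        # identical content lands under the same digest key
        push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
        close $fh;
    }
    # keep the first file for each digest, unlink the rest
    for my $dupes (grep { @$_ > 1 } values %by_digest) {
        unlink @{$dupes}[ 1 .. $#$dupes ] or die $!;
    }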

    After Compline,
    Zaxo

Re: Scanning for duplicate files
by hopes (Friar) on Sep 13, 2001 at 08:54 UTC
    If your code is correct until the 'grep' (and I think it is), you should have an array of hashes like this:
    @images = (
        { Name, 'a', Size, 2 },
        { Name, 'b', Size, 3 },
        { Name, 'c', Size, 2 },
        { Name, 'd', Size, 3 },
    );
    You can verify which files have the same size as others (the list of duplicate files) with this grep:
    my %c;
    my @duplicates = grep { $c{ $_->{Size} }++ && $c{ $_->{Size} } > 1 } @images;
    I consider a file to be a duplicate when I have already seen another with the same size (I check only the size).
    And this is only to check the result:
    for (@duplicates) {
        print join " ", values %$_, "\n";
    }
    Note that I've changed 'name' to 'Name' and 'size' to 'Size', because -w warns: 'Unquoted string "name" may clash with future reserved word'.

    I hope it can help you
    Hopes
