Recursive search for duplicate files

by props (Hermit)
on Nov 27, 2007 at 13:27 UTC [id://653215]

props has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am writing a script which aims to search through a directory tree and report duplicate folders and files. At the moment I'm stuck, and I hope you can point me in the right direction. I can struggle with the syntax as long as I have a sort of pseudo code:
use warnings;
use strict;
use File::Find;

my $directory = '/mnt/music/very-good';
find(\&wanted, $directory);

my %path_file;

sub wanted {
    my $path     = $File::Find::dir;
    my $filename = $File::Find::name;
    $hash{$path} = $filename;
}

my %count;
while (my ($key, $value) = each(%path_file)) {
    $count{$key} += 1;
}

Replies are listed 'Best First'.
Re: Recursive search for duplicate files
by moritz (Cardinal) on Nov 27, 2007 at 13:34 UTC
    What are "duplicate" files for you? Files with the same basename? Or with with identical contents?

    It seems that you are using too many hashes that are never properly set up: %hash is used but never declared, and %path_file is declared but never filled.

Re: Recursive search for duplicate files
by hawtin (Prior) on Nov 27, 2007 at 14:00 UTC
Re: Recursive search for duplicate files
by sh1tn (Priest) on Nov 27, 2007 at 13:35 UTC
    A much better way is to use MD5 digests for file comparison.


      If used naively, that doesn't work out well for large files, because they have to be read from disc entirely.

      If you care about performance, you might just want to hash the first 5% (or the first 1k or whatever) and see if there are any collisions, and if there are you can still look at the entire file.

        Agreed. Another performance-minded measure is to compare file sizes before anything else.
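
      Putting the suggestions in this sub-thread together, here is a minimal sketch (not code from the thread): compare file sizes first, then MD5 digests of only the first 1 KB, and hash the whole files only when those prefixes collide. The subroutine names and the 1 KB cut-off are arbitrary choices.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Digest::MD5;

      # True if the two files appear to have identical contents.
      sub same_contents {
          my ($file_a, $file_b) = @_;

          # Cheapest check first: different sizes can never be duplicates.
          return 0 if -s $file_a != -s $file_b;

          # Next, hash only the first 1 KB of each file.
          return 0 if digest_of($file_a, 1024) ne digest_of($file_b, 1024);

          # Only if the prefixes collide, hash the whole files.
          return digest_of($file_a) eq digest_of($file_b);
      }

      # MD5 hex digest of a file; if $limit is given, only that many
      # leading bytes are hashed.
      sub digest_of {
          my ($file, $limit) = @_;
          open my $fh, '<', $file or die "Cannot open $file: $!";
          binmode $fh;
          my $md5 = Digest::MD5->new;
          if (defined $limit) {
              my $buf = '';
              read $fh, $buf, $limit;
              $md5->add($buf);
          }
          else {
              $md5->addfile($fh);
          }
          close $fh;
          return $md5->hexdigest;
      }

      print same_contents($ARGV[0], $ARGV[1]) ? "same\n" : "different\n";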


Re: Recursive search for duplicate files
by props (Hermit) on Nov 27, 2007 at 14:02 UTC
    Well, file and folder names are sufficient.

    use warnings;
    use strict;
    use File::Find;

    my $directory = '/mnt/music/very-good';
    find(\&wanted, $directory);

    my %path_file;

    sub wanted {
        my $path     = $File::Find::dir;
        my $filename = $File::Find::name;
        $path_file{$path} = $filename;

        my %count;
        while (my ($key, $value) = each(%path_file)) {
            $count{$key} += 1;
        }
    }
      There you go: the idea is to store, for each filename, the paths in which it occurs.
      #!/usr/bin/perl
      use warnings;
      use strict;
      use File::Find;
      use File::Spec;

      my $directory = shift @ARGV || '/mnt/music/very-good';

      my %path_file;
      find(\&wanted, $directory);

      sub wanted {
          my $path     = $File::Find::dir;
          my $filename = File::Spec->abs2rel($File::Find::name, $path);
          push @{ $path_file{$filename} }, $path;
      }

      while (my ($filename, $paths) = each %path_file) {
          if (scalar @$paths >= 2) {
              print "$filename occurs in these paths: ", join(", ", @$paths), "\n";
          }
          # else {
          #     print "$filename is unique\n";
          # }
      }
        I have a question: since %path_file is a hash, why does it become an array here:
        push @{$path_file{$filename}}, $path;
        Are we dereferencing the array @path_file or the hash %path_file?
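
        For illustration only (a sketch, not code from the thread; the file and directory names are made up): %path_file stays a hash, but each of its values is a reference to an array. The @{ ... } dereferences that stored array reference, which autovivifies on the first push; there is no separate array @path_file involved.

        use strict;
        use warnings;
        use Data::Dumper;

        my %path_file;

        # The hash value $path_file{'song.mp3'} holds an array reference;
        # @{ ... } dereferences it, and the array springs into existence
        # (autovivifies) on the first push.
        push @{ $path_file{'song.mp3'} }, '/mnt/music/very-good/a';
        push @{ $path_file{'song.mp3'} }, '/mnt/music/very-good/b';

        print Dumper(\%path_file);
        # $VAR1 = {
        #           'song.mp3' => [
        #                           '/mnt/music/very-good/a',
        #                           '/mnt/music/very-good/b'
        #                         ]
        #         };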
Re: Recursive search for duplicate files
by strat (Canon) on Nov 28, 2007 at 08:00 UTC

    If you want to compare files by content (and not by name), you could have a look at the search and compare algorithm of http://www.fabiani.net/ -> Perl -> Downloads -> FindDuplicateFiles. It only runs under Win32 (because I only need it under Win32), but it is easy to port.
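
    As a rough, portable sketch of that content-based idea (not strat's actual program; the default directory is just a placeholder): group every file under a tree by its MD5 digest and report any digest shared by more than one path.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    my $directory = shift @ARGV || '.';
    my %by_digest;    # digest => [ paths whose contents are identical ]

    find(sub {
        return unless -f $_;                # plain files only
        open my $fh, '<', $_ or return;     # $_ is the name inside the current dir
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        push @{ $by_digest{$digest} }, $File::Find::name;
    }, $directory);

    for my $paths (values %by_digest) {
        next unless @$paths > 1;
        print "Duplicate contents:\n  ", join("\n  ", @$paths), "\n";
    }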

    Best regards,
    perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"
