Recursive search for duplicate files

by props (Hermit)
on Nov 27, 2007 at 13:27 UTC [id://653215]

props has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am writing a script which aims to search through a directory tree and report duplicate folders and files. At the moment I'm stuck, and I hope you can point me in the right direction. I can struggle with the syntax as long as I have a sort of pseudo code:
use warnings;
use strict;
use File::Find;

my $directory = '/mnt/music/very-good';
find(\&wanted, $directory);

my %path_file;

sub wanted {
    my $path     = $File::Find::dir;
    my $filename = $File::Find::name;
    $hash{$path} = $filename;
}

my %count;
while (my ($key, $value) = each(%path_file)) {
    $count{$key} += 1;
}

Replies are listed 'Best First'.
Re: Recursive search for duplicate files
by moritz (Cardinal) on Nov 27, 2007 at 13:34 UTC
    What are "duplicate" files for you? Files with the same basename? Or with with identical contents?

    It seems that you are using too many hashes that are never properly set up: %hash is used but never declared, and %path_file is declared but never filled.

Re: Recursive search for duplicate files
by hawtin (Prior) on Nov 27, 2007 at 14:00 UTC
Re: Recursive search for duplicate files
by sh1tn (Priest) on Nov 27, 2007 at 13:35 UTC
    A much better way is to use MD5 digests for file comparison.


      If used naively, that doesn't work out well for large files, because they have to be read from disc entirely.

      If you care about performance, you might just want to hash the first 5% (or the first 1k or whatever) and see if there are any collisions, and if there are you can still look at the entire file.

        Agreed. Another performance-minded measure is to compare file sizes before anything else.
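
      Putting the suggestions in this sub-thread together, here is a minimal sketch (not code from the thread): compare file sizes first, then MD5 digests of only the first 1 KB, and hash the whole files only when those prefixes collide. The subroutine names and the 1 KB cut-off are arbitrary choices.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Digest::MD5;

      # True if the two files appear to have identical contents.
      sub same_contents {
          my ($file_a, $file_b) = @_;

          # Cheapest check first: different sizes can never be duplicates.
          return 0 if -s $file_a != -s $file_b;

          # Next, hash only the first 1 KB of each file.
          return 0 if digest_of($file_a, 1024) ne digest_of($file_b, 1024);

          # Only if the prefixes collide, hash the whole files.
          return digest_of($file_a) eq digest_of($file_b);
      }

      # MD5 hex digest of a file; if $limit is given, only that many
      # leading bytes are hashed.
      sub digest_of {
          my ($file, $limit) = @_;
          open my $fh, '<', $file or die "Cannot open $file: $!";
          binmode $fh;
          my $md5 = Digest::MD5->new;
          if (defined $limit) {
              my $buf = '';
              read $fh, $buf, $limit;
              $md5->add($buf);
          }
          else {
              $md5->addfile($fh);
          }
          close $fh;
          return $md5->hexdigest;
      }

      print same_contents($ARGV[0], $ARGV[1]) ? "same\n" : "different\n";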


Re: Recursive search for duplicate files
by props (Hermit) on Nov 27, 2007 at 14:02 UTC
    Well, file and folder names are sufficient.

    use warnings;
    use strict;
    use File::Find;

    my $directory = '/mnt/music/very-good';
    find(\&wanted, $directory);

    my %path_file;

    sub wanted {
        my $path     = $File::Find::dir;
        my $filename = $File::Find::name;
        $path_file{$path} = $filename;

        my %count;
        while (my ($key, $value) = each(%path_file)) {
            $count{$key} += 1;
        }
    }
      There you go: the idea is to store, for each filename, the paths in which it occurs.
      #!/usr/bin/perl
      use warnings;
      use strict;
      use File::Find;
      use File::Spec;

      my $directory = shift @ARGV || '/mnt/music/very-good';

      my %path_file;
      find(\&wanted, $directory);

      sub wanted {
          my $path     = $File::Find::dir;
          my $filename = File::Spec->abs2rel($File::Find::name, $path);
          push @{ $path_file{$filename} }, $path;
      }

      while (my ($filename, $paths) = each %path_file) {
          if (scalar @$paths >= 2) {
              print "$filename occurs in these paths: ", join(", ", @$paths), "\n";
          }
          # else {
          #     print "$filename is unique\n";
          # }
      }
        I have a question: since %path_file is a hash, why does it become an array here:
        push @{$path_file{$filename}}, $path;
        Are we dereferencing the array @path_file or the hash %path_file?
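
        For illustration only (a sketch, not code from the thread; the file and directory names are made up): %path_file stays a hash, but each of its values is a reference to an array. The @{ ... } dereferences that stored array reference, which autovivifies on the first push; there is no separate array @path_file involved.

        use strict;
        use warnings;
        use Data::Dumper;

        my %path_file;

        # The hash value $path_file{'song.mp3'} holds an array reference;
        # @{ ... } dereferences it, and the array springs into existence
        # (autovivifies) on the first push.
        push @{ $path_file{'song.mp3'} }, '/mnt/music/very-good/a';
        push @{ $path_file{'song.mp3'} }, '/mnt/music/very-good/b';

        print Dumper(\%path_file);
        # $VAR1 = {
        #           'song.mp3' => [
        #                           '/mnt/music/very-good/a',
        #                           '/mnt/music/very-good/b'
        #                         ]
        #         };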
Re: Recursive search for duplicate files
by strat (Canon) on Nov 28, 2007 at 08:00 UTC

    If you want to compare files by content (and not by name), you could have a look at the search and compare algorithm of http://www.fabiani.net/ -> Perl -> Downloads -> FindDuplicateFiles. It only runs under Win32 (because I only need it under Win32), but it is easy to port.
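
    As a rough, portable sketch of that content-based idea (not strat's actual program; the default directory is just a placeholder): group every file under a tree by its MD5 digest and report any digest shared by more than one path.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    my $directory = shift @ARGV || '.';
    my %by_digest;    # digest => [ paths whose contents are identical ]

    find(sub {
        return unless -f $_;                # plain files only
        open my $fh, '<', $_ or return;     # $_ is the name inside the current dir
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        push @{ $by_digest{$digest} }, $File::Find::name;
    }, $directory);

    for my $paths (values %by_digest) {
        next unless @$paths > 1;
        print "Duplicate contents:\n  ", join("\n  ", @$paths), "\n";
    }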

    Best regards,
    perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"
