So I dug it out, read it, and realised how horrible it was. I was tempted to rewrite it, but instead I decided to google for "perl duplicate files" first. I found a couple of other scripts there, but they were pretty horrible too. In particular the first file there, which is basically a comparison between doing it in Perl vs. shell, computes a checksum hash of every file. So I decided I would indeed write my own, which turned out to be about 7 times faster than that one (which was in turn twice as fast as my original script):
#!/usr/bin/perl -w
use strict;
use File::Find;
use Digest::MD5;

my %files;
my $wasted = 0;

# Collect every file, bucketed by size.
find(\&check_file, $ARGV[0] || ".");

local $" = ", ";

foreach my $size (sort { $b <=> $a } keys %files) {
    # Only bother hashing when two or more files share a size.
    next unless @{$files{$size}} > 1;
    my %md5;
    foreach my $file (@{$files{$size}}) {
        open(FILE, $file) or next;
        binmode(FILE);
        push @{$md5{Digest::MD5->new->addfile(*FILE)->hexdigest}}, $file;
    }
    foreach my $hash (keys %md5) {
        next unless @{$md5{$hash}} > 1;
        print "$size: @{$md5{$hash}}\n";
        $wasted += $size * (@{$md5{$hash}} - 1);
    }
}

# Add thousands separators to the byte count.
1 while $wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "$wasted bytes in duplicated files\n";

sub check_file {
    -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}
Tony
Re: Find duplicate files.
by mwp (Hermit) on Jan 02, 2001 at 21:31 UTC
If the other monks here think it's solid and all, you should OO it and send it to the author of File::Find as File::Find::Duplicates. =)
Re: Find duplicate files.
by lemming (Priest) on Jun 02, 2001 at 20:42 UTC
Interesting. I just went through the similar problem of combining four computers' worth of archives. In some cases I had near-duplicates due to slight doc changes and the like, so I wanted a bit more information. I had a second program do the deletes (about 9,000 files). I couldn't go by dates, due to bad file management. Note that the file open uses the 3-arg version, since I had some badly named files such as ' ha'. I wish I could remember the name of the monk who pointed out the documentation for me.
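lemming's script itself was attached to the post via the [d/l] link rather than quoted here; the fragment below is not his code, just an illustration of why the three-argument open matters for a name like ' ha':

use strict;
use warnings;

my $file = ' ha';                      # badly named file with a leading space
open my $fh, '<', $file               # 3-arg open: $file is taken literally,
    or die "cannot open '$file': $!"; # so the leading space survives intact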
by bikeNomad (Priest) on Jun 02, 2001 at 21:13 UTC
The CRC32 found in Compress::Zlib runs 82% faster than Digest::MD5 on my system, using the following benchmark program:
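bikeNomad's actual benchmark went out via the [d/l] link and isn't reproduced here; a minimal sketch of such a CRC32-vs-MD5 comparison, using the core Benchmark module (the test file name is an assumption), might look like this:

#!/usr/bin/perl -w
# Sketch only, not bikeNomad's original benchmark program.
use strict;
use Benchmark qw(cmpthese);
use Digest::MD5;
use Compress::Zlib qw(crc32);

my $file = shift || 'testfile';        # hypothetical input file

cmpthese(-5, {
    md5 => sub {
        open my $fh, '<', $file or die "open '$file': $!";
        binmode $fh;
        Digest::MD5->new->addfile($fh)->hexdigest;
    },
    crc32 => sub {
        open my $fh, '<', $file or die "open '$file': $!";
        binmode $fh;
        my ($crc, $buf) = (0, '');
        $crc = crc32($buf, $crc) while read $fh, $buf, 65536;
    },
});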
by mimosinnet (Beadle) on Apr 02, 2012 at 14:13 UTC
I am new to Perl (and to writing code) and I have just been on an excellent course organized by Barcelona_pm. I have rewritten lemming's code as an exercise in using Moose. To improve speed, following the suggestions above, files with the same size are identified first and the MD5 value is then calculated only for those files. Because this is baby-code, please feel free to recommend any RTFM $manual that I should review to improve the code. Thanks for this great language! (I have to thank Alba from Barcelona_pm for suggestions on how to improve the code.) This is the definition of the object "FileDups":
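mimosinnet's class was posted via the [d/l] links rather than inline; the sketch below is not that code, only a guess at what a minimal Moose-based FileDups object along those lines could look like (the attribute names are assumptions):

package FileDups;
use Moose;
use Digest::MD5;

# Path is required; size and md5 are lazy, so the digest is only
# computed when something actually asks for it.
has 'name' => (is => 'ro', isa => 'Str', required => 1);
has 'size' => (is => 'ro', isa => 'Int', lazy => 1,
               default => sub { -s $_[0]->name || 0 });
has 'md5'  => (is => 'ro', isa => 'Str', lazy => 1, builder => '_build_md5');

sub _build_md5 {
    my $self = shift;
    open my $fh, '<', $self->name or return 'unreadable';
    binmode $fh;
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

__PACKAGE__->meta->make_immutable;
1;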
And this is the main package that lists duplicate files, big files and unread files.
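Again, the original main package was attached via [d/l]; a rough, hypothetical driver for the sketch above (duplicate listing only, without the big-file and unread-file reports) might be:

package main;
use strict;
use warnings;
use File::Find;

my %by_size;
find({ no_chdir => 1, wanted => sub {
    return unless -f $File::Find::name;
    my $obj = FileDups->new(name => $File::Find::name);
    push @{ $by_size{ $obj->size } }, $obj;    # bucket by size first
}}, shift @ARGV || '.');

for my $size (sort { $b <=> $a } keys %by_size) {
    next unless @{ $by_size{$size} } > 1;      # unique size => no duplicate
    my %by_md5;
    push @{ $by_md5{ $_->md5 } }, $_ for @{ $by_size{$size} };
    for my $digest (grep { $_ ne 'unreadable' } keys %by_md5) {
        my @dups = @{ $by_md5{$digest} };
        print "$size: ", join(', ', map { $_->name } @dups), "\n" if @dups > 1;
    }
}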
by Anonymous Monk on Oct 10, 2008 at 03:36 UTC
Re: Find duplicate files.
by grinder (Bishop) on Feb 26, 2001 at 19:56 UTC
Interesting. I wrote my own that does pretty much the same thing, but in a different way: I only use one hash, so I suspect it will use less memory (but see the response below for the final word).
It is very verbose, but that's because I pipe the output into something that can be handed off to users in a spreadsheet so that they can do their own housekeeping (2GB of duplicates in 45GB of files...). BTW, you can also save a smidgen of memory by using the digest() method rather than the hexdigest() method, since the value is not intended for human consumption.
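grinder's script itself went out via the [d/l] link; as a small illustration of the digest() point (the file handling here is made up for the example), raw digests make shorter hash keys:

use strict;
use warnings;
use Digest::MD5;

my %dups;
my $file = shift @ARGV or die "usage: $0 <file>\n";
open my $fh, '<', $file or die "open '$file': $!";
binmode $fh;
my $key = Digest::MD5->new->addfile($fh)->digest;   # 16 raw bytes
push @{ $dups{$key} }, $file;                        # hexdigest() would need 32 chars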
by Anonymous Monk on Jun 02, 2001 at 15:29 UTC
The first script will only do MD5 hashes on files if there is more than one file with the same file size, then compares the MD5s for the files of that size. Yours MD5's *everything*, then compares *all* the MD5s. If a file has a unique file size, it *can't* have a duplicate. Depending on the make-up of the files, this can have a dramatic effect:
Results: The second script is four times slower than the first... Admittedly, if all your files were the same size there would be no difference, but in most cases, the first script will win. But hey...
by sarabob (Novice) on Jun 02, 2001 at 20:16 UTC
The code for script three is below. Benchmarks:
Scenario 1: a bundle of MP3 files

- script one (original: MD5 hash files with the same file size)
- script two (MD5 hash calculated on *all* files)
- fdupes (C program from freshmeat.net; uses the same algorithm as script two)
- script three (see below: read up to the first \n char for an initial check, then read the whole file in for the full check; no MD5s calculated at all)

Yes, that *is* 48 seconds rather than 5 or 17 minutes. This is because script 3 reads the first line in as a comparison first; creating an MD5 hash requires that the whole file be read in.
Scenario 2: home directory

- script one results
- script two results
- fdupes (C program from freshmeat.net; uses the same algorithm as script two)
- script three results (note: fewer duplicates found by script three, as it skips all the small files < 100 bytes)

The third script is slower than the first in this situation, as it must do multiple compares (i.e. a with b, a with c, a with d) rather than using the MD5 hashing technique. It would be even slower if we counted small files (timed at around 23 seconds). Both 1 and 3 are still *much* faster than 2, though. The fdupes benchmarks are just in there for comparison, to show how a bad algorithm can slow down a fast language.

Also note that not using MD5 hashes means I suffer if there are three or more identical, large files, but I wanted to be *absolutely* sure not to get any false positives, and MD5 hashing doesn't (quite) do that. So I do a byte-for-byte comparison between possible pairs. There is almost certainly another way: we could do two passes using the MD5 technique, creating MD5 hashes for the first (say) 200 bytes of each file in the first pass, then MD5-ing the whole file if the first ones match. This should give us good performance on both large numbers of duplicated small files *and* small numbers of duplicates of large files. But that's something for another day, and I somehow *prefer* to do byte-by-byte checks. Paranoia, I guess.

Anyway, here's the code... fdupes.pl (usage: fdupes.pl <start dir>):
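The attached fdupes.pl didn't survive the copy; what follows is only a rough sketch, written for this note, of the approach described above (size buckets, skip files under 100 bytes, first-line pre-check, then a full byte-for-byte compare), not sarabob's original:

#!/usr/bin/perl -w
# Sketch of the described algorithm, not the original fdupes.pl.
use strict;
use File::Find;

my %by_size;
find({ no_chdir => 1, wanted => sub {
    return unless -f $File::Find::name;
    my $size = -s _;
    return if $size < 100;                      # skip small files, as noted above
    push @{ $by_size{$size} }, $File::Find::name;
}}, shift @ARGV || '.');

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next unless @files > 1;                     # unique size => no duplicate
    for my $i (0 .. $#files - 1) {
        for my $j ($i + 1 .. $#files) {
            print "$size: $files[$i] == $files[$j]\n"
                if same_first_line($files[$i], $files[$j])
                && same_contents($files[$i], $files[$j]);
        }
    }
}

# Cheap pre-check: compare only up to the first newline of each file.
sub same_first_line {
    my @first;
    for my $name (@_) {
        open my $fh, '<', $name or return 0;
        push @first, scalar <$fh>;
    }
    return defined $first[0] && defined $first[1] && $first[0] eq $first[1];
}

# Full check: slurp both files and compare byte for byte.
sub same_contents {
    my @data;
    for my $name (@_) {
        open my $fh, '<', $name or return 0;
        binmode $fh;
        local $/;
        push @data, scalar <$fh>;
    }
    return $data[0] eq $data[1];
}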
by wazoox (Prior) on Sep 01, 2014 at 20:20 UTC