
Re^2: Remove Duplicate Files

by DrHyde (Prior)
on Oct 29, 2004 at 08:27 UTC ( [id://403630]=note )

in reply to Re: Remove Duplicate Files
in thread Remove Duplicate Files

Agreed. You can make it a lot more efficient by stat()ing all the files and comparing the contents only of those that are the same size. Another small improvement: files with the same device number and inode number are guaranteed to be the same file, so there is no need to compare their contents, although this may not be portable to non-Unixy platforms.
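A minimal sketch of the size and inode filtering described above, not anyone's actual implementation. The sub name candidate_groups is hypothetical, and the code assumes that device and inode numbers are meaningful on the platform; where the inode comes back as 0, the hard-link check is simply skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of paths, return groups (array refs)
# of paths that share a size and so are worth comparing by content.
# A path whose device:inode pair was already seen is a hard link to a
# file we already have, so it is dropped without any content check.
sub candidate_groups {
    my @paths = @_;
    my (%by_size, %seen_inode);

    for my $path (@paths) {
        my @st = stat($path) or next;          # skip paths we cannot stat
        my ($dev, $ino, $size) = @st[0, 1, 7];

        # Assumption: a non-zero inode identifies the file on this device.
        next if $ino && $seen_inode{"$dev:$ino"}++;

        push @{ $by_size{$size} }, $path;
    }

    # A size held by only one file cannot have a duplicate.
    return grep { @$_ > 1 } values %by_size;
}
```

Each returned group still needs a real content comparison; files of the same size are only candidates, not confirmed duplicates.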

You should also be careful about how you compare symlinks and device files.
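One cautious way to handle this, sketched here as an assumption rather than as the approach the original script took, is to leave anything that is not a plain regular file alone. The sub name regular_files_only is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical filter: keep only paths that are plain regular files.
# lstat() looks at the link itself rather than its target, so symlinks
# are rejected here instead of being followed.
sub regular_files_only {
    my @keep;
    for my $path (@_) {
        lstat($path) or next;    # missing or inaccessible: skip
        next if -l _;            # symbolic link
        next unless -f _;        # device file, FIFO, socket, directory, ...
        push @keep, $path;
    }
    return @keep;
}
```

Running the candidate list through a filter like this before any comparison means a symlink is never mistaken for a duplicate of its own target, and device files are never opened.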

Replies are listed 'Best First'.
Re^3: Remove Duplicate Files
by Anonymous Monk on Oct 29, 2004 at 09:34 UTC
    A further improvement can be made by reading in just the first 1024 bytes or so and calculating the MD5 from that. Only if those match do you do a full comparison.
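A rough sketch of that prefix check, using the core Digest::MD5 module. The names prefix_digest and split_by_prefix are hypothetical, and the 1024-byte default is just the figure suggested above, not a tuned value.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: MD5 of at most the first $len bytes of a file.
# Returns undef if the file cannot be opened or read.
sub prefix_digest {
    my ($path, $len) = @_;
    $len ||= 1024;
    open my $fh, '<:raw', $path or return;
    my $got = read($fh, my $buf, $len);
    close $fh;
    return unless defined $got;
    return md5_hex($buf);
}

# Split one group of same-size files into sub-groups whose prefixes hash
# alike. Only those sub-groups still need a full comparison.
sub split_by_prefix {
    my ($len, @paths) = @_;
    my %by_digest;
    for my $path (@paths) {
        my $digest = prefix_digest($path, $len);
        next unless defined $digest;
        push @{ $by_digest{$digest} }, $path;
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

A matching prefix digest only says the files start the same way, so the survivors still need a full comparison, for instance with the core File::Compare module.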
Re^3: Remove Duplicate Files
by gaal (Parson) on Oct 29, 2004 at 08:34 UTC
    Then again, hardlinks are less of a concern for cleanup, because they don't waste disk space.
      Well, any program that compares files and removes duplicates without checking whether they are links will remove excess links. By looking at the inode and device numbers to detect links, you can gain one of two things: the option to *keep* links, which can be pretty useful for binaries that act differently depending on how they are invoked, or a speedier comparison, as you don't have to calculate the MD5 hash and then compare the entire file.
        Of course. I was pointing out that if the purpose of the tool was to reduce disk usage, keeping hardlinks wouldn't hurt its functionality. You are right that hardlinks can often be a good thing, but without further information about the environment this was supposed to run in, we can't tell whether leaving them is the right thing. (Probably, it's just irrelevant and ok to leave undefined.)
