
Re: data structure advice please

by mickeyn (Priest)
on Nov 25, 2006 at 20:03 UTC

in reply to data structure advice please

If you're hunting duplicates, it would be wiser to compare MD5 sums (Digest::MD5) rather than names.

As far as data structure goes - I would recommend something in the form of:

    %files = (
        <md5sum1> => [ <path1>, <path2>, ... ],
        <md5sum2> => [ <path1>, <path2>, ... ],
    );

It will allow you to easily iterate over your files, locate and count them.
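
As a rough illustration of the structure above, here is a minimal sketch that fills such a hash from a list of paths. The helper names md5_of_file and group_by_md5 are made up for this example; only Digest::MD5 itself comes from the post.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 digest of one file's contents.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # digest the raw bytes, not a translated text stream
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hypothetical helper: build { md5sum => [ path, path, ... ] }.
sub group_by_md5 {
    my @paths = @_;
    my %files;
    # push autovivifies the array ref the first time a digest is seen.
    push @{ $files{ md5_of_file($_) } }, $_ for @paths;
    return \%files;
}
```

Any digest whose array holds more than one path is then a set of probable duplicates, and scalar @{ $files->{$sum} } gives the count.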


-- Mickey

Replies are listed 'Best First'.
Re^2: data structure advice please
by johngg (Abbot) on Nov 25, 2006 at 23:02 UTC
    It would probably be better to build a HoA keyed by file size ((stat ($file))[7]) rather than MD5 sum, with values being arrays of files of a particular size. Any two files of different size cannot be duplicates, obviously. Any hash element that contained just one file could then be discarded, thus avoiding the expense of MD5 sums or file comparisons for a proportion of the files you are testing.
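
A minimal sketch of that size-keyed HoA, assuming a plain list of paths as input. The sub name candidates_by_size is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: build { size_in_bytes => [ path, ... ] } and
# keep only the sizes shared by two or more files.
sub candidates_by_size {
    my @paths = @_;
    my %by_size;
    for my $path (@paths) {
        my $size = (stat $path)[7];
        next unless defined $size;    # stat failed, skip this path
        push @{ $by_size{$size} }, $path;
    }
    # A size seen only once cannot belong to a duplicate.
    delete @by_size{ grep { @{ $by_size{$_} } < 2 } keys %by_size };
    return \%by_size;
}
```

Only the files left in the returned hash need the more expensive MD5 or byte-by-byte check.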

    Once you have sets of files the same size you can compare them either by generating MD5 sums, by reading the files (slurping if small or in chunks if large) and doing string comparisons or by using external commands like cmp. (I would recommend against using external commands.) You can save a lot of time by avoiding re-doing comparisons when you have several files of the same size. For example, given fileA to fileE, you would logically start by comparing fileA to the other four in turn, then fileB to fileC, fileD and fileE, and so on. If fileA differs from fileB but is the same as fileE you can see that it is not necessary to compare fileB with fileE because you already know they differ.
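
One way to skip the redundant comparisons is to keep a list of groups and test each new file only against the first member of each group. The sketch below does that with the core File::Compare module, which is my choice for the example rather than something named above; partition_identical is likewise a made-up name.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Hypothetical helper: split same-size files into groups with identical
# content. Each file is compared only with one representative per group.
sub partition_identical {
    my @paths = @_;
    my @groups;    # each element: [ representative, identical files... ]
  FILE:
    for my $path (@paths) {
        for my $group (@groups) {
            my $result = compare($group->[0], $path);    # 0 means equal
            die "Cannot compare $group->[0] and $path" if $result < 0;
            if ($result == 0) {
                push @$group, $path;
                next FILE;
            }
        }
        push @groups, [$path];    # matches nothing so far: new group
    }
    return @groups;
}
```

With fileA to fileE, once fileE matches fileA it joins fileA's group and is never compared with fileB. Groups with more than one member are the duplicate sets.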

    I hope these thoughts are of use.



Re^2: data structure advice please
by anadem (Scribe) on Nov 25, 2006 at 21:53 UTC
    Thanks, that will catch binary dupes nicely. I wanted to start with duplicated filenames and leave binaries for a later pass: quite a bit of the music my kids leave around has different binary content but the same filename. The reverse case happens too, so I'll add MD5 summing as an option. I also want to save the date and size as well as the path+name (hence the array). The bit I've been most unclear on is how to go from finding the first file and setting
    key1 => [ <path1> ]
    to adding the data for the second instance:
    key1 => [ <path1>, <path2> ]
    but hopefully I can put the suggestions above into practice now.
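
For the step from one entry to two, push on a dereferenced hash value should cover both cases, because Perl autovivifies the array the first time a key is seen. A minimal sketch, assuming the key is the base filename and each entry is a small hash holding path, size and modification time; that record layout and the name add_file are assumptions for illustration, not something settled in the thread.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: record one file under its base name.
# The first call for a name creates the array; later calls append to it.
sub add_file {
    my ($files, $path) = @_;
    my ($size, $mtime) = (stat $path)[7, 9];    # undef if stat fails
    push @{ $files->{ basename($path) } },
        { path => $path, size => $size, mtime => $mtime };
    return;
}
```

After calling add_file(\%files, $path) for every path found, any name whose array has more than one element is a duplicated filename, and scalar @{ $files{$name} } is the number of copies.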
