Alternatives to DB for comparable lists
by peterrowse (Initiate) on May 15, 2018 at 21:32 UTC
peterrowse has asked for the wisdom of the Perl Monks concerning the following question:
Although much has been written here already about handling large amounts of data efficiently, for the first time in a few years I cannot find a concrete answer to my specific problem, so I wondered if anyone with experience in these matters might comment.
It's probably a database / not-database type of question, in that I need to MD5 many files across several machines, compare them, and find duplicates. Since complete integrity against bit rot is needed, I am using the MD5 route rather than characteristics like file size and date.
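For the checksumming step itself, the core Digest::MD5 module can hash a file by streaming it from disk, so even the largest files never need to fit in memory. A minimal sketch (the subroutine name is illustrative, not from the original post):

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute the MD5 hex digest of a file, streaming it from disk
# in chunks so that even a 30GB file does not need to fit in memory.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $md5 = Digest::MD5->new;
    $md5->addfile($fh);        # reads the handle incrementally
    close $fh;
    return $md5->hexdigest;    # 32-character lowercase hex string
}
```

Reading in `:raw` mode matters here: any layer that translates line endings would change the digest relative to what `md5sum` reports on another machine.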
So the total number of files which will in the end be MD5ed will probably be around 750k, and each machine will probably have up to 250k files in its file system. File sizes will range from 30GB for a very few files down to less than a KB for many, with most in the few-MB range. But that is probably not particularly important; what I am having difficulty with is how I store the MD5 sums.
I will need to record somehow around 250k records of path / filename / size and date / MD5 sum for each machine, 6 machines in total. Then at a later time I will compare the lists and work out a copying strategy which minimises copying time over a slow link, but makes sure that any files which are named the same yet differ in MD5 sum can be checked manually.
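One simple alternative to a per-machine database is a flat tab-separated manifest file per machine, which is trivial to copy over a slow link and easy to sort or merge later. A sketch using the core File::Find module (the record layout and subroutine name are assumptions for illustration):

```perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Walk a directory tree and append one tab-separated record per file:
#   md5 <TAB> size <TAB> mtime <TAB> path
# One such manifest per machine can later be copied to the processing
# box and compared.  Write the manifest OUTSIDE the tree being walked,
# or it will be picked up mid-write.
sub write_manifest {
    my ($root, $manifest) = @_;
    open my $out, '>', $manifest or die "Cannot open $manifest: $!";
    find(sub {
        return unless -f $_;
        my ($size, $mtime) = (stat _)[7, 9];
        open my $fh, '<:raw', $_ or return;   # skip unreadable files
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        print $out join("\t", $md5, $size, $mtime, $File::Find::name), "\n";
    }, $root);
    close $out;
}
```

If indexed lookups turn out to be needed, the same records drop straight into DBD::SQLite (a single-file database, no server) with an index on the md5 and filename columns.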
So I am wondering whether I should do this with a database on each machine, with each dataset then copied to my processing machine to be compared, or with another Perl tool which implements a simpler type of storage, later copied and processed in the same way. I will want to sort or index the list(s) somehow, probably by both MD5 sum and filename I am thinking, to be thorough in checking for duplicates and bad files. In processing the completed MD5 lists I will probably want to read each MD5 into a large array, check for duplicates with each read, check the other file characteristics when a duplicate is detected, and create a list of results which I can then act on.
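For the duplicate check described above, a hash keyed on the MD5 sum is usually a better fit than a large array: each lookup is constant time rather than a scan, and 750k 32-byte keys fit comfortably in memory. A sketch, assuming the tab-separated manifest layout (md5, size, mtime, path) is hypothetical:

```perl
use strict;
use warnings;

# Read one or more manifests (md5 \t size \t mtime \t path per line)
# and group paths by digest.  Any digest seen with more than one path
# is a duplicate; same-named files with different digests can then be
# flagged separately for manual checking.
sub find_duplicates {
    my (@manifests) = @_;
    my %by_md5;    # md5 => [ paths... ]
    for my $m (@manifests) {
        open my $fh, '<', $m or die "Cannot open $m: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($md5, $size, $mtime, $path) = split /\t/, $line, 4;
            push @{ $by_md5{$md5} }, $path;
        }
        close $fh;
    }
    # Keep only digests that occur more than once.
    return { map  { $_ => $by_md5{$_} }
             grep { @{ $by_md5{$_} } > 1 } keys %by_md5 };
}
```

A second hash keyed on filename, built in the same pass, would catch the opposite case: identical names with differing digests.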
Opinions on which route (a DB, or any particular module which might suit this application) would be greatly appreciated.
Kind regards, Pete