
Alternatives to DB for comparable lists

by peterrowse (Initiate)
on May 15, 2018 at 21:32 UTC (#1214587=perlquestion)
peterrowse has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Although there is much written about how to handle large amounts of data efficiently here already, I find myself for the first time in a few years not being able to find a concrete answer to my specific problem, so I wondered if anyone with experience in these matters might comment.

It's probably a database / not-database type of question, in that I need to md5 many files across several machines, compare them and find duplicates. Since complete integrity against bit rot is needed, I am using the md5 route rather than characteristics such as file size and date.

So the total number of files which will in the end be md5ed will probably be around 750k, and each machine will probably have up to 250k files in its file system. File sizes will range from 30GB for a very few files to less than 1 KB for many, with most being in the few-MB range. But that is probably not particularly important; what I am having difficulty with is how I store the md5 sums.

I will need to record somehow around 250k path / filename / size and date / md5sums for each machine, 6 machines in total. Then at a later time I will compare each list and work out a copying strategy which minimises copying time over a slow link but makes sure that if there are any files which are named the same but differ in md5 sum they can be checked manually.

So I am wondering whether to do this with a database on each machine, with each dataset then copied to my processing machine to be compared, or with another Perl tool which implements a simpler type of storage, later copied and processed in the same way. I will want to sort or index the list(s) somehow, probably by both md5 sum and filename, to be able to be thorough in checking for duplicates and bad files. In processing the completed md5 lists I will probably want to read each md5 into a large array, check for duplicates with each read, check the other file characteristics whenever a duplicate is detected, and create a list of results which I can then act on.
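The comparison step described above can be sketched with plain Perl hashes rather than a large array, since a hash lookup finds an existing digest without rescanning the list. This is only an illustration: the record layout (hashrefs with host, path, name and md5 keys) and the sub names are assumptions, not part of any existing tool.

```perl
use strict;
use warnings;

# Group records by digest and keep only digests seen more than once.
# Field names (host, path, name, md5) are assumptions for this sketch.
sub find_duplicates {
    my (@records) = @_;
    my %by_md5;
    push @{ $by_md5{ $_->{md5} } }, $_ for @records;
    my %dups;
    for my $md5 (keys %by_md5) {
        $dups{$md5} = $by_md5{$md5} if @{ $by_md5{$md5} } > 1;
    }
    return \%dups;
}

# File names that occur with more than one distinct digest.
# These are the candidates for manual checking.
sub find_name_conflicts {
    my (@records) = @_;
    my %md5s_by_name;
    $md5s_by_name{ $_->{name} }{ $_->{md5} }++ for @records;
    my @conflicts = grep { keys %{ $md5s_by_name{$_} } > 1 } keys %md5s_by_name;
    return [ sort @conflicts ];
}

# Made-up sample data, purely for demonstration.
my @records = (
    { host => 'a', path => '/data', name => 'x.bin', md5 => 'aaa' },
    { host => 'b', path => '/old',  name => 'x.bin', md5 => 'bbb' },
    { host => 'c', path => '/bak',  name => 'y.bin', md5 => 'aaa' },
);
my $dups      = find_duplicates(@records);
my $conflicts = find_name_conflicts(@records);
printf "%d duplicated digest(s), %d conflicting name(s)\n",
    scalar keys %$dups, scalar @$conflicts;
```

With around 750k records both hashes should fit comfortably in memory on an ordinary machine, so no sorting pass is needed for this step.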

Opinions on which route (DB or any particular module which might suit this application) would be greatly appreciated.

Kind regards, Pete


Replies are listed 'Best First'.
Re: Alternatives to DB for comparable lists
by afoken (Abbot) on May 15, 2018 at 22:29 UTC
    complete integrity against bit rot is needed

    Consider using a filesystem that guarantees exactly that. ZFS does, when using several disks, and can be considered stable. It also offers deduplication, snapshots, transparent compression, replication across machines, and more.

    btrfs attempts to do the same, or at least some of the important parts, but I would still call it experimental.

    If you want a no-brainer setup, find an x86_64 machine with RAM maxed out, add some disks, and install FreeNAS or the fork NAS4Free. It natively uses ZFS and doesn't force you to think much about it. Add a similar second machine and use replication for backup.


    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      I am already a convert - the destination is ZFS, it's great. The problem is all this legacy stuff, spanning many years and moved / duplicated from disk to disk many times. It's all on EXT4 now, which seems pretty good, but in the past... So at times I am going to need to work out manually which are the damaged versions. But I hope to hone it down to a few files, with the rest all matching from version to version (hopefully a few, that is; we will see).
Re: Alternatives to DB for comparable lists
by mxb (Monk) on May 16, 2018 at 10:04 UTC

    If I understand correctly, you wish to obtain the following for each file:

    • MD5 hash
    • Source server
    • File path
    • File name
    • Date of collection

    Where the files are distributed over six servers.

    This probably depends upon how you are planning to collect all the data, but my personal approach would be to have a small script running on each of the six servers performing the hashing and sending each result back to a common collector. This assumes network connectivity.

    I think it would be relatively easy to calculate the tuple of the five items for each server with a script and issue them over the network back to a central collection script. Each server can be hashing and issuing results simultaneously to the same collector.
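A minimal sketch of the per-server hashing script, using only core modules (Digest::MD5, File::Find, File::Basename, Sys::Hostname). The sub names and the record layout are assumptions made for illustration, and it prints tab-separated lines to STDOUT instead of talking to a collector over the network.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Basename qw(basename);
use File::Find;
use Sys::Hostname qw(hostname);

# Hash one file in streaming fashion, so that multi-gigabyte files
# are never loaded into memory in one piece.
sub md5_of_file {
    my ($file) = @_;
    open my $fh, '<', $file or return;    # skip unreadable files
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Walk a directory tree and return one record per regular file.
# The record keys are assumptions for this sketch.
sub scan_tree {
    my ($root) = @_;
    my $host = hostname();
    my @records;
    find({
        no_chdir => 1,
        wanted   => sub {
            my $file = $File::Find::name;
            my @st = lstat $file or return;
            return unless -f _;           # regular files only, no symlinks
            my $md5 = md5_of_file($file);
            return unless defined $md5;
            push @records, {
                host  => $host,
                path  => $file,
                name  => basename($file),
                size  => $st[7],
                mtime => $st[9],
                md5   => $md5,
            };
        },
    }, $root);
    return @records;
}

# One tab-separated line per file for every directory given on the command line.
for my $root (@ARGV) {
    print join("\t", @{$_}{qw(host path name size mtime md5)}), "\n"
        for scan_tree($root);
}
```

Swapping in SHA-256 would mean replacing the Digest::MD5 call with the core Digest::SHA module; the rest stays the same. Paths containing tabs or newlines would need escaping before this output format could be trusted.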

    While there may be a lot of data to hash, the actual results are going to be small. Therefore, as you know exactly what you are obtaining (the five items of data) I would just go the easiest route and throw them in a table in DBD::SQLite.

    Then, once you have all the data in your DB, you can perform offline analysis as much as you want, relatively cheaply.
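As a rough sketch of the SQLite route, assuming DBI and DBD::SQLite are installed (neither is a core module). The table name, column names and sub names are invented for illustration. The two queries correspond to the checks the original question asks for: identical content in several places, and identical names with differing content.

```perl
use strict;
use warnings;
use DBI;

# Open (or create) the database. Table and column names are assumptions.
sub open_db {
    my ($file) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$file", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS files (
    host  TEXT NOT NULL,
    path  TEXT NOT NULL,
    name  TEXT NOT NULL,
    size  INTEGER,
    mtime INTEGER,
    md5   TEXT NOT NULL,
    PRIMARY KEY (host, path)
)
SQL
    $dbh->do('CREATE INDEX IF NOT EXISTS files_md5  ON files (md5)');
    $dbh->do('CREATE INDEX IF NOT EXISTS files_name ON files (name)');
    return $dbh;
}

# Insert a batch of record hashrefs inside a single transaction,
# which keeps bulk loading fast.
sub insert_records {
    my ($dbh, @records) = @_;
    my $sth = $dbh->prepare(
        'INSERT OR REPLACE INTO files (host, path, name, size, mtime, md5)
         VALUES (?, ?, ?, ?, ?, ?)');
    $dbh->begin_work;
    $sth->execute(@{$_}{qw(host path name size mtime md5)}) for @records;
    $dbh->commit;
    return scalar @records;
}

# Digests stored for more than one file: duplicate content.
sub duplicate_digests {
    my ($dbh) = @_;
    return $dbh->selectcol_arrayref(
        'SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1 ORDER BY md5');
}

# Names stored with more than one distinct digest: check these by hand.
sub conflicting_names {
    my ($dbh) = @_;
    return $dbh->selectcol_arrayref(
        'SELECT name FROM files GROUP BY name
         HAVING COUNT(DISTINCT md5) > 1 ORDER BY name');
}
```

Calling open_db('files.db') on the collection machine and feeding it the records from each server would be one way to use this; the file name is only an example.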

    As a side note, I'd probably go with SHA-256 rather than MD5: practical collision attacks are known for MD5 but not for SHA-256, and it's not that much more computationally expensive.

      To add to your answer, I have a similar system running on some of my servers, indexing some pretty nastily-disorganized Windows fileshares. I put everything into a PostgreSQL database. That lets me do all kinds of metadata analysis with a few simple SQL statements.

      Everything "below a few tens of millions of entries" shouldn't be a problem for a decent low- to midrange server built within the last 8 years. My current, 8 year old, development server is used for this kind of crap all the time without any issues.

      I'm pretty sure that running fstat() on all those files is going to be a major slowdown, and the checksumming certainly needs to be done locally, not over the network.

      "For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it get's food on the table."
Re: Alternatives to DB for comparable lists
by Perlbotics (Chancellor) on May 16, 2018 at 18:32 UTC

    One approach might be:

    • set up a DB server on your collection host
    • run your MD5 tool on each host and depending on your network availability:
      • with networking: contact DB and INSERT the new data on the fly (via internal network or SSH-/VPN-tunnel)
      • w/o networking: output data line by line in a format that your DB supports for batch-loading (store in file for offline transport)
    • run your tasks on the DB

    Perhaps sending the batch-lines to STDOUT is the easiest approach where the tool could even be invoked by an ssh-command issued on the collection host? That also eliminates the requirement for DB drivers on the host to be scanned.

    Use a header/trailer or checksum to assert completeness/integrity of the chunk of lines transmitted and perhaps also add some interesting meta-data (creation time, IP, etc.).
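One possible shape for such a trailer, sketched with the core Digest::MD5 module. The "#END" line format and the sub names are assumptions for illustration; any batch-loading format the DB accepts could be framed in the same way.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Append a trailer carrying the line count and an MD5 over the payload.
# The "#END<TAB>count<TAB>md5" format is an assumption for this sketch.
sub frame_chunk {
    my (@lines) = @_;
    my $payload = join '', map { "$_\n" } @lines;
    return $payload
        . sprintf("#END\t%d\t%s\n", scalar @lines, md5_hex($payload));
}

# Return an array reference of record lines if the chunk is complete
# and intact, or undef if it was truncated or altered in transit.
sub verify_chunk {
    my ($chunk) = @_;
    my ($payload, $count, $sum) =
        $chunk =~ /\A(.*)^#END\t(\d+)\t([0-9a-f]{32})\n\z/ms
        or return;
    my @lines = split /\n/, $payload;
    return unless @lines == $count && md5_hex($payload) eq $sum;
    return \@lines;
}

# Round trip with two made-up records.
my $chunk = frame_chunk("hostA\t/data/x.bin\taaa", "hostA\t/data/y.bin\tbbb");
my $lines = verify_chunk($chunk);
print defined $lines ? "chunk ok\n" : "chunk damaged\n";
```

The sender would print the framed chunk to STDOUT, and the collection host would run verify_chunk on whatever arrives through ssh before loading it.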


    Oh, you asked for DB-alternatives... Rough estimation: 750k entries with a mean entry size of ca. 500 bytes results in a total size of approx. 375 MB. My experiment with Storable resulted in a file of size 415 MB. Reading/writing took ca. 2.0/3.5s on a moderate PC (3GHz, SSD).

    Merging and storing all data into a native Perl data structure and using Storable for persistence looks feasible. PRO: fast speed for analytics; CON: no luxury that comes with a DB.
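A minimal sketch of the Storable route, assuming records are hashrefs with at least an md5 key; the sub names and file name are invented for illustration. It uses nstore rather than store so that the file is written in network byte order and can be read on a machine with a different architecture.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Index records by digest: md5 => [ record, ... ].
# The record layout is an assumption for this sketch.
sub build_index {
    my (@records) = @_;
    my %index;
    push @{ $index{ $_->{md5} } }, $_ for @records;
    return \%index;
}

# Persist the whole structure to disk in a portable byte order.
sub save_index {
    my ($index, $file) = @_;
    nstore($index, $file);
    return $file;
}

# Load a previously saved index back into memory.
sub load_index {
    my ($file) = @_;
    return retrieve($file);
}
```

A typical use would be save_index(build_index(@records), 'files.sto') on the collection host after merging the lists, and load_index('files.sto') at the start of each later analysis run.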

Node Type: perlquestion [id://1214587]
Approved by haukex