Re: Alternatives to DB for comparable lists

by mxb (Monk)
on May 16, 2018 at 10:04 UTC (#1214623)

in reply to Alternatives to DB for comparable lists

If I understand correctly, you wish to obtain the following for each file:

  • MD5 hash
  • Source server
  • File path
  • File name
  • Date of collection

The files are distributed over six servers.

This probably depends upon how you are planning to collect all the data, but my personal approach would be to have a small script running on each of the six servers performing the hashing and sending each result back to a common collector. This assumes each server can reach the collector over the network.

I think it would be relatively easy to have a script on each server calculate the tuple of the five items for every file and send them over the network to a central collection script. All six servers can be hashing and sending results to the same collector simultaneously.
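The per-server step can be sketched roughly as follows. This is a minimal illustration in Python, not a finished tool; in Perl, File::Find and Digest::SHA would cover the same ground. The names `hash_file` and `collect_records` are hypothetical, the date format is an assumption, and sending the tuples to the collector is deliberately left out.

```python
import hashlib
import os
import socket
from datetime import datetime, timezone


def hash_file(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_records(root, server=None):
    """Yield (hash, server, directory, file name, collection date) per file."""
    # Assumption: the local host name is good enough as the "source server".
    server = server or socket.gethostname()
    # Assumption: a UTC calendar date is a sufficient "date of collection".
    collected = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                yield (hash_file(full), server, dirpath, name, collected)
            except OSError:
                continue  # unreadable file: skip it rather than abort the run
```

Each yielded tuple is small, so it could be written as one line of text to a socket or simply appended to a file that is copied to the collector later.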

While there may be a lot of data to hash, the actual results are going to be small. Since you know exactly what you are collecting (the five items of data), I would just take the easiest route and put them in a table in an SQLite database via DBD::SQLite.

Then, once you have all the data in your DB, you can perform offline analysis as much as you want, relatively cheaply.
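As a rough sketch of what that table and one offline query might look like, here is an illustration using Python's built-in sqlite3 module; the same SQL would work unchanged through DBI with DBD::SQLite. The table name `files`, the column names, and the helper functions are all hypothetical, and the example query (hashes seen on more than one server) is only a guess at the kind of comparison you want.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the results database with one five-column table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               hash      TEXT NOT NULL,
               server    TEXT NOT NULL,
               path      TEXT NOT NULL,
               name      TEXT NOT NULL,
               collected TEXT NOT NULL
           )"""
    )
    # Most comparisons group or join on the hash, so index it.
    conn.execute("CREATE INDEX IF NOT EXISTS files_hash ON files (hash)")
    return conn


def store(conn, records):
    """Insert an iterable of five-item tuples in a single transaction."""
    with conn:
        conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", records)


def hashes_on_multiple_servers(conn):
    """Return (hash, number of distinct servers) for content seen on 2+ servers."""
    return conn.execute(
        """SELECT hash, COUNT(DISTINCT server)
           FROM files
           GROUP BY hash
           HAVING COUNT(DISTINCT server) > 1
           ORDER BY hash"""
    ).fetchall()
```

Passing a file name instead of ":memory:" keeps the results on disk, so the analysis can be rerun without hashing anything again.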

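As a side note, I'd probably go with SHA-256 rather than MD5. MD5 is no longer collision-resistant (colliding files can be constructed deliberately, even though accidental collisions remain very unlikely), and SHA-256 is not that much more computationally expensive.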

Re^2: Alternatives to DB for comparable lists
by cavac (Deacon) on May 16, 2018 at 11:59 UTC

    To add to your answer, I have a similar system running on some of my servers, indexing some pretty nastily disorganized Windows file shares. I put everything into a PostgreSQL database. That lets me do all kinds of metadata analysis with a few simple SQL statements.

    Anything below a few tens of millions of entries shouldn't be a problem for a decent low- to midrange server built within the last 8 years. My current, 8-year-old development server is used for this kind of crap all the time without any issues.

    I'm pretty sure that running fstat() on all those files is going to be a major slowdown, and the checksumming certainly needs to be done locally, not over the network.

    "For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it gets food on the table."
