http://www.perlmonks.org?node_id=951677


in reply to Re^3: how to merge many files of sorted hashes?
in thread how to merge many files of sorted hashes?

I think the sort-merge join you explained earlier should solve my problem immediately (I didn't have an idea of how to do it myself). Now I'm mainly debating whether I need to incorporate this into a database, because my programming is really at a beginner's level.
The entire purpose of doing this is explained here. The estimated total size of all the hash tables is probably a few hundred GB.
Let me try your suggestion first. Thank you very much for your help. I really appreciate it.
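Here is roughly what I understand the sort-merge to look like, in case I'm misunderstanding. It's only a minimal sketch: the file names, the tab-separated key/value layout, and the "just join the values" combine step are assumptions for illustration, and each input is assumed to be already sorted by key in string order.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical inputs: one "key<TAB>value" pair per line, each file
    # already sorted by key in the same (string) order used by cmp below.
    my @files = ('part1.txt', 'part2.txt', 'part3.txt');

    my (@fh, @current);              # filehandle and current [key, value] per file
    for my $i (0 .. $#files) {
        open $fh[$i], '<', $files[$i] or die "can't open $files[$i]: $!";
        $current[$i] = read_pair($fh[$i]);
    }

    open my $out, '>', 'merged.txt' or die "can't open merged.txt: $!";

    while (1) {
        # Which files still have data, and which of those holds the smallest key?
        my @live = grep { defined $current[$_] } 0 .. $#files;
        last unless @live;
        my ($min) = sort { $current[$a][0] cmp $current[$b][0] } @live;
        my $key = $current[$min][0];

        # Collect this key's value from every file that has it, advancing
        # each of those files to its next line.
        my @values;
        for my $i (@live) {
            next unless $current[$i][0] eq $key;
            push @values, $current[$i][1];
            $current[$i] = read_pair($fh[$i]);
        }

        # Combine the values however the application needs; here we just
        # write them all on one output line.
        print {$out} join("\t", $key, @values), "\n";
    }
    close $out;

    sub read_pair {
        my ($fh) = @_;
        defined(my $line = <$fh>) or return undef;
        chomp $line;
        return [ split /\t/, $line, 2 ];
    }

(Finding the minimum with sort is fine for a handful of files; with very many files a heap would presumably be better.) Only one line per file is ever held in memory, so the merge's memory use stays tiny no matter how big the files get.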

Re^5: how to merge many files of sorted hashes?
by wrog (Friar) on Feb 03, 2012 at 22:47 UTC
    Now I'm mainly debating whether I need to incorporate this into a database
    It's a hard call.
    • On the one hand, the database folks have (theoretically, at least) already done the work of optimizing so that disk and memory get used most effectively; rolling your own risks re-inventing the wheel, as it were.
    • On the other hand, relational databases are completely generic, this appears not to be a typical database application, and given the volume of data that you have, you're going to care a lot about performance. (It's not unusual for data to take up 3-10 times as much space once it gets incorporated into a database, because there's all of this overhead from the designers having to plan for everything rather than just your specific application.) That means you'll have to tune things rather carefully, e.g., make sure you build the right indices for the right columns, be aware of which kinds of joins will be faster or slower, etc.

      The core problem is that the database adds a whole new layer to learn about, and if stuff turns out to be slow, it may be that much harder to find the culprit; whereas if you have some understanding of what the underlying hardware does and your application is sufficiently strange, you may indeed get better performance just writing it directly, even as a beginning programmer (emphasis on "may").
    • On the third hand, databases can sometimes do more than you think they can. In particular, your data seems to be spatial in nature. If things are not varying that much in the 3rd dimension (i.e., you essentially have geographic data as opposed to, say, astronomical data), you may want to check out PostGIS, a spatial extension to the PostgreSQL database that adds geometry types and spatial operators.
    You may also just want to try replacing "copy key+value to output file" with an "INSERT INTO table VALUES (?, ?, ...)" executed with the key and values as placeholder arguments, and see what the relative speed / memory usage is like.
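    For instance, with DBI and DBD::SQLite (just a sketch: the database file, table name, and columns below are made up for illustration and would need to match your real data), that step might look something like this:

        use strict;
        use warnings;
        use DBI;

        # Hypothetical schema: adjust the table/columns to your real key + values.
        my $dbh = DBI->connect('dbi:SQLite:dbname=points.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS points (k TEXT, v1 REAL, v2 REAL)');

        my $sth = $dbh->prepare('INSERT INTO points VALUES (?, ?, ?)');

        # Inside the merge loop, instead of printing a line to the output file:
        my ($key, @values) = ('some_key', 1.5, 2.5);   # stand-in for real merged data
        $sth->execute($key, @values);                  # one placeholder per column

        $dbh->commit;        # commit in batches; committing every row is very slow
        $dbh->disconnect;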

    Main things to know about disks:

    • Moving the read head takes a long time (milliseconds) as compared with memory stuff (on the order of 100 nanoseconds) and CPU/cache stuff (around a nanosecond), so you have high latency, i.e., once you issue the read command, you're going to be waiting quite a while for the data to arrive (...if you have other work the CPU can be doing while it's waiting, so much the better, but (a) that's hard to do right in Perl and (b) the wait isn't actually happening as often as you might think; see the next two items)
    • Disks are capable of random access (i.e., you don't have to start reading at the beginning of a file; this is what seek and tell are for; there's a small sketch of both after this list), but...
    • Disks are optimized for sequential access. Once the read head gets to a certain spot and reads the byte that you want, the disk is still spinning and the head has got nothing better to do but keep reading, and so you may get the next 100,000 bytes showing up in your disk cache (a piece of memory set aside for the disk driver) essentially for free. Meaning it's not the case that every read command is causing the head to move and especially not the case when you're reading sequentially.
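    A tiny illustration of the two access patterns (the file name is made up; assume it's the merged output from earlier):

        use strict;
        use warnings;

        my $file = 'merged.txt';            # hypothetical file from the merge step

        # Sequential access: read-ahead and the disk cache make this cheap per line.
        open my $seq, '<', $file or die "can't open $file: $!";
        while (my $line = <$seq>) {
            # ... process $line ...
        }
        close $seq;

        # Random access: remember where a record starts, jump back to it later.
        open my $in, '<', $file or die "can't open $file: $!";
        my $offset = tell $in;              # byte position of the next line
        my $first  = <$in>;                 # read past it
        # ... much later ...
        seek $in, $offset, 0;               # whence 0 = absolute position (SEEK_SET)
        my $again = <$in>;                  # re-reads the same line as $first
        close $in;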

    Main things to know about memory:

    • Everything you create (hashes, arrays, objects) lives in memory somewhere; you can't really control where, but you can control how much you're creating at any given time, and it's usually a good bet that stuff created immediately before or afterwards will be in roughly the same place.
    • Memory comes in pages, a page being a range of memory addresses (e.g., 4 Kbytes); if you try to use too many different pages at the same time, then some of your pages will actually be living on the disk (because you never actually have that much memory), at which point, see above (i.e., if you're wondering why your program is drastically slowing down).

    I'd say understanding Locality, i.e., trying to keep the stuff you're working on at any given time in close proximity in memory, is about 70% of Computer Science, and these days it's generally most of the battle in getting something to run fast once you've settled on a reasonable algorithm.
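    To make that concrete with the same assumed tab-separated key/value layout as above (file names are made up): the first version below pulls every pair into one giant hash, so once the data outgrows RAM its pages start living on disk and every lookup can mean a seek; the second only ever holds a line at a time, which is essentially what the sort-merge approach buys you.

        use strict;
        use warnings;

        my @files = ('part1.txt', 'part2.txt');   # hypothetical inputs

        # Memory-hungry: the entire data set ends up in one giant hash.
        my %all;
        for my $file (@files) {
            open my $fh, '<', $file or die "can't open $file: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($key, $value) = split /\t/, $line, 2;
                push @{ $all{$key} }, $value;     # hundreds of GB => heavy paging
            }
        }

        # Memory-friendly: stream one record at a time, write the result out,
        # and forget it; the working set stays tiny.
        for my $file (@files) {
            open my $fh, '<', $file or die "can't open $file: $!";
            while (my $line = <$fh>) {
                # ... handle one record, emit output, keep nothing around ...
            }
        }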

      Thank you so much for the detailed explanation, Wrog. I do think the learning curve is steep if I want to implement a database for my particular research problem, given my current standing. However, I'll definitely make an effort to understand basics like how disks and memory work as I go along with my projects, which I now feel is indispensable. The scale of my previous work just never blew up like this to remind me how poorly my algorithms were written.
      As for PostGIS, I have not taken a close look at it yet, and I'm not entirely sure its design permits the kind of flexibility I need for my actual application; I have many more attributes besides the coordinates of the points that I need to take care of, so maybe coding it myself will actually be easier and faster (to meet my project deadline).
      I just tried the method you suggested first, and it went orders of magnitude faster (as expected). I used this 100 point cloud to test the upper bound of the running time for all the point clouds, and with this largest one finishing within a day my calculations are now much more scalable. Thank you so much!