Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re^2: how to merge many files of sorted hashes?

by andromedia33 (Novice)
on Feb 03, 2012 at 01:14 UTC ( #951576=note: print w/replies, xml ) Need Help??

in reply to Re: how to merge many files of sorted hashes?
in thread how to merge many files of sorted hashes?

thank you very much! this sounds exactly like what i'm looking for.
on a second thought, maybe i do not HAVE to push everything inside a single file, but just keeping a few files. i need to think about this point, because i'll need to do pairwise comparison after constructing these hash tables (table against table) by looking for common keys. originally i thought that by having all keys sorted in a single file, it's really easy for me to make such comparisons. but maybe having several files can save a lot more cost at the table construction stage.
  • Comment on Re^2: how to merge many files of sorted hashes?

Replies are listed 'Best First'.
Re^3: how to merge many files of sorted hashes?
by wrog (Friar) on Feb 03, 2012 at 07:59 UTC
    Having seen your code here it's now clear that a given key can show up in multiple files and therefore you probably do want to be merging those values and only having the key appear in one place, which then means that a certain amount of copying is unavoidable.

    There's also an advantage to having output files that hold every key within a certain range, which then means your master file/table doesn't need to have a column saying which file a key is in (i.e., since you just infer it from the range the key is in), which then saves massive amounts of space...

      i think the sort-merge join you explained earlier should solve my problem immediately (i didn't have an idea on how to do it). now i'm more debating on whether i need to incorporate this into a database because my programming is really beginner's level.
      the entire purpose of doing this is explained here. the estimated size of all hash tables is probably a few hundred GB.
      let me try your suggestion first. thank you very much for your help. really appreciate it.
        now i'm more debating on whether i need to incorporate this into a database
        It's a hard call.
        • On the one hand, the database folks have (theoretically, at least) already done the work of optimizing so that the disk and memory get used most effectively; rolling your own risks re-inventing the wheel, as it were
        • On the other hand, relational databases are completely generic, this appears not to be a typical database application, and given the volume of data that you have — it not being unusual for data to take up 3-10 times as much space once it gets incorporated into the database, because there's all of this random overhead due to the designers having to plan for everything rather than just your specific application — you're going to care a lot about performance, which means you'll have to tune things rather carefully, e.g., make sure you build the right indices for the right columns, be aware of what kinds of joins will be faster or slower, etc...

          the core problem being that the database is adding a whole new layer to learn about and if stuff turns out to be slow, it may be that much harder to find the culprit, whereas if you have some understanding of what the underlying hardware does and your application is sufficiently strange, you may indeed get better performance just writing it directly, even as a beginning programmer (emphasis on "may")
        • On the third hand, databases can sometimes do more than you think they can. In particular, your data seems to be spatial in nature. If things are not varying in the 3rd dimension that much (i.e., you essentially have geographic data as opposed to, say, astronomical data), you may want to check out PostGIS, which is a database that has spatial operators in it.
        You may also just want to try replacing "copy key+value to output file" to "INSERT INTO table VALUES(?, ?...)",key,values... and see what the relative speed / memory usage is like.

        Main things to know about disks:

        • Moving the read head takes a long time (milliseconds) as compared with memory stuff (microseconds) and CPU/cache stuff (nanoseconds), so you have high latency, i.e., once you issue the read command, you're going to be waiting quite a while for data to arrive (...if you have other work the CPU can be doing while it's waiting, so much the better, but (a) that's hard to do right in Perl and (b) this isn't actually happening as often as you might think, see next two items)
        • Disks are capable of random access (i.e., you don't have to start reading at the beginning of a file; this is what seek and tell are for), but...
        • Disks are optimized for sequential access. Once the read head gets to a certain spot and reads the byte that you want, the disk is still spinning and the head has got nothing better to do but keep reading, and so you may get the next 100,000 bytes showing up in your disk cache (a piece of memory set aside for the disk driver) essentially for free. Meaning it's not the case that every read command is causing the head to move and especially not the case when you're reading sequentially.

        Main things to know about memory:

        • everything you create (hashes, arrays, objects) lives in memory somewhere; you can't really control where, but you can control how much you're creating at any given time, and it's usually a good bet that stuff created immediately before or afterwards will be in roughly the same place.
        • memory comes in pages, a page being a range of memory addresses (e.g., 4 Kbytes); if you try to use too many different pages at the same time, then some of your pages will actually be living on the disk (because you never actually have that much memory), at which point, see above (i.e., if you're wondering why your program is drastically slowing down).

        I'd say understanding Locality, i.e., trying to keep the stuff you're working on at any given time in close proximity in memory, is about 70% of Computer Science, and these days is generally most of the battle in getting something to run fast once you've settled onto a reasonable algorithm.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://951576]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2018-06-25 04:46 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (126 votes). Check out past polls.