http://www.perlmonks.org?node_id=732766


in reply to Memory Efficient Alternatives to Hash of Array

Consider storing your HoA on disk.

DBM::Deep is just the ticket for this type of job (large lookup tables).
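
A minimal sketch of what that looks like (the file name and keys are just placeholders): tie the structure to a file once, and from then on the push/read code is the same as for an in-memory HoA.

    use strict;
    use warnings;
    use DBM::Deep;

    # The hash of arrays lives in hoa.db on disk instead of in RAM.
    # "hoa.db" is just a placeholder file name.
    my $db = DBM::Deep->new( "hoa.db" );

    # Store: same push syntax as an ordinary in-memory HoA.
    push @{ $db->{apple} }, 'red', 'green';

    # Retrieve later, possibly from a different run of the program.
    my @colours = @{ $db->{apple} || [] };
    print "@colours\n";    # red green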

Re^2: Memory Efficient Alternatives to Hash of Array
by tilly (Archbishop) on Dec 27, 2008 at 14:41 UTC
    When you are dealing with a large data set, you need to trade programmer efficiency off against performance. While we in the Perl world are used to valuing programmer efficiency more highly, that stops being true with large datasets.

    Consider 5 GB of data broken into 50-byte lines, so there are 100 million lines of data. Suppose we want to store that data in DBM::Deep and then retrieve it. For the sake of argument, let's say each store or retrieve costs one seek to disk. That's 200 million seeks.

    How long do 200 million disk seeks take? Suppose your disk spins at 6000 rpm (typical consumer drives are in the 5400-7200 rpm range, so call it 6000). That means it spins 100 times per second, so a seek takes between 0 and 0.01 seconds, or 0.005 seconds on average. 200 million seeks therefore take a million seconds, which is 11.57 days, or a week and a half.
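
    Spelled out as a back-of-the-envelope calculation (same assumed figures as above):

        # 5 GB of 50-byte lines, 6000 rpm disk, one seek per store and per retrieve.
        my $lines    = 5_000_000_000 / 50;        # 100 million records
        my $seeks    = 2 * $lines;                # one seek to store + one to retrieve each
        my $avg_seek = 0.5 / 100;                 # half a revolution at 100 rev/s = 0.005 s
        my $seconds  = $seeks * $avg_seek;        # 1,000,000 seconds
        printf "%.1f days\n", $seconds / 86_400;  # ~11.6 days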

    Now how long does sorting that data take? Let's assume an absurdly slow disk: 10 MB/s. Suppose we code up a merge sort that needs 30 passes over the data (real external sorts do much of the work in RAM and so need far fewer passes to disk). Each pass reads and writes 5 GB, so that is 300 GB of throughput, which at 10 MB/s takes 30,000 seconds, or a bit over 8 hours. (If your machine really takes this long to sort this much data, you should upgrade to a machine from this millennium.)
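
    And the matching estimate for the merge sort, with the same deliberately pessimistic disk:

        my $pass_bytes = 2 * 5_000_000_000;            # each pass reads and writes 5 GB
        my $total      = 30 * $pass_bytes;             # 300 GB moved in total
        my $rate       = 10_000_000;                   # 10 MB/s, absurdly slow on purpose
        printf "%.1f hours\n", $total / $rate / 3600;  # ~8.3 hours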

    The moral? Hard drives are not like RAM. DBM::Deep and friends are efficient for programmers, but not for performance. If you have existing complex code that needs to scale, consider using them. But it is worth some programmer effort to stay away from them.
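
    For what it's worth, the "programmer effort" usually amounts to something like this: sort the file by key once (the system sort spills to temporary files, so it handles files far larger than RAM), then stream through it and handle each key's group of lines in turn. A rough sketch, assuming a tab-separated key/value file; data.txt, sorted.txt and handle_group are placeholder names:

        use strict;
        use warnings;

        # External sort by key; \t interpolates to a literal tab for sort's -t option.
        system(qq{sort -t '\t' -k1,1 data.txt -o sorted.txt}) == 0
            or die "sort failed: $?";

        # Every key's values are now adjacent, so one sequential read suffices.
        open my $fh, '<', 'sorted.txt' or die "open: $!";
        my ( $current, @values ) = ('');
        while ( my $line = <$fh> ) {
            chomp $line;
            my ( $key, $value ) = split /\t/, $line, 2;
            if ( $key ne $current ) {
                handle_group( $current, \@values ) if @values;
                ( $current, @values ) = ($key);
            }
            push @values, $value;
        }
        handle_group( $current, \@values ) if @values;

        sub handle_group {
            my ( $key, $values ) = @_;
            # ... whatever you would have done with $hash{$key} ...
        }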

      That's a pretty detailed explanation.

      But I am not very sure about the conclusion you have stated.

      Based on your comment, it seems sorting (whatever the size of the dataset) is going to take much less time than other methods like DBM::Deep.

      So, what exactly is the demarcating line between when to use 'sorting' and when to use DBM::Deep (for example)?

      Would you mind elaborating on that? Thanks
        The conclusion is correct. If you have a large data set living on disk, sorting is orders of magnitude more efficient. Furthermore on most commodity hardware you can't use DBM::Deep for a dataset this size because DBM::Deep is limited to a 4 GB filesize unless you are using a 64-bit Perl and you turn on the right options. But there are still many use cases for DBM::Deep.

        The most important is when you have existing code and a data set that is just a little bit too big to handle in RAM. You don't want to rewrite your code, so you use DBM::Deep and it will work, if slowly.

        A second case is when you have a pre-built data structure that you need to access. For instance you have a local index that you look things up in when serving a web page. Sure, building it is slow. But a typical web request is going to just do a lookup, which will be plenty fast. As long as you are grabbing a small amount of data each time, it will be quick.
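
        In code, that lookup path is tiny; a sketch assuming the index was built offline into a file called index.db and the key comes from the request:

            use DBM::Deep;

            # Open the pre-built index; nothing is loaded into RAM up front.
            my $index = DBM::Deep->new( "index.db" );

            sub lookup {
                my ($key) = @_;
                return $index->{$key};    # one small read per request
            }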

        But as cool as it is, it has limits imposed by the physical characteristics of the hardware, and you sometimes need to be aware of them.

Re^2: Memory Efficient Alternatives to Hash of Array
by dragonchild (Archbishop) on Dec 29, 2008 at 03:25 UTC
    As the maintainer of DBM::Deep, I need to echo tilly here. While dbm-deep lets you address such large datasets (>4G requires 64-bit Perl, simply because 32 bits only addresses 4G and change), it is going to be very slow. Just building the 4G file can take a very long time, and sorting a large array (more than roughly 10k entries) in dbm-deep will take a very, very long time.

    While one of dbm-deep's use cases is dealing with data that's too large to fit in RAM, you really want that access to be individual lookups, not bulk processing of the whole dataset. There are other languages and setups better suited for that kind of work, such as Erlang or CouchDB.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?