PerlMonks
Memory Efficient Alternatives to Hash of Array

by neversaint (Deacon)
on Dec 27, 2008 at 06:22 UTC (#732751=perlquestion)
neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
My code below tries to group the 'error_rate' (second column of the data) by its corresponding tag (first column), using a hash of arrays (HoA). Later I will process each of these groups with some functions (not shown here).

However, I found that with a large dataset the HoA is not feasible.
Is there a more memory-efficient alternative to a hash of arrays that addresses the problem equivalently?
use strict;
use Data::Dumper;
use Carp;

my %hold;
while ( <DATA> ) {
    chomp;
    next if /^\#/;
    my @elem = split /\t/, $_;

    # Keep in Hash of Array
    # HoA here consumes so much memory
    push @{ $hold{$elem[0]} }, $elem[1];
}

# is there a better way to keep/process the array
# other than HoA like %hold ?
foreach my $key ( sort keys %hold ) {
    my @ary = @{ $hold{$key} };
    # then I will process @ary for each key above
    print "$key\n";
}

# in practice there are ~4-5 GB of such lines below
__DATA__
#Tags	Error_Rate_In_ASCII
AATACGGCCACCCCCCCCCCCCCCGCCCCTCCCC	INILILFIIIIQNQQNQNLLKFKNCDHA?DAHHH
CTTTCCCTCCACGACGCTCTTCCGCTCTCATGAT	QQIQQQQQIQQQIQQLQNQNOPNKIHHHAHHAAA
TCCACTCTTTCCCTACACGACGCTCTTCCGATCT	QFQFQQQQQQQQQQQQIQLFNNPONHHHHHDHHH
TCCCCTCTTTCCCTACACGACGCTCTTCCGATCT	UIUIUUUUUUUUUULUUUIOUKUNULLLLKKLLK
TGATACGGCGACCACCGAGATCATCACACTTTCC	UUUUUUUUUUUUUQUUTUUUUULLUKRHPKIHHO
TGATACGGCGACCACCGAGATCTACACTCTTTCC	UOIUIUUUUUUUUIUUUOUOUUUUUKLLLLIKKL
TGATACGGCGACCACCGAGATCTACACTCTTTCC	UUUUMUUUUUUUUIUUIUUQUUUUUOOOOOOOOO
TGATACGGCGACCACCGAGATCTACACTCTTTCC	UUUUUUUUUUUUUUUUUTUUUUUUURRRRRMPPQ
TGATACGGCGACCACCGAGATCTACACTCTTTCC	UUUUUUUUUUUUUUUUUUUUUUTUURRPRRIMQQ
TGATACGGCGACCACCGAGCTCTACACTCTTTCC	UUQUUUUUUUMUUUUUUQUUUUUUUOOOOOIOOO
AATTCTGCGCCCCCCCCACTCAGCCCCCCTCCCC	LFNFQNQNFLQLIQQLIIQNOCIIIAAAAAHHHA
AGATACGGCCACCACCGAGATCTACACTCTCTCC	NFQNIQLFQIFNQNQQFQQNNKKINAHAHH?AHD
TGATACGGCGACCACCGCGATCTACACTCTCTCC	UUUUUUUUUUUUUUUUTLUUQUUCUPRRRRHRNQ
TGATCCGGCGACCCCCGAGCTCTACACTCTTTCC	QQQQIQQIQQQQQNQQQQQLOOKNPHHHHHHDHH
TGCTCCGGCGACCACCGAGATCTACACTCTTTCC	QQIQFQQNQQQQQIQQQLQLNOKIOHHHHHADHH
TGCTCCGGCGACCACCGAGATCTACACTCTTTCC	UIOUOULOUUUUUOUUUOUUUUUUULLKLLIGLL
TGCTCCGGCGACCACCGAGATCTACACTCTTTCC	ULOUIUOUUUUUUUUUULUUUUUUULLLLLIGLL
GTCTCCTGCGACCCCCGAGCTCTACACTCTTTCC	QLLQIQIFQNQQQIQQNQNLOONNOHHHHHHHHH
TTCTCCTTCGACCACCGAGATCTACACTCTTTCC	QLNQIQLIQINQQQQQQLQQOPONOHHHHHHHHH
TTCTCCTTCGACCACCGAGATCTACACTCTTTCC	UOUUIUOIUILUUUUUULUUUUUUULLLLLKLLL


---
neversaint and everlastingly indebted.......

Re: Memory Efficient Alternatives to Hash of Array
by tilly (Archbishop) on Dec 27, 2008 at 06:41 UTC
    You are trying to process 4-5 GB of data in a Perl process that can probably only address 2-3 GB internally. That won't work, which means you want to keep the data on disk. When processing data on disk you should always ask whether sorting helps. In this case sorting puts all of the error rates for a given tag right next to each other. So use the Unix sort utility or Sort::External to sort your data, then process the file in one pass. That pass could have a snippet like this in it:
    my $last_key = "";
    my @last_error_rates;
    while (my $line = <DATA>) {
        my ($key, $error_rate) = split /\s+/, $line;
        if ($key ne $last_key) {
            # We just crossed a key boundary, do processing.
            process_block($last_key, @last_error_rates) if $last_key;
            $last_key = $key;
            @last_error_rates = ();
        }
        push @last_error_rates, $error_rate;
    }
    # Don't forget the final block!
    process_block($last_key, @last_error_rates);
Re: Memory Efficient Alternatives to Hash of Array
by wfsp (Abbot) on Dec 27, 2008 at 09:25 UTC
    Consider storing your HoA on disk.

    DBM::Deep is just the ticket for this type of job (large lookup tables).

      When dealing with a large data set, you need to trade programmer efficiency off against performance. While we in the Perl world are used to valuing programmer efficiency more highly, that stops being true with large datasets.

      Consider 5 GB of data broken into 50-byte lines: that is 100 million lines of data. Suppose we want to store each line into DBM::Deep and later retrieve it, and for the sake of argument let's say that each store or retrieve takes one seek to disk. That's 200 million seeks to disk.

      How long does 200 million seeks to disk take? Well suppose that your disk spins at 6000 rpm. (This is typical.) That means it spins 100 times per second. A disk seek will therefore take between 0 and 0.01 seconds, or 0.005 seconds on average. 200 million seeks therefore takes a million seconds. Which is 11.57 days, or a week and a half.

      Now how long does sorting that data take? Let's assume an absurdly slow disk at 10 MB/s, and suppose we code up a merge sort that needs 30 passes over the data. (Real sorting implementations keep as much as possible in RAM and so need fewer passes.) Each pass needs to read and write 5 GB. We therefore have 300 GB of throughput at 10 MB/s, which takes 30,000 seconds, or a bit over 8 hours. (If your machine really takes this long to sort this much data, you should upgrade to a machine from this millennium.)

      The moral? Hard drives are not like RAM. DBM::Deep and friends are efficient for programmers, but not for performance. If you have existing complex code that needs to scale, consider using them. But it is worth some programmer effort to stay away from them.
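      The two estimates above can be reproduced in a couple of lines (just the arithmetic from the text, using the same assumed figures: 0.005 s per seek, 30 passes, 10 MB/s):

          # Back-of-envelope: random access vs. streaming (figures from the text above).
          my $seek_s = 200_000_000 * 0.005;        # 200M seeks at 0.005 s each
          printf "seeks: %d s = %.1f days\n", $seek_s, $seek_s / 86_400;   # ~11.6 days

          my $sort_s = (30 * 2 * 5 * 1024) / 10;   # 30 passes, read+write 5 GB, at 10 MB/s
          printf "sort:  %d s = %.1f hours\n", $sort_s, $sort_s / 3_600;   # ~8.5 hours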

        That's a pretty detailed explanation.

        But I am not very sure about the conclusion you have stated.

        Based on your comment, it seems sorting (whatever the size of the dataset) is going to take much less time than other methods like DBM::Deep.

        So, what exactly is the demarcating line between when to use 'sorting' and when to use DBM::Deep (for example)?

        Would you mind elaborating on that? Thanks
      As the maintainer of DBM::Deep, I need to echo tilly here. While dbm-deep can allow you to address such large datasets (>4G requires 64-bit Perl just because 32 bits only addresses 4G and change), it is going to be very slow. Just building the 4G file can take a very long time. Sorting a large array (>10k entries, roughly) in dbm-deep will take a very, very long time.

      While one of dbm-deep's use cases is dealing with data that's too large to fit in RAM, you really want that to be for lookups, not for sweeping over large datasets. There are other languages and setups better suited for this, such as Erlang or CouchDB.


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Memory Efficient Alternatives to Hash of Array
by baxy77bax (Chaplain) on Dec 27, 2008 at 11:11 UTC
    Well, I don't know if you are familiar with SQLite (or MySQL). I often get stuck processing large amounts of data; at first I also tried to do it with hashes and arrays, but then I just gave up on that. My advice would be to use one of the freely available databases such as SQLite or MySQL: just import the data, sort it (ORDER BY), and when processing it, retrieve the data line by line and do whatever you want with it.

    This is a memory-efficient, though certainly slower, way to do it.

    SQLite

    an example of how to use it through DBI; it would also be useful to check out DBD::SQLite

    Plus there are tons of concrete examples on PerlMonks!
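    A minimal sketch of that approach (assuming DBD::SQLite is installed; the file name data.txt, the database name, and the table/column names are made up for illustration):

        use strict;
        use warnings;
        use DBI;

        # Connect to (or create) an on-disk SQLite database.
        my $dbh = DBI->connect("dbi:SQLite:dbname=tags.db", "", "",
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do("CREATE TABLE IF NOT EXISTS reads (tag TEXT, error_rate TEXT)");

        # Bulk-load the data file inside one transaction.
        my $ins = $dbh->prepare("INSERT INTO reads (tag, error_rate) VALUES (?, ?)");
        open my $fh, '<', 'data.txt' or die $!;
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^\#/;
            $ins->execute(split /\t/, $line);
        }
        $dbh->commit;

        # Let the database do the grouping: rows come back sorted by tag,
        # so each group streams past in order and never has to fit in RAM.
        my $sth = $dbh->prepare("SELECT tag, error_rate FROM reads ORDER BY tag");
        $sth->execute;
        while (my ($tag, $rate) = $sth->fetchrow_array) {
            # process one line at a time here
        }
        $dbh->disconnect;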

      Databases are effective in this case because they understand the importance of streaming data to/from disk rather than treating disk as RAM. As for why, see Re^2: Memory Efficient Alternatives to Hash of Array. However, it is fairly easy to acquire the same knowledge yourself, which basically comes down to knowing to sort, then process. Armed with that knowledge, a database slows me down slightly for simple tasks. For complex tasks the database saves some logic but may come up with an unworkable query plan. I'll therefore give it a try, but if push comes to shove I'm willing to go outside the database, because I know I can make that work, and sometimes the database simply won't.
Re: Memory Efficient Alternatives to Hash of Array
by BrowserUk (Pope) on Dec 27, 2008 at 12:07 UTC

    Hm. Presumably you've only used <DATA> by way of example, as Perl would die just trying to load the script if it were 4 GB+ in size.

    Next: why are you using an HoA? On the basis of what you've posted, you have one key and one value per key, so wrapping that one value in an array just uses ~50% more memory than needed!

    That is, changing

        push @{ $hold{$elem[0]} }, $elem[1];

    to

        $hold{ $elem[0] } = $elem[1];

    would contain the same information but use 50% less memory to do so.

    But either way, you've still got too much data to hold in memory on a 32-bit machine, and since (on the basis of your scripts to date) the only reason for loading it is to sort it, you'd be far better off sorting the input file externally and processing it line by line.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      FYI, Perl stops processing when it sees __DATA__, so there would be no problem loading a script that is over 4 GB in size. As for the use of a hash of arrays, reading the post I would assume a badly chosen data sample rather than a misunderstanding.

      Update: Good catch, eye. The sample was well chosen.

        ...I would assume a badly chosen data sample...
        Actually, the OP's example has three sets of duplicate tags:
        Lines 6-9:   TGATACGGCGACCACCGAGATCTACACTCTTTCC
        Lines 15-17: TGCTCCGGCGACCACCGAGATCTACACTCTTTCC
        Lines 19-20: TTCTCCTTCGACCACCGAGATCTACACTCTTTCC
        As for the use of a hash of arrays, reading the post I would assume a badly chosen data sample rather than a misunderstanding.

        Given the OP's description of the code: "My code below, tries to group the 'error_rate' (second column of data) based on its corresponding tag (first column of data).", together with the fact that the second column appears to be a byte-wise mask of the first:

        AATACGGCCACCCCCCCCCCCCCCGCCCCTCCCC INILILFIIIIQNQQNQNLLKFKNCDHA?DAHHH

        I don't think it is just badly chosen sample data. Maybe the OP will tell us which is correct?


Re: Memory Efficient Alternatives to Hash of Array
by Jenda (Abbot) on Dec 28, 2008 at 20:42 UTC

    Looks to me like you could pack the data quite a bit. The tags seem to contain just four letters (ACTG), which means you need just two bits per letter; that's 34*2=68 bits per tag, which fits into 9 bytes instead of the original 34. I'm not sure what's allowed in the Error_Rate_In_ASCII, but it looks like there are quite a bit fewer than 256 possible characters in each position, so you could pack these as well.

    This way you can save quite a lot of space and the comparison of the packed strings will also be quicker. Assuming the number of Error_rates for each Tag is not too big, it might also be better to use

    $data{$packed_tag} .= $packed_rate . "\n";
    instead of
    push @{$data{$packed_tag}}, $packed_rate;
    which will also let you use DB_File or some other on-disk hash without the overhead of multilevel ones like DBM::Deep.
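    A quick sketch of the 2-bit tag packing Jenda describes (the %code table, the pack_tag helper name, and the sample tag are just for illustration):

        use strict;
        use warnings;

        # Two bits per base: 34 letters -> 68 bits -> 9 bytes once packed.
        my %code = ( A => '00', C => '01', G => '10', T => '11' );

        sub pack_tag {
            my ($tag) = @_;
            # Build a bit string, then let pack squeeze it into bytes.
            return pack 'B*', join '', map { $code{$_} } split //, $tag;
        }

        my $tag    = 'AATACGGCCACCCCCCCCCCCCCCGCCCCTCCCC';
        my $packed = pack_tag($tag);
        print length($tag), " bytes down to ", length($packed), "\n";  # 34 -> 9

    The packed strings still compare byte-wise in a consistent order, so they remain usable as hash keys or sort keys.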

Re: Memory Efficient Alternatives to Hash of Array
by dragonchild (Archbishop) on Dec 29, 2008 at 03:27 UTC
    While Perl has become the language du jour for bioinformatics, a problem of this size will benefit from something like CouchDB or BigTable. In general, I would look at Erlang: most bio problems are massively parallelizable, and that's one of the things Erlang was specifically designed for.

