Memory Efficient Alternatives to Hash of Array

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
My code below, tries to group the 'error_rate' (second column of data) based on its corresponding tag (first column of data). The grouping is done using Hash Of Array (HoA). Later, I will process each of these groups with some functions (not shown here).

However I found that given a large dataset, HoA is not feasible.
Is there any better memory efficient alternative to hash of array to address the problem equivalently?

use strict;
use Data::Dumper;
use Carp;

my %hold;

while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my @elem = split(/\t/,$_);

    # Keep in Hash of Array
    # HoA here consumes so much memory
    push @{$hold{$elem[0]}}, $elem[1];

}               

# is there a better way to keep/process the array
# other than HoA like %hold ?


foreach my $key ( sort keys %hold  ) {

    my @ary = @{$hold{$key}};

    # then I will process @ary for each key above
    print "$key\n";

}               


# in practice there are ~4-5Gb of such lines below

__DATA__
#Tags                                Error_Rate_In_ASCII
AATACGGCCACCCCCCCCCCCCCCGCCCCTCCCC    INILILFIIIIQNQQNQNLLKFKNCDHA?DAH
+HH
CTTTCCCTCCACGACGCTCTTCCGCTCTCATGAT    QQIQQQQQIQQQIQQLQNQNOPNKIHHHAHHA
+AA
TCCACTCTTTCCCTACACGACGCTCTTCCGATCT    QFQFQQQQQQQQQQQQIQLFNNPONHHHHHDH
+HH
TCCCCTCTTTCCCTACACGACGCTCTTCCGATCT    UIUIUUUUUUUUUULUUUIOUKUNULLLLKKL
+LK
TGATACGGCGACCACCGAGATCATCACACTTTCC    UUUUUUUUUUUUUQUUTUUUUULLUKRHPKIH
+HO
TGATACGGCGACCACCGAGATCTACACTCTTTCC    UOIUIUUUUUUUUIUUUOUOUUUUUKLLLLIK
+KL
TGATACGGCGACCACCGAGATCTACACTCTTTCC    UUUUMUUUUUUUUIUUIUUQUUUUUOOOOOOO
+OO
TGATACGGCGACCACCGAGATCTACACTCTTTCC    UUUUUUUUUUUUUUUUUTUUUUUUURRRRRMP
+PQ
TGATACGGCGACCACCGAGATCTACACTCTTTCC    UUUUUUUUUUUUUUUUUUUUUUTUURRPRRIM
+QQ
TGATACGGCGACCACCGAGCTCTACACTCTTTCC    UUQUUUUUUUMUUUUUUQUUUUUUUOOOOOIO
+OO
AATTCTGCGCCCCCCCCACTCAGCCCCCCTCCCC    LFNFQNQNFLQLIQQLIIQNOCIIIAAAAAHH
+HA
AGATACGGCCACCACCGAGATCTACACTCTCTCC    NFQNIQLFQIFNQNQQFQQNNKKINAHAHH?A
+HD
TGATACGGCGACCACCGCGATCTACACTCTCTCC    UUUUUUUUUUUUUUUUTLUUQUUCUPRRRRHR
+NQ
TGATCCGGCGACCCCCGAGCTCTACACTCTTTCC    QQQQIQQIQQQQQNQQQQQLOOKNPHHHHHHD
+HH
TGCTCCGGCGACCACCGAGATCTACACTCTTTCC    QQIQFQQNQQQQQIQQQLQLNOKIOHHHHHAD
+HH
TGCTCCGGCGACCACCGAGATCTACACTCTTTCC    UIOUOULOUUUUUOUUUOUUUUUUULLKLLIG
+LL
TGCTCCGGCGACCACCGAGATCTACACTCTTTCC    ULOUIUOUUUUUUUUUULUUUUUUULLLLLIG
+LL
GTCTCCTGCGACCCCCGAGCTCTACACTCTTTCC    QLLQIQIFQNQQQIQQNQNLOONNOHHHHHHH
+HH
TTCTCCTTCGACCACCGAGATCTACACTCTTTCC    QLNQIQLIQINQQQQQQLQQOPONOHHHHHHH
+HH
TTCTCCTTCGACCACCGAGATCTACACTCTTTCC    UOUUIUOIUILUUUUUULUUUUUUULLLLLKL
+LL
[download]

---
neversaint and everlastingly indebted.......

Comment on Memory Efficient Alternatives to Hash of Array Download Code

Replies are listed 'Best First'.

Re: Memory Efficient Alternatives to Hash of Array
by tilly (Archbishop) on Dec 27, 2008 at 06:41 UTC

Sort::External

my $last_key = "";
my @last_error_rates;
while (my $line = <DATA>) {
  my ($key, $error_rate) = split /\s+/, $line;

  if ($key ne $last_key) {
    # We just crossed a key boundary, do processing.
    process_block($last_key, @last_error_rates) if $last_key;
    $last_key = $key;
    @last_error_rates = ();
  }

  push @last_error_rates, $error_rate
}

# Don't forget the final block!
process_block($last_key, @last_error_rates);
[download]

[reply]
[d/l]

Re: Memory Efficient Alternatives to Hash of Array
by wfsp (Abbot) on Dec 27, 2008 at 09:25 UTC

DBM::Deep is just the ticket for this type of job (large lookup tables).

[reply]

Re^2: Memory Efficient Alternatives to Hash of Array

by tilly (Archbishop) on Dec 27, 2008 at 14:41 UTC

Consider 5 GB of data broken up into 50 byte lines. So there are 100 million lines of data. Suppose we want to store that and retrieve it into DBM::Deep. For the sake of argument let's say that each store or retrieve takes one seek to disk. So that's 200 million seeks to disk.

How long does 200 million seeks to disk take? Well suppose that your disk spins at 6000 rpm. (This is typical.) That means it spins 100 times per second. A disk seek will therefore take between 0 and 0.01 seconds, or 0.005 seconds on average. 200 million seeks therefore takes a million seconds. Which is 11.57 days, or a week and a half.

Now how long does sorting that data take? Well let's assume an absurdly slow disk - 10 MB/s. (Real sorting algorithms keep a few passes in RAM and so will need fewer passes.) Suppose we code up a merge-sort and need 30 passes to disk. Each pass needs to read and wrote 5 GB. We therefore have 300 GB of throughput at 10 MB/s which will take 30,000 seconds, or a bit over 8 hours. (If your machine really takes this long to sort this much data, you should upgrade to a machine from this millennium.)

The moral? Hard drives are not like RAM. DBM::Deep and friends are efficient for programmers, but not for performance. If you have existing complex code that needs to scale, consider using them. But it is worth some programmer effort to stay away from them.

[reply]

Re^3: Memory Efficient Alternatives to Hash of Array

by matrixmadhan (Beadle) on Dec 27, 2008 at 17:38 UTC

[reply]

Re^4: Memory Efficient Alternatives to Hash of Array

by tilly (Archbishop) on Dec 27, 2008 at 18:24 UTC

Re^2: Memory Efficient Alternatives to Hash of Array

by dragonchild (Archbishop) on Dec 29, 2008 at 03:25 UTC

DBM::Deep

tilly

can

will

While one of dbm-deep's use cases is dealing with data that's too large to fit in RAM, you really want to have that happen with lookups, not when dealing with large datasets. There are other languages and setups better suited for this, such as Erlang or CouchDB.

My criteria for good software:

Does it work?
Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

[reply]

Re: Memory Efficient Alternatives to Hash of Array
by baxy77bax (Deacon) on Dec 27, 2008 at 11:11 UTC

this is memory efficient, but certainly slower way to do it

SQLite

example on how to use it through DBI and it would be usfull to check out DBD::SQLite

plus there are tons of concrete examples on perlmonks !!!

[reply]

Re^2: Memory Efficient Alternatives to Hash of Array

by tilly (Archbishop) on Dec 27, 2008 at 14:49 UTC

Re^2: Memory Efficient Alternatives to Hash of Array

[reply]

Re: Memory Efficient Alternatives to Hash of Array
by BrowserUk (Patriarch) on Dec 27, 2008 at 12:07 UTC

Hm. presumably, you've only used <DATA> by way of example, as Perl would die just trying to load the script if it was 4GB+ in size.

Next. Why are you using a HoAs? On the basis of what you've posted, you have one key and one value per key, so wrapping that one value in an array just uses ~50% more memory than needed!

That is, changing:push @{ $hold{$elem[0]} }, $elem[1];

to $hold{ $elem[0] } = $elem[1]; would contain the same information but use 50% less memory to do so.

But either way, you've still got too much data to hold in memory on a 32-bit machine, and (on the basis of your script(s to date)), as the only reason for loading it is to sort it, you'd be far better off sorting it (the input file) externally and processing it line by line.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

[reply]
[d/l]
[select]

Re^2: Memory Efficient Alternatives to Hash of Array

by tilly (Archbishop) on Dec 27, 2008 at 14:56 UTC

Update: Good catch, eye. The sample ws well chosen.

[reply]

Re^3: Memory Efficient Alternatives to Hash of Array

by eye (Chaplain) on Dec 27, 2008 at 20:19 UTC

...I would assume a badly chosen data sample...

Lines  6 -  9: TGATACGGCGACCACCGAGATCTACACTCTTTCC
Lines 15 - 17: TGCTCCGGCGACCACCGAGATCTACACTCTTTCC
Lines 19 - 20: TTCTCCTTCGACCACCGAGATCTACACTCTTTCC
[download]

[reply]
[d/l]

Re^3: Memory Efficient Alternatives to Hash of Array

by BrowserUk (Patriarch) on Dec 27, 2008 at 20:43 UTC

As for the use of a hash of arrays, reading the post I would assume a badly chosen data sample rather than a misunderstanding.

Given the OPs description of the code: "My code below, tries to group the 'error_rate' (second column of data) based on its corresponding tag (first column of data).", in conjunction with that the second column appears to be a byte-wise mask for the first:

AATACGGCCACCCCCCCCCCCCCCGCCCCTCCCC    
INILILFIIIIQNQQNQNLLKFKNCDHA?DAHHH
[download]

I don't think it is just badly chosen sample data. Maybe the OP will tell us which is correct?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

[reply]
[d/l]

Re^4: Memory Efficient Alternatives to Hash of Array

by tilly (Archbishop) on Dec 27, 2008 at 21:01 UTC

Re^5: Memory Efficient Alternatives to Hash of Array

by neversaint (Deacon) on Dec 28, 2008 at 00:34 UTC

Some notes below your chosen depth have not been shown here

Re: Memory Efficient Alternatives to Hash of Array
by Jenda (Abbot) on Dec 28, 2008 at 20:42 UTC

Looks to me like you could pack the data quite a bit. The tags seem to contain just four letters (ACTG) ... which means you need just two bits per letter, that's 34*2=68 bits per tag which fits into 9 bytes instead of the original 34. Not sure what's allowed in the Error_rate_In_ASCII, but it looks there's quite a bit less than 256 possible characters in each position. So you could pack these as well.

This way you can save quite a lot of space and the comparison of the packed strings will also be quicker. Assuming the number of Error_rates for each Tag is not too big, it might also be better to use

$data{$packed_tag} .= $packed_rate . "\n";
[download]

push @{$data{$packed_tag}}, $packed_rate;
[download]

HTH, Jenda
Support Denmark!
Defend the free world!

[reply]
[d/l]
[select]

Re: Memory Efficient Alternatives to Hash of Array
by dragonchild (Archbishop) on Dec 29, 2008 at 03:27 UTC

My criteria for good software:

Does it work?
Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks