http://www.perlmonks.org?node_id=530932

srdst13 has asked for the wisdom of the Perl Monks concerning the following question:

I am working on building a hash of the frequencies of all "words" of length n in a genome. The genome contains 3 billion bases (letters in the set ACGT, where A pairs with T and G with C). Using plain Perl hashes, I can do this for word sizes up to 12 with 4 GB of RAM. It seems like it should be possible to do better than this, since each letter carries only 2 bits of information. Can someone enlighten me about how Perl does its hashing, and what might be a better solution (a more memory-efficient hash that doesn't sacrifice too much speed)?
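Something along the lines of the following is what I was imagining for the 2-bit idea, though it is only a rough, untested sketch (the base-to-integer mapping and the vec() layout are just my guess at how it might be done):

    # Sketch only: map each base to 2 bits, so an 11-letter word becomes an
    # integer in 0 .. 4**11 - 1, and counts can live in a flat structure
    # instead of a hash keyed by 11-character strings.
    my %code = ( A => 0, C => 1, G => 2, T => 3 );

    sub word_to_index {
        my ($word) = @_;
        my $index = 0;
        $index = ( $index << 2 ) | $code{$_} for split //, $word;
        return $index;
    }

    # A packed string indexed with vec() holds one 32-bit counter per
    # possible word: 4**11 * 4 bytes = 16 MB for word size 11.
    my $counts = '';
    vec( $counts, word_to_index('CAATGACTGAT'), 32 )++;

Is something like that sensible, or is there a better-known approach?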

An example of what the data look like for word size 11 is given here:

    Word        Count
===========     =====
CAATGACTGAT     1052
AATGACTGATG     1426
ATGACTGATGT     1170
TGACTGATGTC     1105
GACTGATGTCC     781
ACTGATGTCCT     1148
CTGATGTCCTT     1468
TGATGTCCTTC     916
...

Code is here:

sub index_file {
    my %params  = @_;
    my $hashref = exists( $params{hashref} ) ? $params{hashref} : {};
    my $file    = $params{file};
    my $window  = $params{window};

    open( INF, $file ) or die "Cannot open file $file : $!";
    print "Reading file....\n";

    # Concatenate the sequence lines, skipping FASTA header lines (">...").
    my $sequence;
    while ( my $line = <INF> ) {
        chomp($line);
        $sequence .= $line unless ( $line =~ /^>/ );
    }
    close(INF);

    $sequence =~ tr/a-z/A-Z/;    # uppercase everything
    $sequence =~ s/N//g;         # drop ambiguous bases

    print "Calculating....\n";

    # Slide a window of $window bases along the sequence, counting each word.
    # (<= so the final window at the end of the sequence is counted too.)
    for ( my $offset = 0; $offset <= length($sequence) - $window; $offset++ ) {
        print "$offset\n" if ( $offset % 10000000 == 0 );
        $hashref->{ substr( $sequence, $offset, $window ) }++;
    }

    return ($hashref);
}
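A call looks something like this (the file name is just a placeholder):

    my $counts = index_file( file => 'genome.fa', window => 11 );
    for my $word ( sort { $counts->{$b} <=> $counts->{$a} } keys %$counts ) {
        print "$word\t$counts->{$word}\n";
    }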

Thanks,
Sean