srdst13 has asked for the wisdom of the Perl Monks concerning the following question:
I am working on building a hash of the frequencies of all "words" of length "n" in a genome. The genome contains 3 billion bases (letters in the set ACTG, where A pairs with T and G with C). Using plain Perl hashes, I can do this up to word size 12 with 4 GB of RAM. It seems possible to do better, since each letter carries only 2 bits of information. Can someone enlighten me about how Perl implements its hashes, and what a better solution might look like (a more memory-efficient hash that doesn't sacrifice too much speed)?
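One half-formed idea, untested and only to illustrate what I mean by 2 bits per letter: pack each word into an integer and count in a flat array instead of a hash keyed by strings. The encode_word name below is just something I made up for this sketch, and it assumes the N's have already been stripped out.

    # Untested sketch: encode a word over {A,C,G,T} as an integer,
    # 2 bits per base, so an 11-letter word fits in 22 bits.
    my %code = ( A => 0, C => 1, G => 2, T => 3 );

    sub encode_word {
        my ($word) = @_;
        my $n = 0;
        for my $base ( split //, $word ) {
            $n = ( $n << 2 ) | $code{$base};    # shift in 2 bits per base
        }
        return $n;
    }

    # Counts could then live in a flat array indexed by the encoded word:
    # 4**11 = ~4.2 million slots for word size 11, 4**12 = ~16.8 million for 12.
    my @counts;
    $counts[ encode_word('CAATGACTGAT') ]++;

I have no idea whether a Perl array of that many counters actually wins much over the hash in practice, which is part of what I am asking.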
An example of what the data look like for word size 11 is given here:
    Word          Count
    ===========   =====
    CAATGACTGAT    1052
    AATGACTGATG    1426
    ATGACTGATGT    1170
    TGACTGATGTC    1105
    GACTGATGTCC     781
    ACTGATGTCCT    1148
    CTGATGTCCTT    1468
    TGATGTCCTTC     916
    ...
Code is here:
    sub index_file {
        my %params  = @_;
        my $hashref = exists( $params{hashref} ) ? $params{hashref} : {};
        my $file    = $params{file};
        my $window  = $params{window};

        open( INF, $file ) or die "Cannot open file $file : $!";
        print "Reading file....\n";
        my $sequence;
        while ( my $line = <INF> ) {
            chomp($line);
            # skip FASTA header lines, concatenate the sequence lines
            $sequence .= $line unless ( $line =~ /^>/ );
        }
        close(INF);

        $sequence =~ tr/a-z/A-Z/;    # normalize to upper case
        $sequence =~ s/N//g;         # drop ambiguous bases

        print "Calculating....\n";
        # slide a window of $window bases over the sequence and count each word;
        # <= so the final window is not skipped
        for ( my $offset = 0; $offset <= length($sequence) - $window; $offset++ ) {
            print "$offset\n" if ( $offset % 10_000_000 == 0 );
            $hashref->{ substr( $sequence, $offset, $window ) }++;
        }
        return $hashref;
    }
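In case it helps, this is roughly how I call it (the file name and word size are just examples):

    my $counts = index_file( file => 'genome.fa', window => 11 );

    # print words from most to least frequent
    for my $word ( sort { $counts->{$b} <=> $counts->{$a} } keys %$counts ) {
        print "$word\t$counts->{$word}\n";
    }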
Thanks,
Sean