PerlMonks

A better (problem-specific) perl hash?

by srdst13 (Pilgrim)
on Feb 17, 2006 at 13:14 UTC

srdst13 has asked for the wisdom of the Perl Monks concerning the following question:

I am working on building a hash of the frequencies of all "words" of length n in a genome. The genome contains 3 billion bases (letters from the set ACTG, where A pairs with T and G with C). Using simple Perl hashes, I can do this for word sizes up to 12 with 4 GB of RAM. It seems possible to do better by exploiting the fact that each letter carries only 2 bits of information. Can someone enlighten me about how Perl does its hashing, and what might be a better solution (a more memory-efficient hash that doesn't sacrifice too much speed)?
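To make the "2 bits per letter" observation concrete, here is a small sketch (my illustration; the encoding values are arbitrary) of how word size translates into the size of a dense counter array:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each base carries 2 bits, so a word of length $len fits in 2*$len bits.
# A dense array of 32-bit counters (one per possible word) then needs
# 4**$len * 4 bytes, independent of genome size.
my %code = ( A => 0, C => 1, G => 2, T => 3 );    # arbitrary 2-bit codes

sub pack_word {    # word string -> integer in 0 .. 4**len - 1
    my $n = 0;
    $n = ( $n << 2 ) | $code{$_} for split //, shift;
    return $n;
}

for my $len ( 10, 11, 12 ) {
    my $mb = 4**$len * 4 / 2**20;
    printf "word size %d: %d slots, %d MB of counters\n", $len, 4**$len, $mb;
}

print pack_word('ACGT'), "\n";    # 27
```

For word size 12 that is 64 MB of counters, versus a multi-gigabyte hash whose overhead comes from storing each key as a full Perl scalar plus bucket structure.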

An example of what the data look like for word size 11 is given here:

    Word        Count
===========     =====
CAATGACTGAT     1052
AATGACTGATG     1426
ATGACTGATGT     1170
TGACTGATGTC     1105
GACTGATGTCC     781
ACTGATGTCCT     1148
CTGATGTCCTT     1468
TGATGTCCTTC     916
...

Code is here:

sub index_file {
    my %params  = @_;
    my $hashref = exists( $params{hashref} ) ? $params{hashref} : {};
    my $file    = $params{file};
    my $window  = $params{window};

    open( INF, $file ) or die "Cannot open file $file : $!";
    print "Reading file....\n";
    my $sequence;
    while ( my $line = <INF> ) {
        chomp($line);
        $sequence .= $line unless ( $line =~ /^>/ );
    }
    close(INF);

    $sequence =~ tr/a-z/A-Z/;
    $sequence =~ s/N//g;

    print "Calculating....\n";
    # <= so the final window at the end of the sequence is also counted
    for ( my $offset = 0 ; $offset <= length($sequence) - $window ; $offset++ ) {
        print "$offset\n" if ( $offset % 10000000 == 0 );
        $hashref->{ substr( $sequence, $offset, $window ) }++;
    }
    return ($hashref);
}

Thanks,
Sean

Replies are listed 'Best First'.
Re: A better (problem-specific) perl hash?
by Corion (Pope) on Feb 17, 2006 at 14:07 UTC

    I'd go with writing a binary file to hold the counts. Every "word" of length n of your sequence can be interpreted as a number in the following way:

    Interpret the letters as digits:

    my %value = ( G => 0, A => 1, T => 2, C => 3, );

    A word @w of length $n then becomes a number by using the following base-4 polynomial:

    $number = $w[$n-1]*(4**($n-1)) + $w[$n-2]*(4**($n-2)) + ... + $w[0]

    or more perlish

    sub word_to_number {
        my (@w) = @_;
        my $res = 0;
        while (@w) {
            $res = $res * 4;
            $res += $value{ pop @w };
        }
        return $res;
    }

    If you limit yourself to a maximum frequency of 2**31 (or 2**32), four bytes will be sufficient to store the count of each word. You then can just seek to the number of the word and increment the count in the file.

    open my $frequencies, '+<', 'frequencies.bin' or die "$!";
    binmode $frequencies;
    ...
    while (words_are_available()) {
        my @word   = split //, get_next_word();
        my $offset = word_to_number(@word);
        seek $frequencies, $offset * 4, 0;
        read $frequencies, my $old_count, 4;
        $old_count = unpack "N", $old_count;
        my $count = $old_count + 1;
        seek $frequencies, $offset * 4, 0;
        print $frequencies pack "N", $count;
    };
      This is likely to be considerably slower than the in-memory hash version. If memory suffices I'd be inclined to do the same thing with Bit::Vector, perhaps via my ancient wrapper, Tie::IntegerArray.

      -sam

      Is there a paper behind that polynomial?
        Anonymous Monk,
        What Corion didn't explicitly state is that this is just base conversion. Just as in base 16, we let the letters A-F represent 10-15, Corion is letting A, C, T, G stand for values. It is then a fairly trivial math process to convert between bases. Corion shows one way but it isn't the only one.
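        For illustration, a sketch of the base-4 conversion in both directions (digit values follow Corion's %value table; this loop is most-significant-digit-first, whereas Corion's pop-based loop puts the first letter in the low digit, and either convention works as long as encoding and decoding agree):

```perl
use strict;
use warnings;

my %value = ( G => 0, A => 1, T => 2, C => 3 );
my @digit = sort { $value{$a} <=> $value{$b} } keys %value;   # 0..3 -> letter

# Forward: read the word as a base-4 numeral, first letter most significant.
sub word_to_num {
    my $n = 0;
    $n = $n * 4 + $value{$_} for split //, shift;
    return $n;
}

# Backward: peel off base-4 digits, least significant first.
sub num_to_word {
    my ( $n, $len ) = @_;
    my $word = '';
    for ( 1 .. $len ) {
        $word = $digit[ $n % 4 ] . $word;
        $n    = int( $n / 4 );
    }
    return $word;
}

my $n = word_to_num('GATC');          # 0*64 + 1*16 + 2*4 + 3 = 27
print "$n\n";                         # 27
print num_to_word( $n, 4 ), "\n";     # GATC
```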

        Cheers - L~R

        Is there a paper behind that polynomial?

        Paper? You mean like a white paper? No, it's too basic for that. It's standard introductory algebra, the sort of thing you have to be able to do in your sleep before you can take any higher math.

Re: A better (problem-specific) perl hash?
by BrowserUk (Pope) on Feb 17, 2006 at 16:11 UTC

    Using the basic idea that packing 2 bits per character together forms a number, each of your 12-character words can be represented by a number in 0 .. 2**24 - 1. Using that number to index into an array of packed integers, each requiring 4 bytes, you arrive at a storage requirement of 2**28 bytes (256 MB) to cover your 12-character words.

    Transforming the words into integers can be done like this (the quickest of four methods I tried):

    use List::Util qw[ reduce ];

    sub xform {
        my $word = shift;
        $word =~ tr[ACGT][\x00\x01\x02\x03];
        reduce { ($a << 2) + $b } 0, unpack 'C*', $word;
    }

    Then you need an implementation of a packed integer array. I tried a tied array, which has the convenience of standard array notation ($index[ xform( $word ) ]++), but it proved to be rather slow. Converting it to an OO representation was about 20% faster:

    package OO::Index;

    sub new {
        my $self;
        open RAM, '>', \$self;
        seek RAM, 2**28, 0;
        print RAM chr(0);
        close RAM;
        return bless \$self, $_[0];
    }

    sub get {
        my( $self, $idx ) = @_;
        return unpack 'V', substr $$self, $idx*4, 4;
    }

    sub set {
        my( $self, $idx, $value ) = @_;
        substr $$self, $idx*4, 4, pack 'V', $value;
        return $value;
    }

    1;

    A crude benchmark over 1e6 random 12-char keys shows this to be about 40% slower than a hash, but requires only 20% of the memory:

    #! perl -sw
    use strict;

    package OO::Index;

    sub new {
        my $self;
        open RAM, '>', \$self;
        seek RAM, 2**28, 0;
        print RAM chr(0);
        close RAM;
        return bless \$self, $_[0];
    }

    sub get {
        my( $self, $idx ) = @_;
        return unpack 'V', substr $$self, $idx*4, 4;
    }

    sub set {
        my( $self, $idx, $value ) = @_;
        substr $$self, $idx*4, 4, pack 'V', $value;
        return $value;
    }

    return 1 if caller;

    package main;

    use Benchmark qw[ cmpthese ];
    use List::Util qw[ reduce ];

    $a = $b;    # silence "used only once" warnings for reduce's $a/$b

    sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }

    sub xform {
        my $word = shift;
        $word =~ tr[ACGT][\x00\x01\x02\x03];
        reduce { ($a << 2) + $b } 0, unpack 'C*', $word;
    }

    cmpthese 5, {
        hash => q[
            my %index;
            $index{ rndStr 12, qw[ A C G T ] }++ for 1 .. 1e6;
        ],
        oo => q[
            my $index = new OO::Index;
            for ( 1 .. 1e6 ) {
                my $i = xform rndStr 12, qw[ A C G T ];
                $index->set( $i, 1 + $index->get( $i ) );
            }
        ],
    };

    __END__
    C:\test>junk
           s/iter    oo  hash
    oo       35.3    --  -58%
    hash     15.0  136%    --

    HTH


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      What he said!

      Convert each gene string to a number.

      Depending on how much data you have, that isn't a hash.

      It's a (sparse) array.

      As long as your string length isn't too long, that gives you fixed-cost storage for as much data as you like (as long as it doesn't overflow the array element size).
        (as long as it doesn't overflow the array element size)

        With 16M possible combinations of 12 bps, and 3 billion (-11 :) bps in all, it averages out to 179 occurrences of each 12-char string (assuming random data). I used a 4-byte integer for each counter, so even if the sequence consisted entirely of one character, there is still plenty of headroom.

        From some previous analysis I did on the Drosophila genome, the numbers of repeated sequences are quite low. In its 160 MB, repeat counts for subsequences of length 8 are in the low thousands, and by the time you get to 12 chars they are rarely above a few hundred, though you do get some 12-char combinations that run to several thousand repeats, particularly those involving T & A for some reason.

        I coded a version of the above in XS that allows you to specify the element size as 2, 4 or 8 bytes, and it detects attempts to assign larger values. I'm having trouble with conversion between __int64s and doubles for returning values to Perl, though.


        that isn't a hash. It's a (sparse) array.

        Actually, it's a Perfect Hash.


Re: A better (problem-specific) perl hash?
by Fletch (Chancellor) on Feb 17, 2006 at 13:49 UTC

    Yes, you could look at using vec to build a bit string representation of the strings. Another thing to look into would be using BerkeleyDB (or the like) to store your results into a DB file on disk rather than keeping the entire thing in memory.
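    A minimal sketch of the vec() half of that suggestion (my illustration; the 2-bit codes are arbitrary): each base becomes a 2-bit field in a bit string, so 3 billion bases would fit in roughly 750 MB rather than 3 GB of one-byte characters:

```perl
use strict;
use warnings;

my %code   = ( A => 0, C => 1, G => 2, T => 3 );   # arbitrary 2-bit codes
my @letter = qw( A C G T );

my $seq    = 'ACGTACGT';
my $packed = '';
my $i      = 0;

# Write each base into the next 2-bit slot of the bit string.
vec( $packed, $i++, 2 ) = $code{$_} for split //, $seq;

# Read the bases back out.
my $decoded = join '', map { $letter[ vec( $packed, $_, 2 ) ] } 0 .. $i - 1;

print "$decoded\n";                  # ACGTACGT
print length($packed), "\n";         # 2 bytes for 8 bases
```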

Re: A better (problem-specific) perl hash?
by grinder (Bishop) on Feb 17, 2006 at 16:17 UTC

    I'm sure that someone must have already had this insight before, but on the basis that you can encode a, c, g, t in two bits, you can encode four of them in a byte. To save time, you just have to create a couple of lookup tables to encode and decode at will.

    For instance, with the following code, the sequence 'ctaacgccctagcgta' encodes to four bytes whose ASCII representation is, surprise, 'perl'. Going the other way, the encoded bytes whose ASCII representation is 'japh' come out as 'cgggcgacctaacgga'.

    For this to work, you need sequences whose lengths are exact multiples of four. Ensuring that that is the case is left as an exercise to the reader.

    #! /usr/bin/perl -w
    use strict;

    my %lookup;
    my %invert;
    my @base  = qw(a c g t);
    my $count = 0;

    for my $x1 (@base) {
        for my $x2 (@base) {
            for my $x3 (@base) {
                for my $x4 (@base) {
                    my $key   = chr($count++);
                    my $chunk = "$x1$x2$x3$x4";
                    $invert{$key}   = $chunk;
                    $lookup{$chunk} = $key;
                }
            }
        }
    }

    for my $seq (@ARGV) {
        print "squeeze   => ", squeeze($seq), "\n";
        print "unsqueeze => ", unsqueeze($seq), "\n";
    }

    sub squeeze {
        my $seq = shift;
        my $out = '';
        $out .= $lookup{lc $1} while ($seq =~ /(....)/g);
        return $out;
    }

    sub unsqueeze {
        my $seq = shift;
        my $out = '';
        $out .= $invert{$_} for split //, $seq;
        return $out;
    }

    I suspect that this is going to be about as fast and efficient as you can get. Just take care when you transfer the encoded data to a UTF-8 system :)
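    As for the padding exercise, a hedged sketch (my own, not grinder's): pad on the right with a fixed base and keep the original length around so the pad can be trimmed after decoding:

```perl
use strict;
use warnings;

# Pad a sequence up to a multiple of four bases with 'a'; the caller is
# assumed to remember the original length so the pad can be stripped later.
sub pad4 {
    my $seq = shift;
    my $rem = length($seq) % 4;
    $seq .= 'a' x ( 4 - $rem ) if $rem;
    return $seq;
}

print pad4('ctaacg'), "\n";                   # ctaacgaa
print length( pad4('ctaacgc') ) % 4, "\n";    # 0
```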

    • another intruder with the mooring in the heart of the Perl

Re: A better (problem-specific) perl hash?
by salva (Canon) on Feb 17, 2006 at 16:35 UTC
    my %map;
    @map{ qw(AA AC AT AG CA CC CT CG TA TC TT TG GA GC GT GG A C T G) }
        = ( 0 .. 9, 'a' .. 'f', 0 .. 3 );

    sub bitkey {
        my $key = shift;
        $key =~ s/(..?)/$map{$1}/g;
        return pack "H*" => $key;
    }
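    A usage sketch (my addition, not salva's): each base pair maps to one hex nibble, so a 12-character word packs down to a 3-byte hash key, cutting key-string storage to a quarter:

```perl
use strict;
use warnings;

my %map;
@map{ qw(AA AC AT AG CA CC CT CG TA TC TT TG GA GC GT GG A C T G) }
    = ( 0 .. 9, 'a' .. 'f', 0 .. 3 );

sub bitkey {
    my $key = shift;
    $key =~ s/(..?)/$map{$1}/g;    # each pair (or trailing single) -> one nibble
    return pack "H*" => $key;
}

my $word = 'CAATGACTGATG';
my $key  = bitkey($word);
printf "%d chars -> %d bytes (%s)\n", length($word), length($key),
    unpack "H*", $key;             # 12 chars -> 3 bytes (42c6cb)
```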
Re: A better (problem-specific) perl hash?
by brian_d_foy (Abbot) on Feb 17, 2006 at 18:37 UTC

    This is one of the problems I was playing around with in my Mastering Perl chapter on tied variables (not that tie is the solution, just that the chapter was on tied variables). You can see the early version of all that stuff on my Mastering Perl website.

    The upcoming issue of The Perl Review then attacks the same problem with bit vectors since in another article Eric Maki used the technique for Sudoku puzzles. If you're really in a bind I'll send you the draft of the article. :)

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
Re: A better (problem-specific) perl hash?
by l3v3l (Monk) on Feb 17, 2006 at 19:42 UTC
    In the vein of Perl one-liners and command-line tools that do "simple" tasks and combine with other one-liners to do more interesting ones, I put the following together to do just this sort of thing. I have not run it on really big files, and I am sure it can be improved, but it works great for FASTA input files of ~300K or so...
    perl -nle 'chomp;tr/a-z/A-Z/;next if /[^ATCG]/;$l=length();$c=0;for $m(5..$l){print substr($_,$c++,5)}' uniBNL_SSR_fasta.txt.input_library.txt | perl -nle '$c{$_}++ for split/\W/;}print map {"$_:$c{$_}\n"} sort{$c{$b}<=>$c{$a}}keys %c;{' -
    output example:
    ...
    AGAGA:5334
    GAGAG:5124
    TCTCT:1938
    AAAAA:1908
    ACACA:1884
    TTTTT:1851
    CTCTC:1798
    ...
    More details: for genome-type data sets I split the bases up into N files and run individual jobs on N nodes of a cluster. If anyone is interested, here is what I use to split the large files into N files (thanks thor - a custom version of what we discussed):
    #!/usr/bin/perl
    #
    # simple fasta input file cleaver
    # usage: cleaver.pl --format OUTPUT_FILE_PREFIX --number X input_file
    # where X is the number of files to split input_file into
    #
    # note: you can optionally pass more than one input file
    # to (re)combine into X files

    use strict;
    use warnings;
    use Getopt::Long;

    my ($number,$format)=(2,"output");
    GetOptions(
        "number=i" => \$number,
        "format=s" => \$format,
    );

    my @output_file;
    foreach my $num (1..$number) {
        my $file = "$format.$num";
        open($output_file[$num-1],">",$file) or die;
    }

    my $file_num = 0;
    while (<>) {
        $/ = '>';
        chomp;
        my $pre = ($. != 2 ? ">" : "");
        print {$output_file[$file_num]} "$pre$_";
        next if $. == 1;
        $file_num += 1;
        $file_num %= $number;
    }
Re: A better (problem-specific) perl hash?
by salva (Canon) on Feb 18, 2006 at 20:23 UTC
    A completely different approach:

    This method makes it possible to find duplicates in the string and doesn't suffer from memory limitations, though it can be much slower than the hash-based solution.

    The trick is to do a block sort of the sequence in chunks of limited size. The sorted indexes of every chunk are saved to disk in separate files and finally merged sequentially to look for duplicates.

    Also, once the block sorting has been done, looking for repeated sequences of different lengths is relatively cheap.

    use strict;
    use warnings;

    my $size = 12;

    use Sys::Mmap;
    use Sort::Key::Merger 'keymerger';

    $| = 1;

    open BIG, '<', 'big.txt'
        or die "unable to open big.txt";
    mmap(my $big, 0, PROT_READ, MAP_SHARED, *BIG)
        or die "unable to mmap big.txt";

    my $chunk = 8 * 1024 * 1024;
    my @files;
    my $n = 0;
    my $lenbig = length $big;

    for (my $i = 0; $i < $lenbig; $i += $chunk) {
        my $top = $i + $chunk;
        $top = $lenbig if $lenbig < $top;
        print STDERR "sorting from $i to $top\n";
        my @s = sort {
            ( substr($big, $a, 30) cmp substr($big, $b, 30)
              or substr($big, $a, 10_000) cmp substr($big, $b, 10_000)
              or substr($big, $a, 16_000_000) cmp substr($big, $b, 16_000_000)
              or substr($big, $a) cmp substr($big, $b) )
        } $i..($top);
        my $file = "small.$i";
        push @files, $file;
        open F, '>', $file or die "unable to open $file";
        print F pack(N => $_) for @s;
        close F;
    }

    my @fh;
    for (@files) {
        open my $fh, '<', $_ or die "unable to open $_";
        push @fh, $fh;
    }

    my $merger = keymerger {
        if (read $_, my $data, 4) {
            my $i = unpack N => $data;
            return (substr($big, $i, $size), $i);
        }
        else {
            return ();
        }
    } @fh;

    my $last  = $merger->();
    my $count = 1;
    while (defined(my $i = $merger->())) {
        # print substr($big, $i, $size), "\n";
        if (substr($big, $i, $size) eq substr($big, $last, $size)) {
            $count++;
        }
        else {
            if ($count > 1) {
                print substr($big, $last, $size), " appears $count times\n";
            }
            $last  = $i;
            $count = 1;
        }
    }
    if ($count > 1) {
        print substr($big, $last, $size), " appears $count times\n";
    }
Re: A better (problem-specific) perl hash?
by larva (Initiate) on Feb 19, 2006 at 14:55 UTC
    This node and the subsequent responses have provided me with much intellectual nutrition... many thanks!

Node Type: perlquestion [id://530932]
Approved by Corion
Front-paged by friedo