PerlMonks  

how can I speed up this perl??

by Anonymous Monk
on Nov 24, 2003 at 10:49 UTC (#309487)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have written a script that operates on long sequences (e.g. 3 million characters).

The program is far slower than expected and I need it to be a lot quicker. I used the Benchmark module to determine which part of the code was so slow, and it turned out to be this bit below:

Does anyone know of a way to speed this bit of code up? It is just calculating frequencies of different base pairs in a sequence. Cheers!

if (($genome[$i] eq 'a') && ($genome[$i+1] eq 'a')) { ++$tt; }
elsif (($genome[$i] eq 'a') && ($genome[$i+1] eq 'g')) { ++$ag; }
elsif (($genome[$i] eq 'a') && ($genome[$i+1] eq 'c')) { ++$ac; }
elsif (($genome[$i] eq 'a') && ($genome[$i+1] eq 't')) { ++$at; }
elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 'a')) { ++$ta; }
elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 'g')) { ++$tg; }
elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 'c')) { ++$ga; }
elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 't')) { ++$tt; }
elsif (($genome[$i] eq 'c') && ($genome[$i+1] eq 'a')) { ++$tg; }
elsif (($genome[$i] eq 'c') && ($genome[$i+1] eq 'g')) { ++$cg; }
elsif (($genome[$i] eq 'c') && ($genome[$i+1] eq 'c')) { ++$cc; }
elsif (($genome[$i] eq 'c') && ($genome[$i+1] eq 't')) { ++$ag; }
elsif (($genome[$i] eq 'g') && ($genome[$i+1] eq 'a')) { ++$ga; }
elsif (($genome[$i] eq 'g') && ($genome[$i+1] eq 'g')) { ++$cc; }
elsif (($genome[$i] eq 'g') && ($genome[$i+1] eq 'c')) { ++$gc; }
elsif (($genome[$i] eq 'g') && ($genome[$i+1] eq 't')) { ++$ac; }

Re: how can I speed up this perl??
by Abigail-II (Bishop) on Nov 24, 2003 at 11:00 UTC
    Use a hash.
    $counts {$genome [$i] . $genome [$i + 1]} ++;
    Note: the line above assumes that you are using separate counters for "aa" and "tt", unlike your own code.

    Abigail

      Thanks Abigail-II, but I'm new and don't get how this works. Where do I define each pair, e.g. $counts == aa? How does this counter know what to look for?
        It's a hash. If you encounter "aa", it'll add 1 to its "aa" entry. If you encounter "cg", it'll add 1 to its "cg" entry, etc.

        Abigail
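To make the explanation above concrete, here is a minimal sketch of the hash counter run over a short, made-up sequence. The sample string and the loop bounds are assumptions for illustration; the question does not show the real loop.

```perl
use strict;
use warnings;

# Hypothetical short sequence standing in for the real genome.
my @genome = split //, 'aacgaa';

my %counts;
# Stop one short of the last index so $genome[$i + 1] always exists.
for my $i (0 .. $#genome - 1) {
    # The key is created automatically the first time a pair is seen.
    $counts{ $genome[$i] . $genome[$i + 1] }++;
}

print "$_ => $counts{$_}\n" for sort keys %counts;
```

Nothing has to be declared per pair: whatever two-letter string turns up becomes a key, and its value is the number of times it was seen.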

      According to the original code, six different pairs share a counter with another pair. This is no problem: after counting with the solution provided by Abigail-II, you can combine the counters like this:

      $counts{tt} += $counts{aa};
      $counts{ag} += $counts{ct};
      $counts{ac} += $counts{gt};
      $counts{tg} += $counts{ca};
      $counts{ga} += $counts{tc};
      $counts{cc} += $counts{gg};
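The same folding can also be driven by a lookup table, which keeps the pairing in one place. This is only a sketch: the sample counts are invented, the `%merge_into` name is hypothetical, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical per-pair counts, as the hash-based loop would produce them.
my %counts = (aa => 3, tt => 1, ct => 2, ag => 5, gg => 4);

# Pair => counter it is folded into, copied from the if/elsif chain
# in the question (whether that folding is intended is not known).
my %merge_into = (
    aa => 'tt', ct => 'ag', gt => 'ac',
    ca => 'tg', tc => 'ga', gg => 'cc',
);

for my $from (sort keys %merge_into) {
    my $to = $merge_into{$from};
    # delete returns the removed value; treat a missing pair as zero.
    $counts{$to} += delete($counts{$from}) // 0;
}

print "$_ => $counts{$_}\n" for sort keys %counts;
```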
Re: how can I speed up this perl??
by Roger (Parson) on Nov 24, 2003 at 11:05 UTC
    Ummm, why don't you put the results into a hash instead?
    my %freq;
    ...
    my $token = $genome[$i] . $genome[$i+1];
    $freq{$token}++;
    For example, if your input sequence is 'abbca', while you step through the sequence, you will get -
    $freq{ab}++;
    $freq{bb}++;
    $freq{bc}++;
    $freq{ca}++;
A less extreme change..
by TravelByRoad (Acolyte) on Nov 24, 2003 at 14:28 UTC
    Short of going to a completely different, hash-based implementation, there are incremental improvements you can make to your code. You can factor out the indexing operations. Instead of
    if ( ($genome[$i] eq 'a') && ($genome[$i+1] eq 'a')) { ++$tt; }
    elsif ( ($genome[$i] eq 'a') && ($genome[$i+1] eq 'g')) { ++$ag; }
    elsif ( ($genome[$i] eq 'a') && ($genome[$i+1] eq 'c')) { ++$ac; }
    elsif ( ($genome[$i] eq 'a') && ($genome[$i+1] eq 't')) { ++$at; }
    elsif ( ($genome[$i] eq 't') && ($genome[$i+1] eq 'a')) { ++$ta; }
    ...
    use...
    my $genome  = $genome[$i];
    my $genome1 = $genome[$i+1];
    if ( ($genome eq 'a') && ($genome1 eq 'a')) { ++$tt; }
    elsif ( ($genome eq 'a') && ($genome1 eq 'g')) { ++$ag; }
    elsif ( ($genome eq 'a') && ($genome1 eq 'c')) { ++$ac; }
    elsif ( ($genome eq 'a') && ($genome1 eq 't')) { ++$at; }
    elsif ( ($genome eq 't') && ($genome1 eq 'a')) { ++$ta; }
    ...
    Finding the array elements is done only once, rather than once per comparison. The next improvement would be to concatenate the two elements and halve the number of comparisons. Try...
    my $genomepair = $genome[$i] . $genome[$i+1];
    if ( $genomepair eq 'aa' ) { ++$tt; }
    elsif ( $genomepair eq 'ag') { ++$ag; }
    elsif ( $genomepair eq 'ac') { ++$ac; }
    elsif ( $genomepair eq 'at') { ++$at; }
    elsif ( $genomepair eq 'ta') { ++$ta; }
    ...
      Immediately, what seems to be a bug comes out.
      if ( $genomepair eq 'aa' ) { ++$tt; }

      If it's 'aa', why increment $tt? If it's not a bug, then it needs a comment explaining why it doesn't follow the pattern.

      Continuing with the refactoring, once you have it in the second form listed, you can easily move to a hash.

      my %pair_count;
      my $genomepair = $genome[$i] . $genome[$i+1];
      $pair_count{$genomepair}++;

      ------
      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: how can I speed up this perl??
by pg (Canon) on Nov 24, 2003 at 15:39 UTC

    To use a hash as all other monks have suggested surely makes your code much more beautiful and structured.

    If I speak strictly to the speed problem, the major issue is that you are repeating the [] operation unnecessarily, when you only need it twice. String operations are very expensive.

    You should be able to speed your code up simply by doing this right before your if-else chain:

    my $a = $genome[$i];
    my $b = $genome[$i + 1];

    And in the subsequent code, use only $a and $b, not the [] operation any more.

    This is a direct answer to your speed issue. Don't get me wrong, you should still use a hash as your storage: it not only makes your code more structured, but also removes the unnecessary use of the [] operation.

      Your suggestion will save some, but just some. You've replaced the fetching of a value from walking 4 pointers to walking 2 pointers. That's a peephole optimization; the big gain is to be made by not comparing so much.

      Even if you don't want to use a hash, you can still do:

      if ($genome [$i] eq 'a') {
          if    ($genome [$i + 1] eq 'a') {$aa ++}
          elsif ($genome [$i + 1] eq 'c') {$ac ++}
          elsif ($genome [$i + 1] eq 'g') {$ag ++}
          elsif ($genome [$i + 1] eq 't') {$at ++}
      }
      elsif ($genome [$i] eq 'c') {
          if    ($genome [$i + 1] eq 'a') {$ca ++}
          elsif ($genome [$i + 1] eq 'c') {$cc ++}
          ... etc ...
      which reduces the number of comparisons from max 20 to max 8.

      Abigail

        Yep, obviously the comparison is also a part of the problem. As a matter of fact, both of our posts in this sub-thread should be read only as performance analysis; the actual implementation should still use a hash, which resolves everything in one shot.
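The comparison counts discussed in this sub-thread can be checked with a small simulation. This is a sketch under stated assumptions: the branch order is the one from the question (first base a, t, c, g; second base a, g, c, t), `&&` short-circuits, and the helper names `flat_comparisons` and `nested_comparisons` are made up for the example.

```perl
use strict;
use warnings;

# Branch order as written in the question.
my @first  = qw(a t c g);
my @second = qw(a g c t);

# Flat chain: each branch tests the first base and, only if that
# matches, the second base (&& short-circuits).
sub flat_comparisons {
    my ($x, $y) = @_;
    my $n = 0;
    for my $f (@first) {
        for my $s (@second) {
            $n++;                    # $x eq $f
            next unless $x eq $f;
            $n++;                    # $y eq $s
            return $n if $y eq $s;
        }
    }
    return $n;
}

# Nested chain: find the first base once, then the second base.
sub nested_comparisons {
    my ($x, $y) = @_;
    my $n = 0;
    for my $f (@first) {
        $n++;                        # $x eq $f
        next unless $x eq $f;
        for my $s (@second) {
            $n++;                    # $y eq $s
            return $n if $y eq $s;
        }
    }
    return $n;
}

my ($worst_flat, $worst_nested) = (0, 0);
for my $x (@first) {
    for my $y (@second) {
        my $flat   = flat_comparisons($x, $y);
        my $nested = nested_comparisons($x, $y);
        $worst_flat   = $flat   if $flat   > $worst_flat;
        $worst_nested = $nested if $nested > $worst_nested;
    }
}
print "worst case: flat $worst_flat, nested $worst_nested\n";
```

Under these assumptions the script prints a worst case of 20 comparisons for the flat chain and 8 for the nested one. A hash lookup replaces both with a single keyed access.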

Re: how can I speed up this perl??
by thospel (Hermit) on Nov 24, 2003 at 15:40 UTC
    Do you happen to come from a C background ? In perl it's almost never a good idea to process a string as an array of characters. Building that array takes time and a lot of memory, and the operations you then do on it are rarely natural or fast (ok, a simple character walk is fast).

    So the best way to solve this is to actually have the sequence in a plain string, and walk that string in a perlish way. Here substr() seems a good operator (you can also use a pattern match that selects 2 chars at a time, but that turns out to be slower).

    Making perl fast is also to a great extent a matter of reducing the number of opcodes that get executed. So your multiple if tests want to be replaced by something that takes fewer operations. As pointed out by the other answers, you can use a hash here. While a hash lookup is slightly more work than a simple test, it only has to be done once.

    So the code becomes:

    my %count;
    $count{substr($genome, $_, 2)}++ for 0..length($genome)-2;
    # Next combine the counts like in the original code
    # is this a bug or intentional ?
    $count{tt} += $count{aa};
    $count{ag} += $count{ct};
    $count{ac} += $count{gt};
    $count{tg} += $count{ca};
    $count{ga} += $count{tc};
    $count{cc} += $count{gg};
    This is about 4 times as fast as an array based solution using a hash on my perl.

    If this still isn't fast enough, you can start looking at things like Inline::C.
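As a self-contained illustration of the substr() walk described above, here is a minimal sketch on a short, invented sequence. It leaves out the counter-combining step, since it is unclear whether that is intended.

```perl
use strict;
use warnings;

# Hypothetical short sequence; the real one is a single long string.
my $genome = 'atgcgc';

my %count;
# One substr() and one hash increment per position; no array is built.
$count{ substr($genome, $_, 2) }++ for 0 .. length($genome) - 2;

print join(', ', map { "$_=$count{$_}" } sort keys %count), "\n";
```

For this input the pairs are at, tg, gc, cg, gc, so the script prints `at=1, cg=1, gc=2, tg=1`.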

      Yeah, I also like the idea of using strings instead of arrays. When I read the problem, my first thought was to try something like...
      $genome=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg;
      ...does anyone else tend to (ab)use the replacement operator as sort of a map which works on strings?
        Why do the lookahead? Why not just
        $genome =~ s/(..)/++$count{$1} and undef/eg;

        I use it a lot in perlgolf, but not in real code since it tends to destroy the string (notice that your code does, though you tried to avoid that I think). In real code using a while on a regex in scalar context is slightly faster and not so dangerous. When programming an inner loop and trying to be blazingly fast, you also should avoid things like $1.$2 since constructing a new value is a relatively expensive operation in perl. So the regex variant I'd use is:
        $count{$1}++ while $genome =~ /(?=(..))./g
        But this is still twice as slow as the substr() variant.
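A quick way to convince yourself that the lookahead variant counts the same pairs as the substr() walk, and leaves the string alone, is a sketch like the following. The sample sequence and the variable names are assumptions; it checks correctness only and says nothing about speed.

```perl
use strict;
use warnings;

my $genome = 'atgcgcat';   # hypothetical test sequence
my $before = $genome;

# The lookahead captures two characters but the match consumes only
# one, so the next match starts one position later and pairs overlap.
my %by_regex;
$by_regex{$1}++ while $genome =~ /(?=(..))./g;

# Reference result from the substr() walk.
my %by_substr;
$by_substr{ substr($genome, $_, 2) }++ for 0 .. length($genome) - 2;

my $same = join(',', map { "$_=$by_regex{$_}"  } sort keys %by_regex)
        eq join(',', map { "$_=$by_substr{$_}" } sort keys %by_substr);

print "string untouched: ", ($genome eq $before ? "yes" : "no"), "\n";
print "counts agree: ",     ($same ? "yes" : "no"), "\n";
```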

        You could also use unpack

        print unpack '(A2X)*', 'abcdefghijklmnopqrstuvwxyz';
        ab bc cd de ef fg gh hi ij jk kl lm mn no op pq qr rs st tu uv vw wx xy yz z

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!
        Wanted!

        No, and for good reason - it's very slow compared to a normal solution. That "e" makes Perl evaluate the replacement as an expression on every match, which is pretty sluggish, and should only be used for good reason. Compare:

        use Benchmark;
        use strict;

        our $dummy = 0;
        our %count;

        my $gen = "";
        foreach my $l (qw|a t c g|) {
            $gen .= "$l$_" for qw|a t c g|;
        }
        my @gen = split '', $gen;

        my %tests;
        foreach my $test ( qw|chain subst hash_array hash_string| ) {
            no strict 'refs';
            $tests{$test} = sub { *{$test}{CODE}->($gen,\@gen) }
        }
        timethese(100000, \%tests);

        sub chain {
            my ($gen,$genome) = @_;
            for my $i (0..$#$genome) {
                if (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'a')) { ++$dummy; }
                elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'g')) { ++$dummy; }
                elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'c')) { ++$dummy; }
                elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 't')) { ++$dummy; }
                elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'a')) { ++$dummy; }
                elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'g')) { ++$dummy; }
                elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'c')) { ++$dummy; }
                elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 't')) { ++$dummy; }
                elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'a')) { ++$dummy; }
                elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'g')) { ++$dummy; }
                elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'c')) { ++$dummy; }
                elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 't')) { ++$dummy; }
                elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'a')) { ++$dummy; }
                elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'g')) { ++$dummy; }
                elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'c')) { ++$dummy; }
                elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 't')) { ++$dummy; }
            }
        }

        sub subst {
            my ($gen,$genome) = @_;
            $gen=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg;
        }

        sub hash_array {
            my ($gen,$genome) = @_;
            $count { $genome -> [$_] . $genome -> [$_ + 1] } ++ for (0..$#$genome);
        }

        sub hash_string {
            my ($gen,$genome) = @_;
            $count { substr($gen, $_, 2) }++ for (0..length($gen)-2);
        }

        __DATA__
        Benchmark: timing 100000 iterations of chain, hash_array, hash_string, subst...
              chain: 64 wallclock secs (41.09 usr +  0.00 sys = 41.09 CPU) @  2433.68/s (n=100000)
         hash_array: 14 wallclock secs (11.24 usr +  0.00 sys = 11.24 CPU) @  8896.80/s (n=100000)
        hash_string:  8 wallclock secs ( 6.26 usr +  0.00 sys =  6.26 CPU) @ 15974.44/s (n=100000)
              subst: 17 wallclock secs (16.60 usr +  0.00 sys = 16.60 CPU) @  6024.10/s (n=100000)
        I tend to love to do it! (And I do not consider it abuse... I once wrote a brainf*ck interpreter that was just one big s/.../.../eg. Now who could call that abuse?!? ;-D)

        ------------
        :Wq
        Not an editor command: Wq
      I thought I'd whip up a little test to compare the perl hash based solution to an Inline-C solution. Looks like the C version is about 85x faster...
      Benchmark: timing 20 iterations of hash_string, inline...
      hash_string: 37 wallclock secs (37.08 usr +  0.01 sys = 37.09 CPU) @  0.54/s (n=20)
          inline:  1 wallclock secs ( 0.44 usr +  0.00 sys =  0.44 CPU) @ 45.45/s (n=20)
      #!/usr/bin/perl
      use Inline C;
      use Benchmark;

      my $gen = "atgcgc"x500000;   #3 million characters

      $tests{"inline"}      = sub { string_inline_c($gen, length($gen)) };
      $tests{"hash_string"} = sub { hash_string($gen) };
      timethese(20, \%tests);

      sub hash_string {
          my ($genome) = @_;
          my %count;
          $count{ substr($genome, $_, 2) }++ for (0..length($genome)-2);
      }

      __END__
      __C__

      int string_inline_c(char *genome, int len) {
          int i;
          int hash[96];
          /* The hashing function is simply 4*(first char - 'a') + second char - 'a' */
          /* i.e. the bucket for gg is 4*('g'-'a')+'g'-'a' = 30 */

          /*initialize hash buckets which will get used*/
          /*aa*/     /*ac*/     /*ag*/     /*at*/
          hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0;
          /*ca*/     /*cc*/     /*cg*/     /*ct*/
          hash[ 8] = hash[10] = hash[14] = hash[27] = 0;
          /*ga*/     /*gc*/     /*gg*/     /*gt*/
          hash[24] = hash[26] = hash[30] = hash[43] = 0;
          /*ta*/     /*tc*/     /*tg*/     /*tt*/
          hash[76] = hash[78] = hash[82] = hash[95] = 0;

          for(i=0;i<len-1;i++) {
              hash[4*(genome[i]-'a')+(genome[i+1]-'a')]++;
          }

          /* returning the proper perl hash is left as an */
          /* exercise for the reader */
          /* see also the Inline-C Cookbook */
          return(1);
      }
        Just thought I'd finish off the code by actually returning the hash back to perl...
        #!/usr/bin/perl
        use Inline C;
        use Benchmark;

        my $gen = "atgcgc"x500000;   #3 million characters
        my $h_ref;

        $tests{"inline"}      = sub { $h_ref = string_inline_c($gen, length($gen)) };
        $tests{"hash_string"} = sub { hash_string($gen) };
        timethese(2, \%tests);

        sub hash_string {
            my ($genome) = @_;
            my %count;
            $count{ substr($genome, $_, 2) }++ for (0..length($genome)-2);
        }

        __END__
        __C__

        SV* string_inline_c(char *genome, int len) {
            int i;
            int hash[96];
            HV* perl_hash=newHV();
            /* The hashing function is simply 4*(first char - 'a') + second char - 'a' */
            /* i.e. the bucket for gg is 4*('g'-'a')+'g'-'a' = 30 */

            /*initialize our 'C' hash buckets which will get used*/
            /*aa*/     /*ac*/     /*ag*/     /*at*/
            hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0;
            /*ca*/     /*cc*/     /*cg*/     /*ct*/
            hash[ 8] = hash[10] = hash[14] = hash[27] = 0;
            /*ga*/     /*gc*/     /*gg*/     /*gt*/
            hash[24] = hash[26] = hash[30] = hash[43] = 0;
            /*ta*/     /*tc*/     /*tg*/     /*tt*/
            hash[76] = hash[78] = hash[82] = hash[95] = 0;

            for(i=0;i<len-1;i++) {
                hash[4*(genome[i]-'a')+(genome[i+1]-'a')]++;
            }

            /*move our values over from the 'C' hash to the perl hash*/
            #define h(c,i) (hv_store(perl_hash, (c), sizeof((c))-1, newSViv(hash[(i)]), 0))
            h("aa", 0); h("ac", 2); h("ag", 6); h("at",19);
            h("ca", 8); h("cc",10); h("cg",14); h("ct",27);
            h("ga",24); h("gc",26); h("gg",30); h("gt",43);
            h("ta",76); h("tc",78); h("tg",82); h("tt",95);

            return newRV_noinc((SV*) perl_hash); /*return a ref to a hash*/
        }
Re: how can I speed up this perl??
by Roy Johnson (Monsignor) on Nov 24, 2003 at 15:57 UTC
    Late to the party, again, but one thing nobody has commented on is that your if-structure re-tests the same thing multiple times. Instead of:
    if (($genome[$i] eq 'a') && ($genome[$i+1] eq 'a')) { ++$tt; }
    elsif (($genome[$i] eq 'a') && ($genome[$i+1] eq 'g')) { ++$ag; }
    elsif (($genome[$i] eq 'a') && ($genome[$i+1] eq 'c')) { ++$ac; }
    elsif (($genome[$i] eq 'a') && ($genome[$i+1] eq 't')) { ++$at; }
    elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 'a')) { ++$ta; }
    elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 'g')) { ++$tg; }
    elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 'c')) { ++$ga; }
    elsif (($genome[$i] eq 't') && ($genome[$i+1] eq 't')) { ++$tt; }
    you should have had:
    if ($genome[$i] eq 'a') {
        if ($genome[$i+1] eq 'a') { ++$tt; }
        elsif ($genome[$i+1] eq 'g') { ++$ag; }
        elsif ($genome[$i+1] eq 'c') { ++$ac; }
        elsif ($genome[$i+1] eq 't') { ++$at; }
    }
    elsif ($genome[$i] eq 't') {
        if ($genome[$i+1] eq 'a') { ++$ta; }
        elsif ($genome[$i+1] eq 'g') { ++$tg; }
        elsif ($genome[$i+1] eq 'c') { ++$ga; }
        elsif ($genome[$i+1] eq 't') { ++$tt; }
    }
    etc.

    Of course, the hash is a better solution for this problem, but I thought that the redundant testing problem should be pointed out.

Re: how can I speed up this perl??
by Art_XIV (Hermit) on Nov 24, 2003 at 16:02 UTC

    The following provides a not-very-elegant but executable implementation of the hash-counting that others have been hinting at:

    use strict;
    use Data::Dumper;

    my @simple_pairs = qw(ag at ta cg gc);

    #'mirrored pairs' are pairs that will (eventually)
    #increment a 'base pair' w/ the same index;
    my @base_pairs     = qw(ac tg tt cc ga);
    my @mirrored_pairs = qw(gt ca aa gg tc);

    my %counts;

    #the three arrays will be 'flattened' in the for loop
    $counts{$_} = 0 for (@simple_pairs, @base_pairs, @mirrored_pairs);

    while (<DATA>) {
        chomp;
        my $seq = $_;
        #use if-else to warn of unwanted/bad sequences
        if (exists $counts{$seq}) {
            $counts{$seq}++;
        }
        else {
            warn "Unknown pair: $_\n";
        }
    }

    print "Before consolidating mirrored pairs:\n";
    print Dumper(\%counts);

    for (0..$#base_pairs) {
        $counts{$base_pairs[$_]} += $counts{$mirrored_pairs[$_]};
        delete $counts{$mirrored_pairs[$_]};
    }

    print "\nAfter consolidating mirrored pairs:\n";
    print Dumper(\%counts);

    1;

    __DATA__
    ag
    gc
    qa
    gt
    ca
    cg
    gt
    ca

    This should be quite a bit peppier than the if-else blocks that you were using.

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: how can I speed up this perl??
by TomDLux (Vicar) on Nov 24, 2003 at 20:33 UTC

    With a tiny bit of pre-processing, you can automatically process characters in pairs:

    my $prev = '';
    my %pairs;
    map { $pairs{ $prev . $_}++; $prev = $_; } @genome;
    delete @pairs{qw( a c t g )};

    Each character of the genome is combined in turn with the previous character to form a pair, and the corresponding entry in %pairs is incremented. Then the current character is saved in $prev to be the previous character for the next time around. Of course, for the first character there is no previous character, so there will be a dummy entry in %pairs with one of the keys 'a', 'c', 't', or 'g'. So, once we're all done, delete those four entries ..... so what if we delete three entries that don't exist.

    Yes, map may be a little complicated for beginners to understand, so it may deserve a brief comment---say the previous paragraph---but it's clean and simple, and short code introduces fewer opportunities for mistakes.

    Update: It's poor style to use map just for the side effects, tossing aside the return values. So it might be better to expand that line into an actual loop:

    my $prev = '';
    my %pairs;
    for ( @genome ) {
        $pairs{ $prev . $_}++;
        $prev = $_;
    }
    delete @pairs{qw( a c t g )};
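One possible variation on the loop above, sketched here with invented sample data, seeds the previous character from the first element and starts at the second. That way no one-letter key is ever created and nothing has to be deleted afterwards.

```perl
use strict;
use warnings;

my @genome = split //, 'ttatc';   # hypothetical sample data

my %pairs;
# Seed $prev with the first base so no one-letter key is created.
my $prev = $genome[0];
for my $i (1 .. $#genome) {
    $pairs{ $prev . $genome[$i] }++;
    $prev = $genome[$i];
}

print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

With an empty or one-element array the loop body never runs, so %pairs simply stays empty.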

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: how can I speed up this perl??
by Anonymous Monk on Nov 24, 2003 at 20:46 UTC
    Well, if @genome were a string, then this would work fairly well and not require keeping track of more variables than needed:
    my ( $ac, $ag, $at, $cc, $cg, $ga, $gc, $ta, $tg, $tt ) =
       ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 );
    my %seq = (
        aa => \$tt, ag => \$ag, ac => \$ac, at => \$at,
        ta => \$ta, tg => \$tg, tc => \$ga, tt => \$tt,
        ca => \$tg, cg => \$cg, cc => \$cc, ct => \$ag,
        ga => \$ga, gg => \$cc, gc => \$gc, gt => \$ac
    );
    # do some stuff ...
    ${$seq{$1}}++ while $genome =~ m/(..)/g;
    I don't know how this would compare to the other examples already given, but it's a darn sight better than the original.
      Well, we were not given the loop, but if $i increments by one, that is, it counts overlapping pairs, your solution is not equivalent, as it doesn't count overlapping pairs.

      Abigail
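The difference between overlapping and non-overlapping pairs is easy to see on a tiny string. This is a minimal sketch with a made-up four-base sequence; which behaviour the original loop needs depends on how $i is incremented, which the question does not show.

```perl
use strict;
use warnings;

my $genome = 'aaat';   # hypothetical four-base sequence

# m/(..)/g consumes both characters, so matches do not overlap.
my @disjoint = $genome =~ /(..)/g;

# A lookahead consumes nothing, so every position starts a pair.
my @overlapping = $genome =~ /(?=(..))/g;

print "disjoint:    @disjoint\n";
print "overlapping: @overlapping\n";
```

The first pattern yields `aa at`, the second `aa aa at`, so the two approaches give different counts for the same input.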

Re: how can I speed up this perl??
by Anonymous Monk on Nov 25, 2003 at 00:31 UTC
    is there something in bioperl ( http://bio.perl.org/ ) that you can use?
      Hello All

      Some genome sequences share a counter. I mapped each genome sequence to a reference to the counter it updates, so there is some dereferencing.

      Chris

      #!/usr/bin/perl
      use strict;
      use warnings;

      my @genome = ('a', 'a');
      my $i = 0;

      my @seq = qw/ aa ag ac at ta tg tc tt ca cg cc ct ga gg gc gt /;
      my ($tt, $ag, $ac, $at, $ta, $tg, $ga, $cg, $cc, $gc) = (0) x 10;

      my %counter;
      @counter{@seq} = \($tt, $ag, $ac, $at, $ta, $tg, $ga, $tt,
                         $tg, $cg, $cc, $ag, $ga, $cc, $gc, $ac);

      if (exists $counter{ $genome[$i] . $genome[$i+1] }) {
          ++${ $counter{ $genome[$i] . $genome[$i+1] }}
      }
Re: how can I speed up this perl??
by Stevie-O (Friar) on Nov 25, 2003 at 06:33 UTC
    I have a couple of solutions noticeably faster than any of the ones put forth by people here today. In fact, I find it somewhat surprising that nobody came up with this before.

    I find the same problem cropping up with XML -- XML is hierarchical (looks organized) and is very easily parsed by reasonably advanced pattern recognition systems, such as regexes or more often the human brain. This is why people like putting everything into XML -- they can easily make sense of it. Computers, however, are very lousy (read: slow) at dealing with strings, just like they are with XML.

    When you put your data into a format that computers are good at -- e.g. numbers -- the result is code that executes even faster than the hash case. I timed it with Time::HiRes; the loop AND the initial transliteration together take less time than using a hash.

    # a computer is processing this data.
    # put the data into a form the computer handles better.
    $genome =~ y/atcg/0123/;
    # voila
    for ($i=0;$i<length($genome)-1;$i+=2) {
        $sums[substr($genome,$i,2)]++;
    }
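For readers who want to see what ends up in the array, here is a self-contained sketch of the same idea that also translates the numeric indices back into base pairs. Note two assumptions that differ from the snippet above: it works on a copy of the string, and it steps by one position so that pairs overlap. The sample sequence is invented.

```perl
use strict;
use warnings;

my $genome = 'atcgat';                       # hypothetical sample sequence
(my $digits = $genome) =~ tr/atcg/0123/;     # work on a copy

my @sums;
# Step by 1 to count overlapping pairs; the post above steps by 2.
for my $i (0 .. length($digits) - 2) {
    # A two-digit string such as "12" is used as a number here,
    # so it indexes the array directly.
    $sums[ substr($digits, $i, 2) ]++;
}

# Translate the used indices back into base pairs for reporting.
my %count;
for my $idx (grep { defined $sums[$_] } 0 .. $#sums) {
    (my $pair = sprintf '%02d', $idx) =~ tr/0123/atcg/;
    $count{$pair} = $sums[$idx];
}

print "$_ => $count{$_}\n" for sort keys %count;
```

The array is sparse (only indices made of the digits 0 to 3 are ever used), which is the price paid for skipping the hash lookup.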
