Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

counting the number of 16384 pattern matches in a large DNA sequence

by anonym (Acolyte)
on Jun 14, 2012 at 15:41 UTC ( #976237=perlquestion: print w/ replies, xml ) Need Help??
anonym has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have been working on getting the count of 16384 patterns in a large string(sequence) .However the code is very slow.Is there an efficient way to do this. I have researched on this a bit, but could not find any way.Below is my code :

@string = <IN1>; chomp @string; foreach my $string (@string) { foreach $key (keys %seq) { chomp $string; study $seq{$key}; my @matches = ($seq{$key} =~ /$string/ig); #print "@matches\n"; $result{$string} = scalar(@matches); } foreach my $s (keys %result) { print "$s => $result{$s}\n"; }

Any help would be appreciated.Thanks

Comment on counting the number of 16384 pattern matches in a large DNA sequence
Download Code
Re: counting the number of 16384 pattern matches in a large DNA sequence
by BrowserUk (Pope) on Jun 14, 2012 at 16:13 UTC
    getting the count of 16384 patterns in a large string(sequence)

    You say "16384 patterns", but you do not show us what those patterns look like?

    You say "in a large sequence", but you are processing a hash of sequences. How many sequences are in the hash? How long are they?

    You say "the code is very slow", without saying how slow, or what order of improvement you are seeking.

    You are slowing your code down by chomping inside the inner loop, which will be redundant after the first time.

    You are also studying the sequences -- a slow process, that may or may not be benefitting you -- inside the inner loop; which means that you will study each sequence once for each pattern. There is no benefit to studying a string more than once.

    If you are serious about improving the performance of this code, you will need to supply considerably more information. A few sample patterns and a sample sequence -- are they just ACGT; or do they contains the full range of nucleic acid codes ACGTURYKMSWBDHVNX-; or the full range of amino acid codes: ABCDEFGHIJKLMNOPQRSTUVWYZX*-; ?

    The better teh info you supply, the more likely that you will get an effective solution.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      There is no benefit to studying a string more than once.

      Just as a matter of curiosity, have you ever come across a case in which there was a benefit from studying a string even once?

      Occasionally and informally, I have Benchmark-ed a use of study that I thought might be beneficial per the description in the docs: matching many regexes against a single, unchanging string. I have never seen any benefit. (I must admit I have never tested the specific example given in the perlfunc docs for which a potential "big win" is claimed.) Has regex optimization reached a point at which we can say "... I ain't gonna study [strings] no more"?

        Just as a matter of curiosity, have you ever come across a case in which there was a benefit from studying a string even once?

        Yes. But it was

        • a while ago on much slower hardware than I now run.

          Boyer-Moore worked best in the days before the chip manufacturers had built string search algorithms into microcode; and before the advent of deeply pipelined architectures with branch predictions, when branching had a proportionally significantly greater impact on throughput than with modern hardware.

        • the needle being search for contained a character that was very rare in the haystack being searched.

          This allows the algorithm to trade intelligence for brute force.

          With modern processors that overlap fetches and branching stalls with other parts of the opcodes, the benefits are greatly curtailed.

        I tried to look up a old post of mine that benchmarked study having a beneficial effect, but couldn't find it.

        I also tried to replicate my memory of what I was doing back then, but cannot reproduce it. Whether that's because my memory is bad, or my modern processor negates the benefits I'm not totally sure, but I suspect the latter.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

      As of Perl 5.16.0: "study is now a no-op, presumably fixing all outstanding bugs related to study causing regex matches to behave incorrectly!"

      perldelta: Other Notable Fixes

      It's a shame that it's not mentioned in the POD for study directly, but really not much of a shame to see the function fade away. The few times I attempted to use it, benchmarking showed that I still hadn't found a good application for it.

        but really not much of a shame to see the function fade away

        Agreed. Whilst Boyer-Moore (and some of the others: Aho-Corasick; Rabin-Karp; etc.) can be beneficial for specific tasks and algorithms -- spam filters; virus detection -- they rarely live up to their promise for general purpose work. As such, the complications they add to the runtime aren't worth it for the rare occasions when they are beneficial. Especially on modern hardware with big caches and pipelines.

        For genomic purposes (with its often very small alphabet), it is quite trivial to setup a specialist indexing mechanism that shows far greater benefits.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: counting the number of 16384 pattern matches in a large DNA sequence
by kennethk (Monsignor) on Jun 14, 2012 at 16:14 UTC

    First, since you are talking about performance characteristics, appropriate optimization depends on real use cases; this means we'd need not only your code, but some characteristic input. Then, you should use a profiler (like Devel::NYTProf) in order to accurately identify what's taking all your time.

    On casual examination of your code, I note that you call study multiple times (once for every value of @string) on your %seq values. I haven't used study myself, but it seems like moving that outside your loop would buy you some time, e.g.

    study for values %seq; foreach my $string (@string) { ...

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Thanks for the reply. One of the inputs is a file of 16384 strings given in below example format:

      AAAAAAA AAAAAAT AAAAAAG AAAAAAC AAAAATA AAAAATT AAAAATG AAAAATC AAAAAGA AAAAAGT AAAAAGG

      Another input file is a multi fasta file having sequences for each chromosome like this:

      >chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

      so i have stored the sequence for each chromosome in a hash with hash key as chromosome number and concatenated sequence as value.The number of lines in the multifasta file is 61913917.Thanks

        Please see How do I post a question effectively? Once you have a discrete test case that you have run through Devel::NYTProf, and as well run test cases with Benchmark for versions both including and excluding study, we can help you understand your results.

        You may also want to check out index, since you the regex engine is more power than you really need.


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        AAAAAAA AAAAAAT AAAAAAG AAAAAAC

        Are all your patterns the same length? Are they all upper case? Are you looking for exact (including case) matches only?

        Is it possible to obtain a copy of the patterns file?

        A small extract from a fasta file I have kicking around (HG:chr20):

        ... NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GATCCAgaggtggaagaggaaggaagcttggaaccctatagagttgctga gtgccaggaccagatcctggccctaaacaggtggtaaggaaggagagagt gaaggaactgccaggtgacacactcccaccatggacctctgggatcctag ctttaagagatcccatcacccacatgaacgtttgaattgacagggggagc ...

        index is usually faster for matching constant strings, but if you you want case independent matches, you would need to uc the sequences before searching (and studying).


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

        One of the inputs is a file of 16384 strings given in below example format:

        Given a string of 7 fasta characters, you can form 4^7= 16384 unique strings from T, A, G, C. So, his algorithm is checking every possible 7 char fasta string against the mutifasta file (number of lines in the multifasta file is 61913917).

Re: counting the number of 16384 pattern matches in a large DNA sequence
by CountZero (Bishop) on Jun 14, 2012 at 18:27 UTC
    You could try to match for the 16384 patterns in one go by building one big regex with Regexp::Assemble.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: counting the number of 16384 pattern matches in a large DNA sequence
by jwkrahn (Monsignor) on Jun 14, 2012 at 19:09 UTC
    @string = <IN1>; chomp @string; foreach my $string (@string) { foreach $key (keys %seq) { chomp $string; study $seq{$key}; my @matches = ($seq{$key} =~ /$string/ig); #print "@matches\n"; $result{$string} = scalar(@matches); } foreach my $s (keys %result) { print "$s => $result{$s}\n"; }

    Your current algorithm would be more efficient as:

    @string = <IN1>; chomp @string; while ( my $string = <IN1> ) { chomp $string; my ( $value ) = values %seq; my $result = () = $value =~ /$string/ig; print "$string => $result\n"; }

      Hey, Thanks.I modified the code above as below to iterate over each value in the %seq hash and it works pretty fast :)

      while (my $string = <IN1>) { chomp $string; #my ($value) = values %seq; foreach $value (values %seq) { my $result = () = $value =~ /$string/g; print "$string => $result\n"; } }

        still for 16384 strings it is taking a lot of time to get the count.Hmm

Re: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?)
by BrowserUk (Pope) on Jun 15, 2012 at 00:11 UTC

    Turn the problem on its head and try it this way:

    sub gen; sub gen{ return @_[1..$#_] if $_[0] == 1; map{ my $p=$_; map{ $p . $_ } gen( $_[0]-1, @_[1..$#_] ) } @_[1..$#_] } my %seqs = ...; my @patterns = gen( 7, qw[A C G T] ); my %counts; for my $seq ( values %seqs ) { ++$counts{ substr $seq, $_, 7 } for 0 .. length( $seq )-7; } print "$_ ::= $counts{ $_ }" for @patterns;

    In my experiments on a 49 million base pairs sequence:

    [ 0:15:31.00] C:\test\humanGenome>..\junk999 chr21.fa 16384 patterns. 49092500 base pairs Using custom indexing found 35106546 matches; took 34.536852 seconds Using custom index2 found 35106546 matches; took 31.354438 seconds Simple search found 35106546 matches; took 2970.517883 seconds
    it was close to 100 times faster than your current method. YMMV.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Hey Thanks. Actually when I tried running this, it did not print anything and I did not quite understand what @_1..$#_ does, the subroutine part.Thanks

        Oh sorry. It must be broken.

        Did you replace the ... in

        my %seqs = ...;

        With your code that loads the hash with your sequences?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

        Here is a standalone demonstration that the code I posted works just fine:

        #! perl -slw use strict; use Time::HiRes qw[ time ]; sub gen{ return @_[1..$#_] if $_[0]==1; map{ my $p=$_; map{ $p . $_ } gen($_[0]-1, @_[1..$#_] ) } @_[1..$#_] } our $N //= 7; my %seqs = map { if( length() ) { my( $id, @seq ) = split "\n", $_; $id => join '', @seq; } else { () } } split '>', do{ local $/; uc( <DATA> ) }; my $start = time; my %counts; for my $seq ( values %seqs ) { ++$counts{ substr $seq, $_, $N } for 0 .. length( $seq ) -$N; } print "Took ", time - $start; print "$_ ::= ", $counts{ $_ } // 0 for gen( $N, qw[A C G T] ); __DATA__ > DNA1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA > DNA2 AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACA > DNA3 TTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCC GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCA AAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAA ATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGC AAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAG > DNA4 AGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTC CCCCTCAAACCTAAGAAATATGTCTGATAAAAGAGTTACTTTGATAGAGT AAATAATAGGAGCTTAAACCCCCTTATTTctaggactatgagaatcgaac ccatccctgagaatccaaaattctccgtgccacctatcacaccccatcct aAAGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTG GTTATACCCTTCCCGTACTAATTAATCCCCTGGCCCAACCCGTCATCTAC

        And some output:

        C:\test>976237-2 -N=2 Took 0.000387907028198242 AA ::= 97 AC ::= 86 AG ::= 49 AT ::= 84 CA ::= 99 CC ::= 130 CG ::= 26 CT ::= 71 GA ::= 35 GC ::= 43 GG ::= 27 GT ::= 36 TA ::= 85 TC ::= 68 TG ::= 39 TT ::= 71

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://976237]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (7)
As of 2014-08-27 11:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (237 votes), past polls