Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re: counting the number of 16384 pattern matches in a large DNA sequence

by kennethk (Abbot)
on Jun 14, 2012 at 16:14 UTC ( [id://976249]=note: print w/replies, xml ) Need Help??


in reply to counting the number of 16384 pattern matches in a large DNA sequence

First, since you are talking about performance characteristics, appropriate optimization depends on real use cases; this means we'd need not only your code, but some characteristic input. Then, you should use a profiler (like Devel::NYTProf) in order to accurately identify what's taking all your time.

On casual examination of your code, I note that you call study multiple times (once for every value of @string) on your %seq values. I haven't used study myself, but it seems like moving that outside your loop would buy you some time, e.g.

study for values %seq; foreach my $string (@string) { ...

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Replies are listed 'Best First'.
Re^2: counting the number of 16384 pattern matches in a large DNA sequence
by anonym (Acolyte) on Jun 14, 2012 at 16:30 UTC

    Thanks for the reply. One of the inputs is a file of 16384 strings given in below example format:

    AAAAAAA AAAAAAT AAAAAAG AAAAAAC AAAAATA AAAAATT AAAAATG AAAAATC AAAAAGA AAAAAGT AAAAAGG

    Another input file is a multi fasta file having sequences for each chromosome like this:

    >chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

    so i have stored the sequence for each chromosome in a hash with hash key as chromosome number and concatenated sequence as value.The number of lines in the multifasta file is 61913917.Thanks

      AAAAAAA AAAAAAT AAAAAAG AAAAAAC

      Are all your patterns the same length? Are they all upper case? Are you looking for exact (including case) matches only?

      Is it possible to obtain a copy of the patterns file?

      A small extract from a fasta file I have kicking around (HG:chr20):

      ... NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GATCCAgaggtggaagaggaaggaagcttggaaccctatagagttgctga gtgccaggaccagatcctggccctaaacaggtggtaaggaaggagagagt gaaggaactgccaggtgacacactcccaccatggacctctgggatcctag ctttaagagatcccatcacccacatgaacgtttgaattgacagggggagc ...

      index is usually faster for matching constant strings, but if you you want case independent matches, you would need to uc the sequences before searching (and studying).


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        Thanks. Yes, the patterns are of same length and the search is case sensitive.
      Please see How do I post a question effectively? Once you have a discrete test case that you have run through Devel::NYTProf, and as well run test cases with Benchmark for versions both including and excluding study, we can help you understand your results.

      You may also want to check out index, since you the regex engine is more power than you really need.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Is there any other way instead of storing the matches in an array @matches to make the code faster? Thanks

      One of the inputs is a file of 16384 strings given in below example format:

      Given a string of 7 fasta characters, you can form 4^7= 16384 unique strings from T, A, G, C. So, his algorithm is checking every possible 7 char fasta string against the mutifasta file (number of lines in the multifasta file is 61913917).

        In that case, it seems like a fairly simple regex should do it. I'd think one regex would be faster than using index to check 16384 different substrings, but a benchmark would tell for sure. I'm also not sure why he's reading the entire huge file into a hash; it seems like that could be running him into swap, slowing things down severely. Unless I'm missing something, it seems like this would work (as long as it's not necessary to match overlapping matches):

        #!/usr/bin/env perl use Modern::Perl; my %c; while(<DATA>){ chomp; while(/([ACGT]{7})/g){ $c{$1}++; } } say "$_ : $c{$_}" for sort keys %c; __DATA__ NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN NNNAGTCCCACANNNNTAAAAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNN +N NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN

        Aaron B.
        Available for small or large Perl jobs; see my home node.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://976249]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (3)
As of 2024-03-29 01:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found