Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re: Performance Tuning: Searching Long-Sequence Permutations

by thraxil (Prior)
on Jun 26, 2003 at 20:49 UTC ( [id://269411]=note: print w/replies, xml ) Need Help??


in reply to Performance Tuning: Searching Long-Sequence Permutations

is there any way you can first use something like BLAST to cut down the search space? i don't really know enough about your application, but trying to find a way to let BLAST do some of the work would be my first instinct. maybe you can get BLAST to give you a list of the substrings with very rough alignments (which it will do very quickly. 10-50k base-pairs is nothing to BLAST) and then just search through those with your regexps.

anders pearson

  • Comment on Re: Performance Tuning: Searching Long-Sequence Permutations

Replies are listed 'Best First'.
Re: Re: Performance Tuning: Searching Long-Sequence Permutations
by Anonymous Monk on Jun 26, 2003 at 20:54 UTC
    An excellent idea. I would add to it by suggesting that you implement a parser for the resulting BLAST report. The BLAST algorithm extends a maximal scoring segment and will pick up substrings that do not perhaps meet minimum length requirements etc. This is where the parser comes in. Set up bioperl (www.bioperl.org) on your system and use Bio::SearchIO for parsing your report for relevant hits.

      I definitely agree, Bioperl is the best choice for writing parsers for most sequence or alignment formats, such as FASTA, BLAST, ClustalX, etc.

      -Tats
Re: Re: Performance Tuning: Searching Long-Sequence Permutations
by Itatsumaki (Friar) on Jun 27, 2003 at 16:06 UTC

    Sadly I can't use BLAST for this case because BLAST does not account for "fuzzy matching". I'll explain that in a second, but in case anyone isn't familiar with it, let me describe BLAST a bit.

    BLAST = Basic Local Alignment Search Tool. It is a way of solving the general problem of aligning two sequences. There is an optimal solution to the alignment of two sequences (a variation of dynamic programming, termed Smith-Waterman alignment). Unfortunately that solution is slow (O(n2), I believe). BLAST is much faster at aligning a single sequence against a database of sequences. What it does is search each member of the databases for short, exact matches from the query sequence. If it does not find these matches, it does not attempt full dynamic programming on that sequence. In other words, BLAST makes the assumption that two sequences with high similarity will have short subsequences of extremely high similarity. One of the big advantages of BLAST is that the statistics of BLAST searching can be estimated (it approximates an extreme-value distribution for the general case). This allows the assignation of a probability value to a similarity, and biologists like p-values.

    The scoring system of BLAST is based on two things: a penalty for introducing a gap, and a scoring matrix that gives the +ve score for any given match and the -ve score for any given mismatch. This system is not flexible enough to do the kind of searching I am looking at for three reasons:

    1. Most BLAST software cannot handle matching of N's (an N is an ambiguous base -- it could be any of the other four
    2. No BLAST software can handle matching of all other ambiguity codes (e.g. Y for purines, R for pyrimidines, etc.)
    3. BLAST can't handle "variable" gaps -- for instance one couldn't duplicate the search /GC.{10,13}AT/ in a BLAST

    So, unfortunately this is a problem out of the scope of sequence alignments. Also, sequence alignment tools generally work well with large sequences, not with 10-20 letter ones. Some types of BLAST will actually be unable to perform alignments with sequences below 7 or 8 letters.

    Sorry for the long explanation, but it's a really good idea and I wanted to explain why it wouldn't work for my situation

    -Tats

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://269411]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-25 13:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found