An excellent idea. I would add to it by suggesting that you implement a parser for the resulting BLAST report. The BLAST algorithm extends a maximal scoring segment and will pick up substrings that do not perhaps meet minimum length requirements etc. This is where the parser comes in. Set up bioperl (www.bioperl.org) on your system and use Bio::SearchIO for parsing your report for relevant hits. | [reply] |
I definitely agree, Bioperl is the best choice for writing parsers for most sequence or alignment formats, such as FASTA, BLAST, ClustalX, etc.
-Tats
| [reply] |
Sadly I can't use BLAST for this case because BLAST does not account for "fuzzy matching". I'll explain that in a second, but in case anyone isn't familiar with it, let me describe BLAST a bit.
BLAST = Basic Local Alignment Search Tool. It is a way of solving the general problem of aligning two sequences. There is an optimal solution to the alignment of two sequences (a variation of dynamic programming, termed Smith-Waterman alignment). Unfortunately that solution is slow (O(n2), I believe). BLAST is much faster at aligning a single sequence against a database of sequences. What it does is search each member of the databases for short, exact matches from the query sequence. If it does not find these matches, it does not attempt full dynamic programming on that sequence. In other words, BLAST makes the assumption that two sequences with high similarity will have short subsequences of extremely high similarity. One of the big advantages of BLAST is that the statistics of BLAST searching can be estimated (it approximates an extreme-value distribution for the general case). This allows the assignation of a probability value to a similarity, and biologists like p-values.
The scoring system of BLAST is based on two things: a penalty for introducing a gap, and a scoring matrix that gives the +ve score for any given match and the -ve score for any given mismatch. This system is not flexible enough to do the kind of searching I am looking at for three reasons:
- Most BLAST software cannot handle matching of N's (an N is an ambiguous base -- it could be any of the other four
- No BLAST software can handle matching of all other ambiguity codes (e.g. Y for purines, R for pyrimidines, etc.)
- BLAST can't handle "variable" gaps -- for instance one couldn't duplicate the search /GC.{10,13}AT/ in a BLAST
So, unfortunately this is a problem out of the scope of sequence alignments. Also, sequence alignment tools generally work well with large sequences, not with 10-20 letter ones. Some types of BLAST will actually be unable to perform alignments with sequences below 7 or 8 letters.
Sorry for the long explanation, but it's a really good idea and I wanted to explain why it wouldn't work for my situation
-Tats
| [reply] |