If I understand your updates correctly, you are looking for approximate matching (a.k.a. fuzzy matching). This is much more complicated than just using regular expressions, so I'll just throw out a few pointers here. Fuzzy matching often implies computing the Hamming distance or the Levenshtein edit distance (look up these names), which both basically reflect the number of differences (or mismatches) between two strings.
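To get a concrete feel for what the Levenshtein distance computes, here is a minimal sketch of the classic dynamic-programming algorithm (in Python rather than Perl, purely for illustration):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: O(len(a) * len(b)).
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("GATTACA", "GACTATA"))  # → 2 (two substitutions)
```

Note the nested loop: the full table is computed even when all you wanted to know was whether two strings differ by at most k characters, which is the cost objection raised further down this thread.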

You might also look for the Baeza-Yates and Gonnet algorithm (sometimes also called the Baeza-Yates k-mismatches algorithm). Another well-known algorithm of possible interest is Wu and Manber's k-differences algorithm. You might also take a look at the longest common subsequence problem.

The String::Approx CPAN module might be of some interest.

Finally, I would suggest you take a good look at BioPerl (http://www.bioperl.org/wiki/Main_Page/). I am not using it personally (as I said, I am not a biologist) and can't say much about it, but it is quite big and is likely to have implemented some of the things you are trying to do. The functionality in there is also likely to be optimized for four-letter alphabets such as DNA sequences.

Re^2: Random shuffling
by BrowserUk (Pope) on Jun 21, 2015 at 12:10 UTC

Maybe it is worth pointing out that algorithms like Levenshtein and longest common subsequence, and the modules that implement them (like String::Approx), are entirely useless to genomicists.

They don't care about the "score"; only whether it is there or not. And doing all the O(N²) or worse work that those modules do to calculate their scores, only to derive a simple boolean answer, is just way too costly.

When they are typically doing tens of thousands of (say) 50/4 matches against thousands of sequences, each containing millions of codons, using those types of fuzzy-match algorithms means runtimes measured in months or years.

There are much faster ways of doing fuzzy matching when all you need is a yes/no answer.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Maybe it is worth pointing out that algorithms like Levenshtein and longest common subsequence, and the modules that implement them (like String::Approx), are entirely useless to genomicists. (...)

There are much faster ways of doing fuzzy matching when all you need is a yes/no answer.

Thanks for the information. I frankly have a very, very limited knowledge of what geneticists are doing with their DNA and other molecular sequences. I suspected that these algorithms might be slow for what geneticists are doing (which is why I recommended looking at BioPerl), but I did not know whether there were faster ways to accomplish their tasks.

A slightly modified standard string compare, one that counts the number of mismatched characters so far and only short-circuits once that number has been exceeded, is O(N) worst case at any given match site; compared to O(N²) best case at any given match site for any of the distance algorithms.
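That counting compare might look like the following sketch (Python for illustration; the function names are my own):

```python
def matches_within(needle, haystack, pos, max_mismatches):
    # Compare needle against haystack starting at pos, counting mismatches
    # as we go, and bail out the moment the budget is exceeded.
    # O(len(needle)) worst case per site; usually far less.
    mismatches = 0
    for i, ch in enumerate(needle):
        if haystack[pos + i] != ch:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # short-circuit: no score needed, just yes/no
    return True

def fuzzy_find(needle, haystack, max_mismatches):
    # Slide the counting compare across every candidate site.
    n = len(needle)
    return [p for p in range(len(haystack) - n + 1)
            if matches_within(needle, haystack, p, max_mismatches)]

print(fuzzy_find("ACGT", "AAGTACGTTCGT", 1))  # → [0, 4, 8]
```

Because it answers only "does this site match within k mismatches?", it never pays for the full distance table that Levenshtein-style algorithms build at every site.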

For a 50/4 in a million-codon sequence, that makes it 50 million byte compares worst case and 4 million byte compares best case; versus 2.5 billion byte compares in every case with Levenshtein.

The bioinformatics crowd tends to use an indexing method. (Look up BLAST-N, BLAST-X, etc.)

If you're looking for 50/4, then there must be at least one exact match of 9 bytes at any successful match site; in the worst case, the 4 mismatches split the 50 bytes as 10+1+9+1+9+1+9+1+9.

So, if you index all the 9 byte sequences in the haystack; and all the 9 byte sequences in the needle; then lookup all the latter in the former; you can skip huge chunks of the haystack where a match simply could not exist.
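A toy sketch of that seed-and-filter idea (Python, hypothetical names; real tools like BLAST are far more sophisticated about seed extension and scoring):

```python
from collections import defaultdict

SEED = 9  # guaranteed exact-run length for a 50/4 search, per the pigeonhole argument

def seed_index(seq, k=SEED):
    # Map every k-byte substring (k-mer) of seq to the positions where it occurs.
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def candidate_sites(needle, haystack_index, k=SEED):
    # Look up every k-mer of the needle in the haystack index. Each hit
    # implies one possible alignment of the whole needle, which still has
    # to be verified with a counting compare; every other position in the
    # haystack can be skipped entirely.
    sites = set()
    for j in range(len(needle) - k + 1):
        for pos in haystack_index.get(needle[j:j + k], ()):
            sites.add(pos - j)  # where the needle would have to start
    return sorted(s for s in sites if s >= 0)
```

For example, with `idx = seed_index("AAAACGTTTTT", 4)`, `candidate_sites("ACGTT", idx, 4)` returns `[3]`: only one alignment survives the seed filter, and only that site needs the full compare. Building the index once and reusing it for many needles is what makes the amortisation work.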

This is especially effective when searching each haystack for hundreds or thousands of needles, as the indexing of the haystack gets amortised.

Further, there are on-line front-ends to mainframes/server farms that have many of the commonly searched sequences (humans, fruit flies, corn, potatoes, etc.) pre-indexed, thus further amortising those indexing costs across the queries of hundreds or thousands of individual researchers.

There are downsides to BLAST-x; namely, there is typically a fixed size for the exact-match component, often 7 or more, which means there are limits to the ratio of needle size to permitted mismatches that can be searched for. With a minimum of 7: 4*7+3 = 31/3, 5*7+4 = 39/4, 6*7+5 = 47/5, etc. Thus, if the mutation site being searched for is either shorter than expected, or contains one more mismatch than is being looked for, some potential sites will simply never be inspected. What you might call: a lossy, fuzzy match.

It's taken me 7 or 8 years to clarify some of those details, and I might still have some of them wrong; but the basic premise is that, given the volumes of data they have to process, very fast with the possibility of the occasional miss is far, far preferable to absolute accuracy that incurs a substantial slowdown.

Distance algorithms are very, very costly.


Hi Laurent and BrowserUK - thank you for your replies. I am not trying to match the sequences to the query myself; the software I am using does it and returns the score. And BrowserUK's claim that genomicists have no use for scores is not true, to say the least. While for some tasks it may become necessary, from a practical point of view, to impose a threshold score, that doesn't mean scores do not matter. Quite the contrary... prime case in point: the BLAST tool. OK, I am going off on a tangent, so back to the question at hand:

BrowserUK, I am sure I understand your point about randomness being randomness, so my bone of contention is NOT the number of shuffles, because it is now clear to me that shuffling multiple times is not better, just a waste of time; no argument there.

I am getting more matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that a lot of noise is being picked up by the software and reported as matches even in intact sequences. When a sequence is shuffled, even more noise is generated by the shuffle per se, and since the software does not do a good job of discriminating, it reports a higher number of matches! This is the only conclusion I can arrive at based on my discussion with you folks. The random shuffling of the DNA that I am doing is not the problem; the approach the software uses to separate signal from noise is not effective enough, is what I am thinking...

Unless any of you folks have a different opinion about my conclusion regarding the software I am using, and why it reports more matches despite the random shuffling of the DNA sequences, I consider this matter closed. Thanks to all who participated in this discussion; I am grateful for your inputs, suggestions, thoughts and code. Cheers!