in reply to Re^4: Random shuffling
in thread Random shuffling
And BrowserUK's claim that genomicists have no use of score is not true to say the least. While it may be true that for some tasks, it might become necessary from a practical point of view, to impose a threshold score, it doesn't mean scores do not matter.
Whilst I may have overstated the uselessness of distance scores, they are at best a secondary selection mechanism.
Only once BLAST has discarded 99% of the possible comparison sites -- i.e. all those where the score would have been less than 0.99 -- do you then perhaps use the scores of the remaining < 1% to guide you in what order to process them, for the greatest likelihood of finding what you are looking for earlier rather than later.
But the point holds: scoring every site using an O(N^2) algorithm, only to discard 99% of them, would be a huge cost when there are much faster ways(*) of locating the 1% that are worth considering.
I am getting more numbers of matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences.
Unless any of you folks have a different opinion about my conclusion
What you are seeing -- more matches in shuffled DNA rather than less -- is exactly what I would expect. Here's why.
Despite the random appearance of real DNA to the human eye; the ordering of DNA is very far from random. Indeed, it has evolved to be very, very specific. Random mutations that either do nothing beneficial, or worse, do something detrimental to the species, get lost through natural selection. On the other hand, random mutations that do something useful get replicated and perpetuate.
Thus, when you take specific, targeted subsequences of DNA from one species (or individual) and look for them in the DNA of another species (or individual) they are likely only to be found in that other DNA at one (or a few) places.
If, to allow for some minor differences between species (or individuals) you allow for a small number of mismatches, the number of sites at which you should expect to find a match doesn't change, in real DNA, because the sequence responsible for any particular function (protein expressions?) is still only likely to appear in one (or a few) places; because most of the other DNA has a different purpose and so has evolved to its own, particular, unique pattern.
When you shuffle DNA; you effectively throw away all the millions of years evolution and natural selection.
Not only do you break up the carefully selected-for patterns; you also construct patterns that have been eliminated through those same natural selection processes over those millions of years. And those are the matches (false hits) that you are finding.
To illustrate: let's work with any given 25-codon subsequence in a billion-codon sequence. There are 1,125,899,906,842,624 subsequence possibilities, but at most only 0.0000888% of those appear in the sequence. And each of those actually present will mostly be very different from each of the others, because each has a very specific role to play in the species' makeup. I.e. natural selection will ensure that only important sequences appear, and then only where they are needed.
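As a quick sanity check of those two figures, here is a minimal Python sketch of the arithmetic. It assumes a four-letter alphabet (A, C, G, T), which is what the 1,125,899,906,842,624 figure (4^25) implies; the variable names are mine, purely for illustration.

```python
# Assumption: a 4-letter alphabet, which is what the 4**25 figure implies.
ALPHABET_SIZE = 4
SUBSEQ_LEN = 25
SEQ_LEN = 1_000_000_000

# Every possible subsequence of the given length.
possible = ALPHABET_SIZE ** SUBSEQ_LEN

# A sequence has one window per start position, so this is an upper
# bound on how many distinct subsequences can actually be present.
windows = SEQ_LEN - SUBSEQ_LEN + 1
max_fraction_pct = 100 * windows / possible

print(f"{possible:,}")              # 1,125,899,906,842,624
print(f"{max_fraction_pct:.7f}%")   # 0.0000888%
```

The percentage is an upper bound because it counts every window as distinct; repeated subsequences would only lower it.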
But when you shuffle those billion codons; there is no selection process other than chance, meaning every possibility has an equal probability of being generated. So the eventuality that several or many very similar subsequences will be generated is neither unlikely, nor absurd. It is almost guaranteed; because there is no selection process of any kind at play.
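If you want to check this on your own data, one rough approach is to count the fuzzy match sites for the same probe in the intact sequence and in a shuffled copy. The Python sketch below is a naive sliding-window Hamming scan, not BLAST and not the algorithm mentioned in my footnote; the function names and the toy sequence are hypothetical.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def count_fuzzy_sites(sequence, probe, max_mismatches):
    """Count start positions where probe matches within max_mismatches."""
    m = len(probe)
    return sum(
        hamming(sequence[i:i + m], probe) <= max_mismatches
        for i in range(len(sequence) - m + 1)
    )

def shuffled(sequence, seed=0):
    """Shuffled copy: same base composition, original ordering destroyed."""
    bases = list(sequence)
    random.Random(seed).shuffle(bases)
    return "".join(bases)

# Toy data only; substitute a real sequence and probe.
seq = "ACGTACGTTTGACCA"
probe = "ACGT"
for k in (0, 1, 2):
    print(k, count_fuzzy_sites(seq, probe, k),
          count_fuzzy_sites(shuffled(seq), probe, k))
```

A toy string this short proves nothing either way; the comparison only becomes meaningful on long sequences, averaged over many shuffles (different `seed` values).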
But I am quite well versed in the theory and practice of randomised simulations; and the above is very consistent with my expectations in the light of that knowledge.
The nearest, non-genomic analogy I can draw -- to try and convince you that my programmers' instincts and experience have some bearing on the matter -- is that of natural language text.
If you take a chunk of (say) English text and shuffle it; you are unlikely to find many complete, distinct (space delimited) English words in the output.
But look for sequences of letters that spell words even though they are encased in other, meaningless characters, and you will find some. Maybe quite a few.
Now loosen your criteria further and allow for one mismatch and suddenly in this: qwertyuiopasdfghjklzxcvbnm, you can find wert/were|went and rty/ray and iop/bop|cop|fop|hop|mop|pop|sop|top and opas/opal and pas/pad|pal|pam|pan|pap|par|pat|paw|pay and pasd/pass|past|paid and asd/add|aid|and|ass etc. Allow 2 mismatches and you'll find many, many more.
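The one-mismatch search above can be reproduced with a few lines of Python. This is only a sketch: the word list is a tiny hand-picked subset of the words I listed, where a real test would use a proper dictionary, and the function name is hypothetical.

```python
def fuzzy_word_hits(text, words, max_mismatches=1):
    """Map each window of text to the words it matches within max_mismatches."""
    hits = {}
    for word in words:
        m = len(word)
        for i in range(len(text) - m + 1):
            window = text[i:i + m]
            # Hamming distance between the window and the candidate word.
            if sum(a != b for a, b in zip(window, word)) <= max_mismatches:
                hits.setdefault(window, []).append(word)
    return hits

text = "qwertyuiopasdfghjklzxcvbnm"
# Hypothetical mini word list; a real run would load a dictionary file.
words = ["were", "went", "ray", "opal", "pass", "past", "paid"]

for window, matched in fuzzy_word_hits(text, words).items():
    print(window, "->", "|".join(matched))
```

Raising `max_mismatches` to 2 is how you would see the "many, many more" hits for yourself.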
The analogy is far from exact; but maybe it will convince you that once you throw away all the rules (natural selection) by randomly shuffling the data, more rather than fewer fuzzy hits is exactly what you should expect.
(*FTR: I have an algorithm that will process an entire sequence in O(N) time and find every possible match site, regardless of the number of mismatches, for subsequences of up to 64 codons. This, from my limited trials of using the NCBI BLAST site, is actually faster than BLAST by a significant margin, and guarantees to find all possible match sites for the given level of fuzziness ... but no one seems interested enough in the idea to work with me to test and validate it.)