Re^4: Random shuffling

Hi Laurent and BrowserUK - thank you for your replies. I am not trying to match the sequences to the query myself; the software I am using does it and returns the score. And BrowserUK's claim that genomicists have no use for scores is not true, to say the least. While it may be true that for some tasks it becomes necessary, from a practical point of view, to impose a threshold score, that doesn't mean scores do not matter. Quite the contrary... prime case in point: the BLAST tool. OK, I am going off on a tangent, so back to the question at hand:

BrowserUK, I am sure I understand your point about randomness being randomness, so my bone of contention is NOT about the number of shuffles. It is now clear to me that doing it multiple times is not better, just a waste of time; no argument there.

I am getting more matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences. When the sequence is shuffled, more noise is generated by the shuffle per se, and since the software does not do a good job of discrimination, it reports a higher number of matches! This is the only conclusion I can arrive at based on my discussion with you folks. The random shuffling of DNA that I am doing is not the problem; rather, the approach the software uses to separate signal from noise is not effective enough, is what I am thinking....

Unless any of you folks have a different opinion about my conclusion regarding the software I am using, and why it is reporting more matches after random shuffling of the DNA sequence, I consider this matter closed. Thanks to all who participated in this discussion; I am grateful for your inputs, suggestions, thoughts and code. Cheers!


Re^5: Random shuffling
by BrowserUk (Patriarch) on Jun 22, 2015 at 02:14 UTC

And BrowserUK's claim that genomicists have no use for scores is not true, to say the least. While it may be true that for some tasks it becomes necessary, from a practical point of view, to impose a threshold score, that doesn't mean scores do not matter.

Whilst I may have overstated the uselessness of distance scoring, it is at best a secondary selection mechanism.

Only once BLAST has discarded 99% of the possible comparison sites -- i.e. all those where the score would have been less than 0.99 -- do you then perhaps use the scores of the remaining < 1% to guide you in what order to process them, for the greatest likelihood of finding what you are looking for earlier rather than later.

But the point holds: scoring every site using an O(N²) algorithm, only to discard 99% of them, would be a huge cost when there are much faster ways(*) of locating the 1% that are worth considering.
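One well-known way of avoiding a full scoring pass is a pigeonhole filter, sketched here in Python. This is not BLAST's method and not the algorithm mentioned in the footnote; it is a generic illustration, and the names `hamming` and `fuzzy_sites` are made up for the example. The idea: split the query into (allowed mismatches + 1) pieces. Any site that matches with that many mismatches or fewer must contain at least one piece exactly, so only sites seeded by an exact piece hit need to be scored at all.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_sites(text, query, max_mismatches):
    """Start positions where query occurs in text with <= max_mismatches.

    Only candidate sites seeded by an exact hit of one query piece are
    verified; all other sites are never scored.
    """
    m = len(query)
    pieces = max_mismatches + 1
    if m < pieces:
        raise ValueError("query too short for this many mismatches")

    # Cut the query into `pieces` contiguous, non-empty chunks.
    bounds = [m * i // pieces for i in range(pieces + 1)]

    candidates = set()
    for i in range(pieces):
        lo, hi = bounds[i], bounds[i + 1]
        piece = query[lo:hi]
        start = text.find(piece)
        while start != -1:
            site = start - lo  # where the whole query would begin
            if 0 <= site <= len(text) - m:
                candidates.add(site)
            start = text.find(piece, start + 1)

    # Score only the surviving candidates.
    return sorted(s for s in candidates
                  if hamming(text[s:s + m], query) <= max_mismatches)


if __name__ == "__main__":
    print(fuzzy_sites("xxhelloxx", "hallo", 1))
```

How much work this saves depends on how often the pieces occur by chance; with short pieces and many allowed mismatches the candidate set grows and the advantage shrinks.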

I am getting more matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences.

Unless any of you folks have a different opinion about my conclusion

I do.

What you are seeing -- more matches in shuffled DNA rather than fewer -- is exactly what I would expect. Here's why.

Despite the random appearance of real DNA to the human eye, the ordering of DNA is very far from random. Indeed, it has evolved to be very, very specific. Random mutations that either do nothing beneficial or, worse, do something detrimental to the species get lost through natural selection. On the other hand, random mutations that do something useful get replicated and perpetuated.

Thus, when you take specific, targeted subsequences of DNA from one species (or individual) and look for them in the DNA of another species (or individual) they are likely only to be found in that other DNA at one (or a few) places.

If, to allow for some minor differences between species (or individuals) you allow for a small number of mismatches, the number of sites at which you should expect to find a match doesn't change, in real DNA, because the sequence responsible for any particular function (protein expressions?) is still only likely to appear in one (or a few) places; because most of the other DNA has a different purpose and so has evolved to its own, particular, unique pattern.

When you shuffle DNA, you effectively throw away all the millions of years of evolution and natural selection.

Not only do you break up the carefully selected-for patterns; you also construct patterns that have been eliminated through those same natural selection processes over those millions of years. And those are the matches (false hits) that you are finding.
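For anyone who wants to check this on their own data, here is a minimal Python sketch of the comparison. It assumes plain Hamming-distance matching (substitutions only, no gaps), which may not be how the OP's software scores matches, and all function names are invented for the example. It does not prove either outcome; it only counts.

```python
import random


def count_fuzzy_matches(sequence, query, max_mismatches):
    """Count windows of sequence that differ from query in at most
    max_mismatches positions (substitutions only)."""
    m = len(query)
    count = 0
    for i in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[i:i + m], query):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this window
        else:
            count += 1  # inner loop finished without breaking
    return count


def shuffled(sequence, seed):
    """Return a shuffled copy of sequence; the composition is preserved."""
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def compare(sequence, query, max_mismatches, trials=20, seed=0):
    """Match count in the intact sequence, plus one count per shuffle."""
    intact = count_fuzzy_matches(sequence, query, max_mismatches)
    shuffles = [
        count_fuzzy_matches(shuffled(sequence, seed + t), query, max_mismatches)
        for t in range(trials)
    ]
    return intact, shuffles
```

Calling `compare(genome, query, 2)` on a real sequence and looking at the spread of the shuffled counts next to the intact count would show how large the effect is for a particular query.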

To illustrate: let's work with any given 25-base subsequence in a billion-base sequence. There are 4^25 = 1,125,899,906,842,624 possible subsequences, but at most 0.0000888% of those can appear in the sequence, since it contains fewer than a billion 25-base windows. And each of those actually present will mostly be very different from each of the others, because each has a very specific role to play in the species' makeup. I.e. natural selection will ensure that only important sequences appear, and then only where they are needed.

But when you shuffle those billion bases, there is no selection process other than chance, meaning every possibility has an equal probability of being generated. So the eventuality that several or many very similar subsequences will be generated is neither unlikely nor absurd. It is almost guaranteed, because there is no selection process of any kind at play.
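The arithmetic behind those figures can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the units are read as single bases from a 4-letter alphabet (the quoted total equals 4^25), and the "equal probability" model treats every base as independent and equally likely, which a shuffle of real DNA only approximates. The helper `neighbourhood_size` is a name made up for the example.

```python
from math import comb

ALPHABET = 4          # A, C, G, T
K = 25                # subsequence length
N = 10**9             # sequence length

possible = ALPHABET ** K   # number of distinct length-K strings
windows = N - K + 1        # number of length-K windows in the sequence

# Upper bound on the share of all possible K-mers the sequence can contain.
max_fraction = windows / possible


def neighbourhood_size(k, max_mismatches, alphabet=ALPHABET):
    """How many length-k strings lie within max_mismatches substitutions
    of one given length-k string."""
    return sum(comb(k, i) * (alphabet - 1) ** i
               for i in range(max_mismatches + 1))


def expected_chance_hits(max_mismatches):
    """Expected number of windows matching one fixed query by chance
    alone, under the uniform, independent-bases assumption above."""
    return windows * neighbourhood_size(K, max_mismatches) / possible


if __name__ == "__main__":
    print(f"{possible:,}")
    print(f"{max_fraction:.7%}")
    for d in range(0, 9, 2):
        print(d, expected_chance_hits(d))
```

The loop at the end shows how quickly the set of acceptable strings, and with it the expected number of purely accidental hits, grows as more mismatches are allowed.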

NB: That, despite the occasional use of a few very basic genomic terms, is very much the conclusion of a non-genomicist; and you, and everyone else, should read it with that in mind.

But I am quite well versed in the theory and practice of randomised simulations; and the above is very consistent with my expectations in the light of that knowledge.

The nearest, non-genomic analogy I can draw -- to try and convince you that my programmers' instincts and experience have some bearing on the matter -- is that of natural language text.

If you take a chunk of (say) English text and shuffle it; you are unlikely to find many complete, distinct (space delimited) English words in the output.

But look for sequences of letters that spell words even though they are encased in other, meaningless characters, and you will find some. Maybe quite a few.

Now loosen your criteria further and allow for one mismatch and suddenly in this: qwertyuiopasdfghjklzxcvbnm, you can find wert/were|went and rty/ray and iop/bop|cop|fop|hop|mop|pop|sop|top and opas/opal and pas/pad|pal|pam|pan|pap|par|pat|paw|pay and pasd/pass|past|paid and asd/add|aid|and|ass etc. Allow 2 mismatches and you'll find many, many more.
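The keyboard example is easy to reproduce. Here is a small Python sketch, assuming "mismatch" means a substituted letter at the same position (no insertions or deletions); the word list is just a handful of the words named above, and `near_matches` is a made-up helper name.

```python
def near_matches(text, words, max_mismatches=1):
    """Map each window of text to the words it matches with at most
    max_mismatches substituted letters."""
    hits = {}
    for word in words:
        m = len(word)
        for i in range(len(text) - m + 1):
            window = text[i:i + m]
            mismatches = sum(a != b for a, b in zip(window, word))
            if mismatches <= max_mismatches:
                hits.setdefault(window, []).append(word)
    return hits


KEYBOARD = "qwertyuiopasdfghjklzxcvbnm"
WORDS = ["were", "went", "ray", "opal", "pad", "pass", "past", "paid", "and"]

if __name__ == "__main__":
    # Exact matching finds none of these words in the keyboard string.
    print(near_matches(KEYBOARD, WORDS, max_mismatches=0))
    # Allowing a single substitution finds every one of them.
    for window, matched in near_matches(KEYBOARD, WORDS).items():
        print(window, "->", "|".join(matched))
```

Swapping in a full dictionary for `WORDS` and raising `max_mismatches` to 2 shows how fast the number of hits grows.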

The analogy is far from exact; but maybe it will convince you that once you throw away all the rules (natural selection) by randomly shuffling the data, more rather than fewer fuzzy hits is exactly what you should expect.

(*FTR: I have an algorithm that will process an entire sequence in O(N) time and find every possible match site, regardless of the number of mismatches, for subsequences of up to 64 codons. This, from my limited trials of using the NCBI BLAST site, is actually faster than BLAST by a significant margin, and guarantees to find all possible match sites for the given level of fuzziness ... but no one seems interested enough in the idea to work with me to test and validate it.)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

I'm with torvalds on this. Agile (and TDD) debunked. I told'em LLVM was the way to go. But did they listen!
