### Re^3: Random shuffling

by Laurent_R (Canon)
on Jun 21, 2015 at 14:10 UTC (#1131334)

in reply to Re^2: Random shuffling

Maybe it is worth pointing out that algorithms like Levenshtein and Longest Common Subsequence, and modules that implement them (like String::Approx), are entirely useless to genomists. (...)

There are much faster ways of doing fuzzy matching when all you need is a yes/no answer.

Thanks for the information. I frankly have a very, very limited knowledge of what geneticists are doing with their DNA and other molecular sequences. I suspected that these algorithms might be too slow for what geneticists are doing (which is why I recommended looking at BioPerl), but I did not know whether there were faster ways to accomplish their tasks.

Re^4: Random shuffling
by BrowserUk (Pope) on Jun 21, 2015 at 15:16 UTC
but I did not know whether there were faster ways to accomplish their tasks.

A slightly modified standard string compare -- one that counts the number of mismatched characters so far, and doesn't short-circuit until that number has been exceeded -- is O(N) worst case at any given match site; compared to O(N²) best case at any given match site, for any of the distance algorithms.

For a 50/4 in a million, that makes it 50 million byte compares worst case and 4 million byte compares best case; versus 2.5 billion byte compares for every case with Levenshtein.
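To make the counting compare concrete, here is a minimal Python sketch (the thread contains no code, so the function name and shape are mine):

```python
def count_mismatch_match(haystack, needle, max_mismatches):
    """Slide the needle over the haystack; at each site, count mismatches
    and bail out as soon as the allowance is exceeded, so each site costs
    O(len(needle)) worst case and often far less."""
    n, m = len(haystack), len(needle)
    hits = []
    for site in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if haystack[site + j] != needle[j]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # short-circuit: this site cannot match
        else:
            hits.append(site)  # stayed within the allowance: a fuzzy hit
    return hits

# Tiny example: a 4-byte needle, 1 mismatch allowed
print(count_mismatch_match("ACGTACGTAC", "ACGA", 1))  # → [0, 4]
```

The key point is the early `break`: at a hopeless site the inner loop ends after `max_mismatches + 1` compares, which is where the "best case" figure above comes from.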

The bioinformatics crowd tend to use an indexing method. (Look-up BLAST-N, BLAST-X etc.)

If you're looking for 50/4, then there must be at least one exact match of 9 bytes at any successful match site: split the needle as 9+1+9+1+9+1+9+1+9, and the 4 mismatches can each spoil at most one of the five 9-byte blocks, leaving at least one block intact.

So, if you index all the 9 byte sequences in the haystack; and all the 9 byte sequences in the needle; then lookup all the latter in the former; you can skip huge chunks of the haystack where a match simply could not exist.
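A sketch of that indexing scheme in Python (my own construction, not BLAST itself): build a k-mer-to-positions index over the haystack, then look up every k-mer of the needle to get candidate alignment sites, which a counting compare would then verify.

```python
from collections import defaultdict

def kmer_index(seq, k=9):
    """Map every k-length subsequence of seq to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def candidate_sites(haystack_index, needle, k=9):
    """Look up every k-mer of the needle in the haystack index; each hit
    anchors one candidate match site, to be verified separately."""
    sites = set()
    for j in range(len(needle) - k + 1):
        for pos in haystack_index.get(needle[j:j + k], []):
            sites.add(pos - j)  # align the needle's start with the hit
    return sorted(s for s in sites if s >= 0)

# Toy run with k=4 instead of 9, to keep the strings short
idx = kmer_index("AAACGTACGTCCC", 4)
print(candidate_sites(idx, "ACGTACGT", 4))  # → [2, 6]
```

Every stretch of the haystack that shares no k-mer with the needle is never touched by the verification step; that is the "skip huge chunks" effect.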

This is especially effective when searching each haystack for hundreds or thousands of needles, as the indexing of the haystack gets amortised.

Further, there are on-line front-ends to mainframes/server farms that keep many of the commonly searched sequences (humans, fruit flies, corn, potatoes, etc.) pre-indexed, further amortising those indexing costs across the searches of hundreds or thousands of individual researchers' queries.

There are downsides to BLAST-x; namely, there is typically a fixed minimum size for the exact-match component, often 7 or more, which limits the ratio of needle size to permitted mismatches that can be searched for. With a minimum of 7: 4*7+3 = 31/3; 5*7+4 = 39/4; 6*7+5 = 47/5; etc. Thus, if the mutation site being searched for is either shorter than that, or contains one or more mismatches beyond what is being allowed for, some potential sites will simply never be inspected. What you might call: a lossy, fuzzy match.

It's taken me 7 or 8 years to clarify some of those details, and I might still have some of them wrong; but the basic premise is that, given the volumes of data they have to process, being very fast, with the possibility of the occasional miss, is far, far preferable to absolute accuracy that incurs a substantial slowdown.

Distance algorithms are very, very costly.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Thank you very much, BrowserUk, for this interesting information.
Re^4: Random shuffling
by onlyIDleft (Scribe) on Jun 22, 2015 at 00:19 UTC

Hi Laurent and BrowserUK - thank you for your replies. I am not matching the sequences to the query myself; the software I am using does it and returns the score. And BrowserUK's claim that genomicists have no use for scores is not true, to say the least. While it may be true that for some tasks it becomes necessary, from a practical point of view, to impose a threshold score, that doesn't mean scores do not matter. Quite the contrary; the prime case in point is the BLAST tool. OK, I am going off on a tangent, so back to the question at hand:

BrowserUK, I am sure I understand your point about randomness being randomness, so my bone of contention is NOT about the number of shuffles; it is now clear to me that shuffling multiple times is not better, just a waste of time. No argument there.

I am getting more matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that a lot of noise is being picked up by the software and reported as matches even from intact sequences. When a sequence is shuffled, still more noise is generated by the shuffle itself, and since the software does not discriminate well, it reports a higher number of matches. That is the only conclusion I can arrive at based on my discussion with you folks. The random DNA shuffling I am doing is not the problem; the software's approach to separating signal from noise is just not effective enough, is what I am thinking....

Unless any of you folks have a different opinion about my conclusion regarding the software I am using, and why it reports more matches despite the random shuffling of the DNA sequences, I consider this matter closed. Thanks to all who participated in this discussion; I am grateful for your inputs, suggestions, thoughts and code. Cheers!

And BrowserUK's claim that genomicists have no use of score is not true to say the least. While it may be true that for some tasks, it might become necessary from a practical point of view, to impose a threshold score, it doesn't mean scores do not matter.

Whilst I may have overstated the uselessness of distance scoring, it is at best a secondary selection mechanism.

Only once BLAST has discarded 99% of the possible comparison sites -- ie. all those where the score would have been less than 0.99 -- do you then perhaps use the scores of the remaining < 1% to guide you in what order to process them, for the greatest likelihood of finding what you are looking for earlier rather than later.

But the point holds: scoring every site using an O(N²) algorithm, only to discard 99% of them, would be a huge cost when there are much faster ways(*) of locating the 1% that are worth considering.

I am getting more numbers of matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences.
Unless any of you folks have a different opinion about my conclusion

I do.

What you are seeing -- more matches in shuffled DNA rather than fewer -- is exactly what I would expect. Here's why.

Despite the random appearance of real DNA to the human eye; the ordering of DNA is very far from random. Indeed, it has evolved to be very, very specific. Random mutations that either do nothing beneficial, or worse, do something detrimental to the species, get lost through natural selection. On the other hand, random mutations that do something useful get replicated and perpetuate.

Thus, when you take specific, targeted subsequences of DNA from one species (or individual) and look for them in the DNA of another species (or individual) they are likely only to be found in that other DNA at one (or a few) places.

If, to allow for some minor differences between species (or individuals), you permit a small number of mismatches, the number of sites at which you should expect to find a match doesn't change in real DNA, because the sequence responsible for any particular function (protein expression?) is still only likely to appear in one (or a few) places; most of the other DNA has a different purpose, and so has evolved its own particular, unique pattern.

When you shuffle DNA, you effectively throw away all those millions of years of evolution and natural selection.

Not only do you break up the carefully selected-for patterns; you also construct patterns that were eliminated by those same natural-selection processes over those millions of years. And these are the matches (false hits) that you are finding.

To illustrate: let's work with 25-base subsequences of a billion-base sequence. There are 4^25 = 1,125,899,906,842,624 possible subsequences, but at most only 0.0000888% of those appear in the sequence. And each of those actually present will mostly be very different from each of the others, because each has a very specific role to play in the species' makeup. Ie. natural selection will ensure that only important sequences appear, and only where they are needed.
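The counting behind those figures checks out (assuming the 4-letter base alphabet, since 4^25 = 2^50 matches the number quoted):

```python
# Number of distinct 25-mers over the 4-letter DNA alphabet
possibilities = 4 ** 25
print(possibilities)        # → 1125899906842624  (== 2**50)

# A billion-base sequence contains at most this many distinct 25-mers
# (one starting at each position):
present = 1_000_000_000 - 25 + 1
fraction = present / possibilities
print(f"{fraction:.7%}")    # → 0.0000888%
```

So even before natural selection prunes anything, fewer than one 25-mer in a million million of the possible ones can physically be present in the sequence.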

But when you shuffle those billion codons, there is no selection process other than chance, meaning every possibility has an equal probability of being generated. So the eventuality that several or many very similar subsequences will be generated is neither unlikely nor absurd; it is almost guaranteed, because there is no selection process of any kind at play.

NB: That, despite the occasional use of a few very basic genomic terms, is very much the conclusion of a non-genomist; and you, and everyone else, should read it with that in mind.

But I am quite well versed in the theory and practice of randomised simulations; and the above is very consistent with my expectations in the light of that knowledge.

The nearest, non-genomic analogy I can draw -- to try and convince you that my programmers' instincts and experience have some bearing on the matter -- is that of natural language text.

If you take a chunk of (say) English text and shuffle it; you are unlikely to find many complete, distinct (space delimited) English words in the output.

But look for sequences of letters that spell words, even though they are encased in other, meaningless characters, and you will find some. Maybe quite a few.

Now loosen your criteria further and allow for one mismatch and suddenly in this: qwertyuiopasdfghjklzxcvbnm, you can find wert/were|went and rty/ray and iop/bop|cop|fop|hop|mop|pop|sop|top and opas/opal and pas/pad|pal|pam|pan|pap|par|pat|paw|pay and pasd/pass|past|paid and asd/add|aid|and|ass etc. Allow 2 mismatches and you'll find many, many more.
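That letter-string example can be checked mechanically; here is a small Python sketch (my own) that slides each word over the text, counting Hamming mismatches:

```python
def fuzzy_word_hits(text, words, max_mismatches=1):
    """Find dictionary words hidden in a letter string, allowing a fixed
    number of substitutions (Hamming mismatches) per same-length window."""
    hits = []
    for word in words:
        m = len(word)
        for i in range(len(text) - m + 1):
            window = text[i:i + m]
            if sum(a != b for a, b in zip(window, word)) <= max_mismatches:
                hits.append((i, window, word))
    return hits

text = "qwertyuiopasdfghjklzxcvbnm"
print(fuzzy_word_hits(text, ["were", "bop", "opal"]))
# → [(1, 'wert', 'were'), (7, 'iop', 'bop'), (8, 'opas', 'opal')]
```

Raising `max_mismatches` to 2 multiplies the hit count, which is exactly the effect described above.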

The analogy is far from exact; but maybe it will convince you that once you throw away all the rules (natural selection) by randomly shuffling the data; more rather than less fuzzy hits is exactly what you should expect.
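As a toy demonstration of that claim (my own construction, not from the thread): a deliberately ordered four-letter sequence below contains no window within one substitution of "ACGT", while a seeded shuffle of the very same letters produces fuzzy hits all over it.

```python
import random

def fuzzy_hits(seq, needle, max_mismatches):
    """Count the windows of seq within max_mismatches substitutions of needle."""
    m = len(needle)
    return sum(
        sum(a != b for a, b in zip(seq[i:i + m], needle)) <= max_mismatches
        for i in range(len(seq) - m + 1)
    )

# Highly "ordered" data: long single-letter runs stand in for the
# non-random structure of real DNA.
ordered = "A" * 250 + "C" * 250 + "G" * 250 + "T" * 250
needle = "ACGT"

random.seed(42)                 # fixed seed, so the run is reproducible
letters = list(ordered)
random.shuffle(letters)
shuffled = "".join(letters)

before = fuzzy_hits(ordered, needle, 1)
after = fuzzy_hits(shuffled, needle, 1)
print(before, after)            # the shuffled copy yields far more hits
```

The ordered sequence scores zero because every window is dominated by a single letter; shuffling destroys that structure and manufactures near-matches by pure chance.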

(*FTR: I have an algorithm that will process an entire sequence in O(N) time and find every possible match site, regardless of the number of mismatches, for subsequences of up to 64 codons. From my limited trials against the NCBI BLAST site, this is actually faster than BLAST by a significant margin, and it guarantees to find all possible match sites for the given level of fuzziness ... but no one seems interested enough in the idea to work with me to test and validate it.)

