in reply to Re^3: Random shuffling
in thread Random shuffling
but I did not know whether there were faster ways to accomplish their tasks.
A slightly modified standard string compare, one that counts the number of mismatched characters so far and doesn't short-circuit until that number has been exceeded, is O(N) worst case at any given match site; compare that with O(N²) best case at any given match site for any of the distance algorithms.
For a 50/4 search in a 1-million-byte haystack, that makes it 50 million byte compares worst case and roughly 5 million (the mismatch budget plus one, per site) best case; versus 2.5 billion byte compares in every case with Levenshtein.
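A minimal sketch of that counting compare (illustrative Python, not code from the thread; the names `match_within` and `fuzzy_find` are mine):

```python
def match_within(needle, haystack, pos, max_mismatches):
    """Compare needle against haystack at pos, counting mismatches.

    Short-circuits as soon as the mismatch budget is exceeded, so the
    worst case is len(needle) comparisons per site; the best case is
    max_mismatches + 1.
    """
    mismatches = 0
    for i, ch in enumerate(needle):
        if haystack[pos + i] != ch:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True

def fuzzy_find(needle, haystack, max_mismatches):
    """Scan every site in the haystack; return the matching positions."""
    n = len(needle)
    return [p for p in range(len(haystack) - n + 1)
            if match_within(needle, haystack, p, max_mismatches)]
```

So `fuzzy_find("AAAA", "AAABAAAA", 1)` finds every 4-byte window with at most one mismatch, and each failing site costs at most two extra compares beyond the budget.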
The bioinformatics crowd tend to use an indexing method. (Look up BLAST-N, BLAST-X, etc.)
If you're looking for 50/4, then there must be at least one exact match of 9 bytes at any successful match site: in the worst case the 4 mismatches split the 50 bytes into five exact runs, e.g. 10+1+9+1+9+1+9+1+9 = 50.
So, if you index all the 9-byte sequences in the haystack, and all the 9-byte sequences in the needle, then look up all the latter in the former, you can skip huge chunks of the haystack where a match simply could not exist.
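Here's a rough sketch of that seed-and-skip idea (illustrative Python only; `K = 9` matches the 50/4 example above, and the function names are my own invention). Each shared 9-mer proposes a candidate alignment, which would then be verified with a full mismatch-counting compare:

```python
from collections import defaultdict

K = 9  # seed length; taken from the 50/4 example above

def index_kmers(seq, k=K):
    """Map every k-byte substring of seq to the positions where it occurs."""
    idx = defaultdict(list)
    for i in range(len(seq) - k + 1):
        idx[seq[i:i+k]].append(i)
    return idx

def candidate_sites(needle, haystack_index, k=K):
    """Haystack positions where the needle *could* match.

    A k-mer at needle offset j found at haystack position p implies a
    candidate alignment starting at p - j. Everything else is skipped.
    """
    sites = set()
    for j in range(len(needle) - k + 1):
        for p in haystack_index.get(needle[j:j+k], ()):
            if p - j >= 0:
                sites.add(p - j)
    return sorted(sites)
```

Only the candidate sites then need the expensive per-byte verification; the rest of the haystack is never touched.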
This is especially effective when searching each haystack for hundreds or thousands of needles, as the indexing of the haystack gets amortised.
Further, there are on-line front-ends to mainframes/server farms that have many of the commonly searched sequences (humans, fruit flies, corn, potatoes, etc.) pre-indexed, thus further amortising those indexing costs across the searches of hundreds or thousands of individual researchers' queries.
There are downsides to BLAST-x: namely, there is typically a fixed minimum size for the exact-match component, often 7 or more, which limits the ratio of needle size to permitted mismatches that can be searched for. With a minimum of 7: 4*7+3 = 31/3; 5*7+4 = 39/4; 6*7+5 = 47/5; etc. Thus, if the mutation site being searched for is either shorter than that, or contains one or more mismatches beyond what is being allowed for, some potential sites will simply never be inspected. What you might call: a lossy, fuzzy match.
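That ratio arithmetic is easy to mechanise (a sketch in Python of the rule-of-thumb above, nothing more; `min_needle_len` is my name for it):

```python
def min_needle_len(word_size, mismatches):
    """Shortest needle, by the rule of thumb above, that decomposes into
    (mismatches + 1) exact runs of word_size bytes separated by the
    mismatches: word_size*(mismatches+1) + mismatches.
    """
    return word_size * (mismatches + 1) + mismatches

# With a minimum word size of 7:
#   min_needle_len(7, 3) -> 31  (31/3)
#   min_needle_len(7, 4) -> 39  (39/4)
#   min_needle_len(7, 5) -> 47  (47/5)
```

Anything shorter than that for a given mismatch count risks falling below the seed size, and so may never be inspected.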
It's taken me 7 or 8 years to clarify some of those details, and I might still have some of them wrong; but the basic premise is that, given the volumes of data they have to process, very fast with the possibility of the occasional miss is far, far preferable to absolute accuracy that incurs a substantial slowdown.
Distance algorithms are very, very costly.