Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
but I did not know whether there were faster ways to accomplish their tasks.

A slightly modified standard string compare, that counts the number of mismatched characters so far and doesn't short-circuit until that number has been exceeded, is O(N) worst case at any given match site; compared to O(N2) best case at any given match site, for any of the distance algorithms.

For a 50/4 in a million that makes it 50 million byte compares worst case and 4 million byte compares best case; versus 2.5 billion bytes compares for every case with Levenshtein.

The bioinformatics crowd tend to use an indexing method. (Look-up BLAST-N, BLAST-X etc.)

If you're looking for 50/4, then there must be at least one exact match of 9 bytes at any successful match site: 9+1+9+1+9+1+9+1+9.

So, if you index all the 9 byte sequences in the haystack; and all the 9 byte sequences in the needle; then lookup all the latter in the former; you can skip huge chunks of the haystack where a match simply could not exist.

This is especially effective when searching each haystack for hundreds or thousands of needles, as the indexing of the haystack gets amortised.

Further, there are on-line front-ends to mainframes/server farms that have many of the commonly searched sequences (Humans, fruit flys; corn; potatoes etc.), pre-indexed, thus further amortising those indexing costs across the searches of hundreds or thousands of individual researchers queries.

There are downsides to BLAST-x; namely, there is typically a fixed size to the exact-match component, often 7 or more; which means there are limits to the ratio of needle size/permitted mismatches that can be searched for. With a minimum of 7; 4*7+3= 31/3 or 5*7+4= 39/4; 6*7+5 = 49/5 etc. Thus, if the mutation site that is being searched for is either shorter than looked for; or contains 1 or more mismatches than is being looked for; some potential sites will simply never be inspected. What you might call: a lossy, fuzzy match.

It's taken me 7 or 8 years to clarify some of those details; and I might still have some of them wrong; but the basic premise is that, given the volumes of data they have to process, very fast, with the possibility of the occasional miss; is far, far preferable to absolute accuracy, if it incurs a substantial slowdown.

Distance algorithms are very, very costly.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

In reply to Re^4: Random shuffling by BrowserUk
in thread Random shuffling by onlyIDleft

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (2)
As of 2024-04-19 22:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found