PerlMonks  

Re: Filtering matches of near-perfect-matched DNA sequence pairs

by Not_a_Number (Prior)
on Mar 13, 2015 at 21:50 UTC ( [id://1120000]=note )


in reply to Filtering matches of near-perfect-matched DNA sequence pairs

Hi, onlyIDleft.

Your problem seems interesting.

However, you seem to have missed (I find it hard to believe that you have deliberately ignored...) any requests from fellow monks for concrete examples of string pairs (DNA sequences) that either meet or don't meet your requirements.

To help us (and I agree with anonymonk that "This looks like a really fun problem to work on"), please reply to the following:

a. 9 out of 10 in both align to each other perfectly

No prob:

ACGTACGTAC GCGTACGTAC

That's okay, right?


b. 10 out of 10 in both align to each other perfectly

Even easier to understand:

ACGTACGTAC ACGTACGTAC

Perfect match, right?


c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion

I don't understand. Please supply a couple of examples of pairs that meet/don't meet your requirements (with comments, if necessary).


d. 9 out of 9 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons

Ditto


e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons

I think I understand this one, but could you again supply a couple of examples of pairs that meet/don't meet your requirements?


That said, I think that the CPAN module Text::Levenshtein might be what you are looking for. But that could depend on your answers to the above questions...
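
To show what an edit distance would report for the pairs in (a) and (b), here is a minimal sketch using only core Perl. The `edit_distance` sub is a hypothetical helper written for illustration (plain dynamic programming), not the code of Text::Levenshtein itself, and the third pair is my own guess at what an insertion/deletion case as in (c) might look like:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: classic dynamic-programming edit distance.
# Counts the minimum number of substitutions, insertions and deletions.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);            # distances from the empty prefix of $s
    for my $i (1 .. length $s) {
        my @cur = ($i);                     # deleting the first $i chars of $s
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,              # deletion
                $cur[$j - 1] + 1,           # insertion
                $prev[$j - 1] + $cost,      # match or substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print edit_distance('ACGTACGTAC', 'GCGTACGTAC'), "\n";  # pair from (a)
print edit_distance('ACGTACGTAC', 'ACGTACGTAC'), "\n";  # pair from (b)
print edit_distance('ACGTACGTAC', 'ACGTCGTAC'),  "\n";  # assumed indel pair, one base dropped
```

If a threshold such as "distance of at most 1" matches your requirements, the filter is a one-line comparison on that return value. Whether it does match is exactly what the questions above are trying to establish.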


Update: Corrected copy/paste error(s)

Re^2: Filtering matches of near-perfect-matched DNA sequence pairs
by onlyIDleft (Scribe) on Mar 15, 2015 at 02:05 UTC

    Not_a_Number, thank you for your response. Please see the updated info in response to Anonymous_Monk's request for case examples. I hope the examples make it clearer. The "Text-Levenshtein" solution seems like it might work.

      The "Text-Levenshtein" solution seems like it might work

      Whilst Levenshtein will work for your application, it is an exhaustive and thus very slow O(n*m) algorithm, where n and m are the lengths of the two strings being compared.

      Even the XS version is many times slower than the xor method you use in the OP. As such, it is best avoided unless no other short-circuiting method can be found to solve your problem.
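
      For comparison, this is roughly what a string-xor mismatch count looks like. It is a sketch only: I am assuming the xor method means the usual idiom of XORing two equal-length strings and counting the non-NUL bytes, and `mismatches` is a name made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: number of positions at which two equal-length
# strings differ, or undef if the lengths differ (xor cannot align indels).
sub mismatches {
    my ($s, $t) = @_;
    return undef if length $s != length $t;
    # Stringify both operands so ^ is the bytewise string xor, not numeric xor.
    my $x = "$s" ^ "$t";
    # Matching positions xor to "\0"; count every byte that is not "\0".
    return scalar( $x =~ tr/\0//c );
}

print mismatches('ACGTACGTAC', 'GCGTACGTAC'), "\n";  # one substitution
print mismatches('ACGTACGTAC', 'ACGTACGTAC'), "\n";  # identical
```

      The whole comparison is a single pass over the strings, which is why it beats a full O(n*m) distance calculation. The price is that it only sees substitutions; a single inserted or deleted base shifts everything after it and shows up as many mismatches.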

      The good news is that alternatives are nearly always possible. The only thing lacking here is a clear description of your data.
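
      As one illustration of a short-circuiting alternative, here is a sketch that assumes (and it is only an assumption until the data is described) that a pair is acceptable when the two strings differ by at most one substitution, insertion or deletion. The sub name `within_one_edit` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s and $t differ by at most one
# substitution, insertion or deletion. Runs in time linear in the length.
sub within_one_edit {
    my ($s, $t) = @_;
    ($s, $t) = ($t, $s) if length $s > length $t;   # make $s the shorter one
    my ($ls, $lt) = (length $s, length $t);
    return 0 if $lt - $ls > 1;                      # cheapest test first

    # Walk the common prefix.
    my $i = 0;
    $i++ while $i < $ls && substr($s, $i, 1) eq substr($t, $i, 1);
    return 1 if $i == $ls;      # identical, or $t has one extra trailing char

    # Skip the offending position: in both strings for a substitution,
    # only in the longer string for an insertion/deletion.
    my $rest_s = $ls == $lt ? substr($s, $i + 1) : substr($s, $i);
    return $rest_s eq substr($t, $i + 1) ? 1 : 0;
}

for my $pair (
    [ 'ACGTACGTAC', 'GCGTACGTAC' ],   # one substitution
    [ 'ACGTACGTAC', 'ACGTCGTAC'  ],   # one base missing
    [ 'ACGTACGTAC', 'GGGTACGTAC' ],   # two substitutions
) {
    printf "%s %s %s\n", @$pair, within_one_edit(@$pair) ? 'keep' : 'reject';
}
```

      The length test rejects most non-candidates without looking at a single base, and the rest costs one pass plus one string comparison. If the real rules differ, the same shape (cheap rejection first, linear scan second) usually still applies.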

      If you would step back from your jargon and conceptual visualisation of the problem, and answer the multiple, impassioned pleas asking "what does your actual data look like?", then I'm pretty sure you would have multiple efficient, working solutions by now.


