Re: Filtering matches of near-perfect-matched DNA sequence pairs

Your problem seems interesting.

However, you seem to have missed (I find it hard to believe that you have deliberately ignored...) the requests from fellow monks for concrete examples of string pairs (DNA sequences) that either meet or don't meet your requirements.

To help us (and I agree with anonymonk that "This looks like a really fun problem to work on"), please reply to the following:

a. 9 out of 10 in both align to each other perfectly

No prob:

ACGTACGTAC
GCGTACGTAC

That's okay, right?

b. 10 out of 10 in both align to each other perfectly

Even easier to understand:

ACGTACGTAC
ACGTACGTAC

Perfect match, right?

c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion

I don't understand. Please supply a couple of examples of pairs that meet/don't meet your requirements (with comments, if necessary).

d. 9 out of 9 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons

Ditto

e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons

I think I understand this one, but could you again supply a couple of examples of pairs that meet/don't meet your requirements?

That said, I think that the CPAN module Text::Levenshtein might be what you are looking for. But that could depend on your answers to the above questions...
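To make the Levenshtein suggestion concrete, here is a minimal pure-Perl sketch of the edit distance that the module computes. This is not the module's own code: edit_distance is a hypothetical helper written for illustration, and the second pair is my guess at what an insertion/deletion case might look like, not something taken from your data. With Text::Levenshtein installed, its exportable distance function should give you the same numbers.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not from Text::Levenshtein): classic
# dynamic-programming edit distance, keeping only two rows of the table.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));    # distance from the empty prefix of $s
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # match or substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Assumed example pairs: one substitution, then one deletion.
print edit_distance('ACGTACGTAC', 'GCGTACGTAC'), "\n";
print edit_distance('ACGTACGTAC', 'ACGTACGTA'),  "\n";
```

A distance of 0 would be a perfect match, and a distance of 1 would cover a single substitution or a single insertion/deletion. Whether that threshold is what you want depends on your answers to the questions above.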

Update: Corrected copy/paste error(s)

Re^2: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft (Scribe) on Mar 15, 2015 at 02:05 UTC
Not_a_number, thank you for your response. Please see updated info in response to Anonymous_Monk's request for case examples. I hope the examples make it clearer. The "Text-Levenshtein" solution seems like it might work.
Re^3: Filtering matches of near-perfect-matched DNA sequence pairs by BrowserUk (Patriarch) on Mar 15, 2015 at 07:04 UTC
The "Text-Levenshtein" solution seems like it might work

Whilst Levenshtein will work for your application, it is an exhaustive, and thus very slow, O(n*m) algorithm. Even the XS version is many times slower than the xor method you use in the OP. As such, it is best avoided unless no other short-circuiting method can be found to solve your problem.

The good news is that alternatives are nearly always possible. The only thing lacking here is a clear description of your data.

If you would step back from your jargon and conceptual visualisation of the problem, and answer the multiple, impassioned pleas asking "what does your actual data look like?", then I'm pretty sure you would have multiple, efficient, working solutions by now.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
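For reference, the xor approach mentioned above can be sketched as follows. This is only an illustration under the assumption that both sequences have the same length and differ by substitutions alone; mismatches is a hypothetical name, and the OP's actual code may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: count the positions at which two equal-length
# strings differ. String xor yields a NUL byte wherever the characters
# are identical, so counting the non-NUL bytes counts the mismatches.
sub mismatches {
    my ($s, $t) = @_;
    return undef if length($s) != length($t);    # xor only compares like with like
    my $x = $s ^ $t;
    return $x =~ tr/\0//c;                       # count bytes that are not NUL
}

print mismatches('ACGTACGTAC', 'GCGTACGTAC'), "\n";    # single substitution
print mismatches('ACGTACGTAC', 'ACGTACGTAC'), "\n";    # identical strings
```

This touches each character once, so it is linear in the sequence length, but it cannot see insertions or deletions: one shifted base makes almost every following position look like a mismatch.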

