Filtering matches of near-perfect-matched DNA sequence pairs

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Perl Monks!

I seek your wisdom for a problem of mine: I need to align 2 short DNA sequences. The starting material is 2 sequences, each 10 nucleotides long (the sequences should typically be As, Ts, Gs, or Cs. When sequence is ambiguous, Ns can also be expected).

There a few different possibilities in how the sequences may align to each other and I list them below:

a. 9 out of 10 in both align to each other perfectly

b. 10 out of 10 in both align to each other perfectly

c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion

d. 9 out of 9 in both align to each other,but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons

e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons

In all of the cases above, none of the sequences can contain anything but A/T/G/C. If there are other letters such as Ns etc., those cases will need to be discarded without even performing the match test

As 1st pass attempt, I have cobbled up some script, but I know it does not test for all cases above. Would you please tell me if I should use a different approach to test all cases listed above, or can I adapt what I have already?

Thank you exalted ones!

           if((defined $upstream_putative_TSD)&&(defined $downstream_p
+utative_TSD))
            {
            # Check if the putative TSDs differ by just 1 mismatch or 
+are perfect matches
                my $max_SNP = 1;
                my $diffCount = () = ( $upstream_putative_TSD ^ $downs
+tream_putative_TSD ) =~ /[^\x00]/g;
                #                print $upstream_putative_TSD, "\t", $
+downstream_putative_TSD, "\t", $diffCount, "\n"; # OK thus far

                # syntax idea from https://www.biostars.org/p/83978/
                    if ($diffCount <= $max_SNP)
                    {
                        my $upstream_putative_TSD_non_canonical_letter
+_count = $upstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
                        # check to see whether upstream putative TSD c
+ontains anything but A/T/G/C, if yes, how many
                        my $downstream_putative_TSD_non_canonical_lett
+er_count = $downstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
                        # check to see whether downstream putative TSD
+ contains anything but A/T/G/C, if yes, how many
                            if(($upstream_putative_TSD_non_canonical_l
+etter_count==0)&&($downstream_putative_TSD_non_canonical_letter_count
+==0))
                                {
                                    print $_, "\n";
                                    push @output, $_, "\n";
                                }
                    }
            }
[download]

Comment on Filtering matches of near-perfect-matched DNA sequence pairs Download Code

Replies are listed 'Best First'.
Re: Filtering matches of near-perfect-matched DNA sequence pairs by graff (Chancellor) on Mar 13, 2015 at 03:00 UTC
I don't understand your list of conditions, and it's hard for me to tell how your code relates to them. Could you add some examples that meet and fail the various conditions? I'm not up on terms in biology. If your "starting material is 2 sequences, each 10 nucleotides long" does that mean that both inputs should be 10 characters, and must contain A,C,G,T? If so, then I'd start out like this: use strict; use Carp; sub compare_TSD { my %tsd; my ( $tsd{up}, $tsd{dn} ) = @_; croak "compare_TSD called without two defined values" unless ( defined $tsd{up} and defined $tsd{dn} ); my $return_status = ''; for my $arg ( qw/up dn/ ) { my $str = $arg; $str .= " has unusable characters," if ( $tsd{$arg} =~ /[^ACGT +]/ ); $str .= " has wrong character count," if ( length( $tsd{$arg} +!= 10 ); $return_status .= " $str" if ( $str ne $arg ); } return $return_status if ( $return_status ne '' ); return "perfect match" if ( $tsd{up} eq $tsd{dn} ); # not sure what else needs to be checked... } [download] By putting the logic into a subroutine, you can modularize it, work out the specs and unit tests for its intended behaviors, and make it reusable (not bound up inside any single larger script). You can work out what sorts of values it needs to return to any given caller, in order to make it easier for the caller to do its job with the result.	[reply] [d/l]
Re: Filtering matches of near-perfect-matched DNA sequence pairs by roboticus (Chancellor) on Mar 13, 2015 at 01:37 UTC
onlyIDleft: It sounds like you could adapt one I put together a couple years ago: Re^2: Comparing a large set of DNA sequences. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply]
Re^2: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft (Scribe) on Mar 13, 2015 at 02:37 UTC
Hi roboticus: Thanks for your reply. That posting mentions that it might not work for # mismatches > 1 If that statement is indeed true, then it wont help with case c. listed in my OP. Do you care to comment please? Does BrowserUK's incomplete solution, based on your OP there, seem like a viable option for further modification?	[reply]
Re^3: Filtering matches of near-perfect-matched DNA sequence pairs by roboticus (Chancellor) on Mar 13, 2015 at 10:03 UTC
onlyIDleft: I don't see why it won't work with case c. Perhaps I don't understand exactly what you mean. Can you provide a bit more detail or some examples so I can see what you're meaning? If so, I may be able to tweak things up a bit. (It would be best to have a few examples of each of the cases, so we can cover all the corners.) ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply]
Re: Filtering matches of near-perfect-matched DNA sequence pairs by wollmers (Scribe) on Mar 13, 2015 at 12:18 UTC
Your specification is ambigous. Do you want to catch also this case: `1234567890 ACCCCCCCCC_ _CCCCACCCCT` [download] Your method with the exclusive or works only for two strings of the same length, and one or more substitutions, no inserts or deletes. Helmut Wollmersdorfer	[reply] [d/l]
Re^2: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft (Scribe) on Mar 13, 2015 at 17:53 UTC
YES! You got it! I need to filter-in such cases, currently they are being filtered out, right?	[reply]
Re: Filtering matches of near-perfect-matched DNA sequence pairs by Laurent_R (Canon) on Mar 13, 2015 at 07:19 UTC
Hmm, the last time I studied biology and genetics was 40 years ago, not only did I forget most of it, but the knowledge has changed tremendously in the interval. In short, I really don't know much on the subject. However, I know that the human genome sequencing project has used very widely Perl, thanks in part to the very good tools that have been developed for that, including a number of good reusable modules. I would think that you should take a look at BioPerl and at related Perl modules, it is quite likely that the tools to do what you want already exist and are available on the CPAN or eslewhere. Je suis Charlie.	[reply]
Re: Filtering matches of near-perfect-matched DNA sequence pairs by Not_a_Number (Prior) on Mar 13, 2015 at 21:50 UTC
Hi, onlyIDleft. Your problem seems interesting. However, you seem to have missed (I find it hard to believe that you have deliberately ignored...) any requests from fellow monks for concrete examples of string pairs (DNA sequences) that either meet or don't meet your requirements. To help us (and I agree with anonymonk that "This looks like a really fun problem to work on"), please reply to the following: a. 9 out of 10 in both align to each other perfectly No prob: `ACGTACGTAC GCGTACGTAC` [download] That's okay, right? b. 10 out of 10 in both align to each other perfectly Even easier to understand: `ACGTACGTAC ACGTACGTAC` [download] Perfect match, right? c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion I don't understand. Please supply a couple of examples of pairs that meet/don't meet your requirements (with comments, if necessary) d. 9 out of 9 in both align to each other,but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons Ditto e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons I think I understand this one, but could you again supply a couple of examples of pairs that meet/don't meet your requirements? That said, I think that the CPAN module Text::Levenshtein might be what you are looking for. But that could depend on your answers to the above questions... Update: Corrected copy/paste error(s)	[reply] [d/l] [select]
Re^2: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft (Scribe) on Mar 15, 2015 at 02:05 UTC
Not_a_number, thank you for your response. Please see updated info in response to Anonymous_Monk's request for case examples. I hope the examples make it clearer. The "Text-Levenshtein" solution seems like it might work	[reply]
Re^3: Filtering matches of near-perfect-matched DNA sequence pairs by BrowserUk (Patriarch) on Mar 15, 2015 at 07:04 UTC
The "Text-Levenshtein" solution seems like it might work Whilst Levenshtein will work for your application, it is an exhaustive, and thus very slow O(n*m) algorithm. Even the XS version is many times slower than the xor method you use in the OP. As such, it is best avoided unless no other short-circuiting method can be found to solve your problem. The good news is that alternatives are nearly always possible. The only thing lacking here is a clear description of your data. If you would step back from your jargon and conceptual visualisation of the problem; and answer the multiple, impassioned pleas asking "what does your actual data look like?"; then I'm pretty sure you would have multiple, efficient, working solutions by now. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply]
Re: Filtering matches of near-perfect-matched DNA sequence pairs by Anonymous Monk on Mar 13, 2015 at 19:56 UTC
This looks like a really fun problem to work on, but I'm unclear about your listed possibilities. Could you please (please please please) provide several examples of each of the five listed cases and a test set (maybe 20 or 30 items) with expected results.	[reply]
Re^2: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft (Scribe) on Mar 13, 2015 at 23:30 UTC
Sorry, sorry, sorry Monks! Here are some verbose explanations and examples, hope these help Before that, just a lil bit of sequence alignment lingo from biologists dictionary: substitution: when a letter in one sequence is replaced by a different letter. in another sequence. eg.GGTA is substituted in 2 places to give CCTA , TACGACT substituted in 1 place to give AACGACT etc. indel: when one letter in one sequence is replaced by nothing in the second sequence. 1st sequence is said to have an insertion (IN) and 2nd sequence has a deletion (DEL). Hence the term INDEL eg. TAGAGGATC and TAGAGATC differ by 1 indel position, so 2nd sequence when aligned would be TAGAG-ATC or TAGA-GATC case c: the mismatch is not due to substitution of one letter for another, but a gap (shown as '-' here) due to a missing letter when comparing the 2 sequences `AC-TACGTAC ACGTACGTAC` [download] or `ACGTACGTAC ACGTACGT-C` [download] case d: the mismatch is due to substitution of one letter for another, and not an insertion or deletion as show in examples above, for case c. `CTTACGTAC CGTACGTAC` [download] or `CGTACGTGC CGTACGTCC` [download] case e: same as case d. above, except the matched lengths are 10 letters long, and not 9 letters as for case d.. Mis-match is not due to insertion or deletion, but a substitution, again as for case d. `ACCTACGTAC ACGTACGTAC` [download] or `GTACGTACGG GTACGTTCGG` [download] Some examples of what should pass the filters and what should not are shown below 10nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK `ATGGACGTAC ATGGACGTAC` [download] 9nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK `CGTACAGTA CGTACAGTA` [download] 10nt sequences, 1 indel position, passes filter, OK `AC-TACGTAC ACGTACGTAC` [download] 10nt sequences, 2 indel positions in total,1 indel on one sequence and 2nd indel on 2nd sequence, does not passes filter, not OK `AC-TACGTAC ACGTACG-AC` [download] 10nt sequences, 2 indel positions in total, both indels on same sequence, does not passes filter, not OK `AC-TAC-TAC ACGTACGTAC` [download] Bottom line is that when sequences of 10 letters are aligned to each other, there should be at the very minimum 9 letters that are aligned with a maximum of 1 indel or substitution. At the very best, all 10 letters are perfectly matched with no indels and no substitutions. And all other intermediate cases, some with examples of alignments above. I hope things are a little better to understand now, especially for non-biologists. Sorry for the cryptic explanation in my OP!	[reply] [d/l] [select]
Re^3: Filtering matches of near-perfect-matched DNA sequence pairs by choroba (Cardinal) on Mar 13, 2015 at 23:39 UTC
Thank you. I still don't understand, though. What is the input? Are the dashes already there? لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply]
Re^4: Filtering matches of near-perfect-matched DNA sequence pairs by Anonymous Monk on Mar 14, 2015 at 00:31 UTC
Re^4: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft (Scribe) on Mar 15, 2015 at 02:00 UTC
Re^5: Filtering matches of near-perfect-matched DNA sequence pairs by Anonymous Monk on Mar 15, 2015 at 03:02 UTC
Some notes below your chosen depth have not been shown here


more useful options
	PerlMonks