Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Filtering matches of near-perfect-matched DNA sequence pairs

by onlyIDleft (Scribe)
on Mar 13, 2015 at 00:36 UTC ( [id://1119894]=perlquestion: print w/replies, xml ) Need Help??

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Perl Monks!

I seek your wisdom for a problem of mine: I need to align 2 short DNA sequences. The starting material is 2 sequences, each 10 nucleotides long (the sequences should typically be As, Ts, Gs, or Cs. When sequence is ambiguous, Ns can also be expected).

There a few different possibilities in how the sequences may align to each other and I list them below:

a. 9 out of 10 in both align to each other perfectly

b. 10 out of 10 in both align to each other perfectly

c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion

d. 9 out of 9 in both align to each other,but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons

e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons

In all of the cases above, none of the sequences can contain anything but A/T/G/C. If there are other letters such as Ns etc., those cases will need to be discarded without even performing the match test

As 1st pass attempt, I have cobbled up some script, but I know it does not test for all cases above. Would you please tell me if I should use a different approach to test all cases listed above, or can I adapt what I have already?

Thank you exalted ones!

if((defined $upstream_putative_TSD)&&(defined $downstream_p +utative_TSD)) { # Check if the putative TSDs differ by just 1 mismatch or +are perfect matches my $max_SNP = 1; my $diffCount = () = ( $upstream_putative_TSD ^ $downs +tream_putative_TSD ) =~ /[^\x00]/g; # print $upstream_putative_TSD, "\t", $ +downstream_putative_TSD, "\t", $diffCount, "\n"; # OK thus far # syntax idea from https://www.biostars.org/p/83978/ if ($diffCount <= $max_SNP) { my $upstream_putative_TSD_non_canonical_letter +_count = $upstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//; # check to see whether upstream putative TSD c +ontains anything but A/T/G/C, if yes, how many my $downstream_putative_TSD_non_canonical_lett +er_count = $downstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//; # check to see whether downstream putative TSD + contains anything but A/T/G/C, if yes, how many if(($upstream_putative_TSD_non_canonical_l +etter_count==0)&&($downstream_putative_TSD_non_canonical_letter_count +==0)) { print $_, "\n"; push @output, $_, "\n"; } } }

Replies are listed 'Best First'.
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by graff (Chancellor) on Mar 13, 2015 at 03:00 UTC
    I don't understand your list of conditions, and it's hard for me to tell how your code relates to them. Could you add some examples that meet and fail the various conditions?

    I'm not up on terms in biology. If your "starting material is 2 sequences, each 10 nucleotides long" does that mean that both inputs should be 10 characters, and must contain A,C,G,T? If so, then I'd start out like this:

    use strict; use Carp; sub compare_TSD { my %tsd; my ( $tsd{up}, $tsd{dn} ) = @_; croak "compare_TSD called without two defined values" unless ( defined $tsd{up} and defined $tsd{dn} ); my $return_status = ''; for my $arg ( qw/up dn/ ) { my $str = $arg; $str .= " has unusable characters," if ( $tsd{$arg} =~ /[^ACGT +]/ ); $str .= " has wrong character count," if ( length( $tsd{$arg} +!= 10 ); $return_status .= " $str" if ( $str ne $arg ); } return $return_status if ( $return_status ne '' ); return "perfect match" if ( $tsd{up} eq $tsd{dn} ); # not sure what else needs to be checked... }
    By putting the logic into a subroutine, you can modularize it, work out the specs and unit tests for its intended behaviors, and make it reusable (not bound up inside any single larger script). You can work out what sorts of values it needs to return to any given caller, in order to make it easier for the caller to do its job with the result.
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by roboticus (Chancellor) on Mar 13, 2015 at 01:37 UTC

      Hi roboticus: Thanks for your reply.

      That posting mentions that it might not work for # mismatches > 1

      If that statement is indeed true, then it wont help with case c. listed in my OP. Do you care to comment please? Does BrowserUK's incomplete solution, based on your OP there, seem like a viable option for further modification?

        onlyIDleft:

        I don't see why it won't work with case c. Perhaps I don't understand exactly what you mean. Can you provide a bit more detail or some examples so I can see what you're meaning? If so, I may be able to tweak things up a bit. (It would be best to have a few examples of each of the cases, so we can cover all the corners.)

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Filtering matches of near-perfect-matched DNA sequence pairs
by wollmers (Scribe) on Mar 13, 2015 at 12:18 UTC

    Your specification is ambigous. Do you want to catch also this case:

    1234567890 ACCCCCCCCC_ _CCCCACCCCT

    Your method with the exclusive or works only for two strings of the same length, and one or more substitutions, no inserts or deletes.

    Helmut Wollmersdorfer

      YES! You got it!

      I need to filter-in such cases, currently they are being filtered out, right?

Re: Filtering matches of near-perfect-matched DNA sequence pairs
by Laurent_R (Canon) on Mar 13, 2015 at 07:19 UTC
    Hmm, the last time I studied biology and genetics was 40 years ago, not only did I forget most of it, but the knowledge has changed tremendously in the interval. In short, I really don't know much on the subject.

    However, I know that the human genome sequencing project has used very widely Perl, thanks in part to the very good tools that have been developed for that, including a number of good reusable modules. I would think that you should take a look at BioPerl and at related Perl modules, it is quite likely that the tools to do what you want already exist and are available on the CPAN or eslewhere.

    Je suis Charlie.
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by Not_a_Number (Prior) on Mar 13, 2015 at 21:50 UTC

    Hi, onlyIDleft.

    Your problem seems interesting.

    However, you seem to have missed (I find it hard to believe that you have deliberately ignored...) any requests from fellow monks for concrete examples of string pairs (DNA sequences) that either meet or don't meet your requirements.

    To help us (and I agree with anonymonk that "This looks like a really fun problem to work on"), please reply to the following:

    a. 9 out of 10 in both align to each other perfectly

    No prob:

    ACGTACGTAC GCGTACGTAC

    That's okay, right?


    b. 10 out of 10 in both align to each other perfectly

    Even easier to understand:

    ACGTACGTAC ACGTACGTAC

    Perfect match, right?


    c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion

    I don't understand. Please supply a couple of examples of pairs that meet/don't meet your requirements (with comments, if necessary)


    d. 9 out of 9 in both align to each other,but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons

    Ditto


    e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons

    I think I understand this one, but could you again supply a couple of examples of pairs that meet/don't meet your requirements?


    That said, I think that the CPAN module Text::Levenshtein might be what you are looking for. But that could depend on your answers to the above questions...


    Update: Corrected copy/paste error(s)

      Not_a_number, thank you for your response. Please see updated info in response to Anonymous_Monk's request for case examples. I hope the examples make it clearer. The "Text-Levenshtein" solution seems like it might work

        The "Text-Levenshtein" solution seems like it might work

        Whilst Levenshtein will work for your application, it is an exhaustive, and thus very slow O(n*m) algorithm.

        Even the XS version is many times slower than the xor method you use in the OP. As such, it is best avoided unless no other short-circuiting method can be found to solve your problem.

        The good news is that alternatives are nearly always possible. The only thing lacking here is a clear description of your data.

        If you would step back from your jargon and conceptual visualisation of the problem; and answer the multiple, impassioned pleas asking "what does your actual data look like?"; then I'm pretty sure you would have multiple, efficient, working solutions by now.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by Anonymous Monk on Mar 13, 2015 at 19:56 UTC

    This looks like a really fun problem to work on, but I'm unclear about your listed possibilities. Could you please (please please please) provide several examples of each of the five listed cases and a test set (maybe 20 or 30 items) with expected results.

      Sorry, sorry, sorry Monks! Here are some verbose explanations and examples, hope these help

      Before that, just a lil bit of sequence alignment lingo from biologists dictionary:

      substitution: when a letter in one sequence is replaced by a different letter. in another sequence. eg.GGTA is substituted in 2 places to give CCTA , TACGACT substituted in 1 place to give AACGACT etc.

      indel: when one letter in one sequence is replaced by nothing in the second sequence. 1st sequence is said to have an insertion (IN) and 2nd sequence has a deletion (DEL). Hence the term INDEL eg. TAGAGGATC and TAGAGATC differ by 1 indel position, so 2nd sequence when aligned would be TAGAG-ATC or TAGA-GATC

      case c: the mismatch is not due to substitution of one letter for another, but a gap (shown as '-' here) due to a missing letter when comparing the 2 sequences

      AC-TACGTAC ACGTACGTAC

      or

      ACGTACGTAC ACGTACGT-C

      case d: the mismatch is due to substitution of one letter for another, and not an insertion or deletion as show in examples above, for case c.

      CTTACGTAC CGTACGTAC

      or

      CGTACGTGC CGTACGTCC

      case e: same as case d. above, except the matched lengths are 10 letters long, and not 9 letters as for case d.. Mis-match is not due to insertion or deletion, but a substitution, again as for case d.

      ACCTACGTAC ACGTACGTAC

      or

      GTACGTACGG GTACGTTCGG

      Some examples of what should pass the filters and what should not are shown below

      10nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

      ATGGACGTAC ATGGACGTAC

      9nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

      CGTACAGTA CGTACAGTA

      10nt sequences, 1 indel position, passes filter, OK

      AC-TACGTAC ACGTACGTAC

      10nt sequences, 2 indel positions in total,1 indel on one sequence and 2nd indel on 2nd sequence, does not passes filter, not OK

      AC-TACGTAC ACGTACG-AC

      10nt sequences, 2 indel positions in total, both indels on same sequence, does not passes filter, not OK

      AC-TAC-TAC ACGTACGTAC

      Bottom line is that when sequences of 10 letters are aligned to each other, there should be at the very minimum 9 letters that are aligned with a maximum of 1 indel or substitution. At the very best, all 10 letters are perfectly matched with no indels and no substitutions. And all other intermediate cases, some with examples of alignments above. I hope things are a little better to understand now, especially for non-biologists. Sorry for the cryptic explanation in my OP!

        Thank you. I still don't understand, though. What is the input? Are the dashes already there?
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1119894]
Approved by graff
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2024-04-24 19:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found