onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:
Greetings Perl Monks!
I seek your wisdom for a problem of mine: I need to align 2 short DNA sequences. The starting material is 2 sequences, each 10 nucleotides long (the sequences should typically be As, Ts, Gs, or Cs. When sequence is ambiguous, Ns can also be expected).
There a few different possibilities in how the sequences may align to each other and I list them below:
a. 9 out of 10 in both align to each other perfectly
b. 10 out of 10 in both align to each other perfectly
c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion
d. 9 out of 9 in both align to each other,but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons
e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons
In all of the cases above, none of the sequences can contain anything but A/T/G/C. If there are other letters such as Ns etc., those cases will need to be discarded without even performing the match test
As 1st pass attempt, I have cobbled up some script, but I know it does not test for all cases above. Would you please tell me if I should use a different approach to test all cases listed above, or can I adapt what I have already?
Thank you exalted ones!
if((defined $upstream_putative_TSD)&&(defined $downstream_p
+utative_TSD))
{
# Check if the putative TSDs differ by just 1 mismatch or
+are perfect matches
my $max_SNP = 1;
my $diffCount = () = ( $upstream_putative_TSD ^ $downs
+tream_putative_TSD ) =~ /[^\x00]/g;
# print $upstream_putative_TSD, "\t", $
+downstream_putative_TSD, "\t", $diffCount, "\n"; # OK thus far
# syntax idea from https://www.biostars.org/p/83978/
if ($diffCount <= $max_SNP)
{
my $upstream_putative_TSD_non_canonical_letter
+_count = $upstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
# check to see whether upstream putative TSD c
+ontains anything but A/T/G/C, if yes, how many
my $downstream_putative_TSD_non_canonical_lett
+er_count = $downstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
# check to see whether downstream putative TSD
+ contains anything but A/T/G/C, if yes, how many
if(($upstream_putative_TSD_non_canonical_l
+etter_count==0)&&($downstream_putative_TSD_non_canonical_letter_count
+==0))
{
print $_, "\n";
push @output, $_, "\n";
}
}
}
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by graff (Chancellor) on Mar 13, 2015 at 03:00 UTC
|
I don't understand your list of conditions, and it's hard for me to tell how your code relates to them. Could you add some examples that meet and fail the various conditions?
I'm not up on terms in biology. If your "starting material is 2 sequences, each 10 nucleotides long" does that mean that both inputs should be 10 characters, and must contain A,C,G,T? If so, then I'd start out like this:
use strict;
use Carp;
sub compare_TSD {
my %tsd;
my ( $tsd{up}, $tsd{dn} ) = @_;
croak "compare_TSD called without two defined values"
unless ( defined $tsd{up} and defined $tsd{dn} );
my $return_status = '';
for my $arg ( qw/up dn/ ) {
my $str = $arg;
$str .= " has unusable characters," if ( $tsd{$arg} =~ /[^ACGT
+]/ );
$str .= " has wrong character count," if ( length( $tsd{$arg}
+!= 10 );
$return_status .= " $str" if ( $str ne $arg );
}
return $return_status if ( $return_status ne '' );
return "perfect match" if ( $tsd{up} eq $tsd{dn} );
# not sure what else needs to be checked...
}
By putting the logic into a subroutine, you can modularize it, work out the specs and unit tests for its intended behaviors, and make it reusable (not bound up inside any single larger script). You can work out what sorts of values it needs to return to any given caller, in order to make it easier for the caller to do its job with the result. | [reply] [d/l] |
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by roboticus (Chancellor) on Mar 13, 2015 at 01:37 UTC
|
| [reply] |
|
Hi roboticus: Thanks for your reply.
That posting mentions that it might not work for # mismatches > 1
If that statement is indeed true, then it wont help with case c. listed in my OP. Do you care to comment please? Does BrowserUK's incomplete solution, based on your OP there, seem like a viable option for further modification?
| [reply] |
|
onlyIDleft:
I don't see why it won't work with case c. Perhaps I don't understand exactly what you mean. Can you provide a bit more detail or some examples so I can see what you're meaning? If so, I may be able to tweak things up a bit. (It would be best to have a few examples of each of the cases, so we can cover all the corners.)
...roboticus
When your only tool is a hammer, all problems look like your thumb.
| [reply] |
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by wollmers (Scribe) on Mar 13, 2015 at 12:18 UTC
|
1234567890
ACCCCCCCCC_
_CCCCACCCCT
Your method with the exclusive or works only for two strings of the same length, and one or more substitutions, no inserts or deletes.
Helmut Wollmersdorfer | [reply] [d/l] |
|
| [reply] |
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by Laurent_R (Canon) on Mar 13, 2015 at 07:19 UTC
|
Hmm, the last time I studied biology and genetics was 40 years ago, not only did I forget most of it, but the knowledge has changed tremendously in the interval. In short, I really don't know much on the subject.
However, I know that the human genome sequencing project has used very widely Perl, thanks in part to the very good tools that have been developed for that, including a number of good reusable modules. I would think that you should take a look at BioPerl and at related Perl modules, it is quite likely that the tools to do what you want already exist and are available on the CPAN or eslewhere.
| [reply] |
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by Not_a_Number (Prior) on Mar 13, 2015 at 21:50 UTC
|
Hi, onlyIDleft.
Your problem seems interesting.
However, you seem to have missed (I find it hard to believe that you have deliberately ignored...) any requests from fellow monks for concrete examples of string pairs (DNA sequences) that either meet or don't meet your requirements.
To help us (and I agree with anonymonk that "This looks like a really fun problem to work on"), please reply to the following:
a. 9 out of 10 in both align to each other perfectly
No prob:
ACGTACGTAC
GCGTACGTAC
That's okay, right?
b. 10 out of 10 in both align to each other perfectly
Even easier to understand:
ACGTACGTAC
ACGTACGTAC
Perfect match, right?
c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion
I don't understand. Please supply a couple of examples of pairs that meet/don't meet your requirements (with comments, if necessary)
d. 9 out of 9 in both align to each other,but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons
Ditto
e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons
I think I understand this one, but could you again supply a couple of examples of pairs that meet/don't meet your requirements?
That said, I think that the CPAN module Text::Levenshtein might be what you are looking for. But that could depend on your answers to the above questions...
Update: Corrected copy/paste error(s) | [reply] [d/l] [select] |
|
| [reply] |
|
The "Text-Levenshtein" solution seems like it might work
Whilst Levenshtein will work for your application, it is an exhaustive, and thus very slow O(n*m) algorithm.
Even the XS version is many times slower than the xor method you use in the OP. As such, it is best avoided unless no other short-circuiting method can be found to solve your problem.
The good news is that alternatives are nearly always possible. The only thing lacking here is a clear description of your data.
If you would step back from your jargon and conceptual visualisation of the problem; and answer the multiple, impassioned pleas asking "what does your actual data look like?"; then I'm pretty sure you would have multiple, efficient, working solutions by now.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
| [reply] |
Re: Filtering matches of near-perfect-matched DNA sequence pairs
by Anonymous Monk on Mar 13, 2015 at 19:56 UTC
|
This looks like a really fun problem to work on,
but I'm unclear about your listed possibilities.
Could you please (please please please) provide
several examples of each of the five listed cases
and a test set (maybe 20 or 30 items) with expected results.
| [reply] |
|
Sorry, sorry, sorry Monks! Here are some verbose explanations and examples, hope these help
Before that, just a lil bit of sequence alignment lingo from biologists dictionary:
substitution: when a letter in one sequence is replaced by a different letter. in another sequence. eg.GGTA is substituted in 2 places to give CCTA , TACGACT substituted in 1 place to give AACGACT etc.
indel: when one letter in one sequence is replaced by nothing in the second sequence. 1st sequence is said to have an insertion (IN) and 2nd sequence has a deletion (DEL). Hence the term INDEL eg. TAGAGGATC and TAGAGATC differ by 1 indel position, so 2nd sequence when aligned would be TAGAG-ATC or TAGA-GATC
case c: the mismatch is not due to substitution of one letter for another, but a gap (shown as '-' here) due to a missing letter when comparing the 2 sequences
AC-TACGTAC
ACGTACGTAC
or
ACGTACGTAC
ACGTACGT-C
case d: the mismatch is due to substitution of one letter for another, and not an insertion or deletion as show in examples above, for case c.
CTTACGTAC
CGTACGTAC
or
CGTACGTGC
CGTACGTCC
case e: same as case d. above, except the matched lengths are 10 letters long, and not 9 letters as for case d.. Mis-match is not due to insertion or deletion, but a substitution, again as for case d.
ACCTACGTAC
ACGTACGTAC
or
GTACGTACGG
GTACGTTCGG
Some examples of what should pass the filters and what should not are shown below
10nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK
ATGGACGTAC
ATGGACGTAC
9nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK
CGTACAGTA
CGTACAGTA
10nt sequences, 1 indel position, passes filter, OK
AC-TACGTAC
ACGTACGTAC
10nt sequences, 2 indel positions in total,1 indel on one sequence and 2nd indel on 2nd sequence, does not passes filter, not OK
AC-TACGTAC
ACGTACG-AC
10nt sequences, 2 indel positions in total, both indels on same sequence, does not passes filter, not OK
AC-TAC-TAC
ACGTACGTAC
Bottom line is that when sequences of 10 letters are aligned to each other, there should be at the very minimum 9 letters that are aligned with a maximum of 1 indel or substitution. At the very best, all 10 letters are perfectly matched with no indels and no substitutions. And all other intermediate cases, some with examples of alignments above. I hope things are a little better to understand now, especially for non-biologists. Sorry for the cryptic explanation in my OP! | [reply] [d/l] [select] |
|
Thank you. I still don't understand, though. What is the input? Are the dashes already there?
| [reply] |
|
|
|
|
|
|