Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Create the reverse complement DNA sequence without pattern matching and reverse built-in function?

by Anonymous Monk
on Feb 19, 2014 at 14:54 UTC ( [id://1075463]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I am supposed to write a script for creating the reverse complement DNA sequence, and I already know how to do it by using pattern matching and the reverse function.
$seq="ACGGGAGGACGGGAAAATTACTACGGCATTAGC"; print "Initial seq is:\n".$seq."\n"; my $complement=$seq; $complement =~ tr/ACGTacgt/TGCAtgca/; print "Complement is:\n".$complement."\n"; $reverse = reverse($complement); print "Reverse complement is:\n".$reverse."\n";
Is there a way of doing this without regular expressions, using only substr function?
  • Comment on Create the reverse complement DNA sequence without pattern matching and reverse built-in function?
  • Select or Download Code

Replies are listed 'Best First'.
Re: Create the reverse complement DNA sequence without pattern matching and reverse built-in function?
by davido (Cardinal) on Feb 19, 2014 at 15:05 UTC

    Sure there is. You could even reimplement an entire regular expression engine using substr, though I wouldn't recommend it. However, you're not using a regular expression here, you're using transliteration: tr///.

    my $seq="ACGGGAGGACGGGAAAATTACTACGGCATTAGC"; my %xref = ( A => 'T', C => 'G', G => 'C', T => 'A', a => 't', c => 'g', g => 'c', t => 'a', ); for(my $ix = 0; $ix < length $seq; ++$ix ) { substr( $seq, $ix, 1, $xref{ substr( $seq, $ix, 1 ) } ); } my $reverse = reverse($seq); print $reverse, "\n";

    Proof you can write "almost C" in Perl.

    I would expect this to be tremendously slower than using tr/// on large data sets.


    Dave

      Another similar approach, for illustration purposes:

      perl -Mstrict -Mwarnings -le 'my $seq=q{ACGGGAGGACGGGAAAATTACTACGGCATTAGC}; my $complement; my %rc = ( A => q{T}, T => q{A}, C => q{G}, G => q{C}, ); my @letters = split //, $seq; while ( @letters ) { my $l = uc pop @letters; $complement .= $rc{$l}; } print $seq; print $complement;'

      Hope that helps.

Re: Create the reverse complement DNA sequence without pattern matching and reverse built-in function?
by BrowserUk (Patriarch) on Feb 19, 2014 at 16:12 UTC
    Is there a way of doing this ..., using only substr function?

    As has been shown, you can. But why would you?

    tr *ISN'T* a regular expression. It is a compile time constructed table lookup; designed for exactly this purpose; and it will do the job anywhere from 100x to 400 78x & 270 times faster than the other methods offered here:

    #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $seq = "ACGGGAGGACGGGAAAATTACTACGGCATTAGCacgggaggacgggaaaattactacg +gcattagc"; our %xref = ( A => 'T', C => 'G', G => 'C', T => 'A', a => 't', c => 'g', g => 'c', t => 'a', ); our %rc = ( A => q{T}, T => q{A}, C => q{G}, G => q{C}, ); cmpthese -1, { tr => q[ ( my $revcmp = reverse $seq ) =~tr[ACGTacgt][tgcaTGCA]; ], davido => q[ my $seq = $seq; for(my $ix = 0; $ix < length $seq; ++$ix ) { substr( $seq, $ix, 1, $xref{ substr( $seq, $ix, 1 ) } ); } my $reverse = reverse($seq); ], atcroft => q[ my $complement; my @letters = split //, $seq; while ( @letters ) { my $l = uc pop @letters; $complement .= $rc{$l}; } ], hdb => q[ my $rev=''; my $n = length $seq; while( $n-- ){ $rev .= $_ for map chr( $_ & 2 ? $_^4: $_^21 ), ord substr + $seq, $n, 1 } ], } __END__ C:\test>junk60 Rate hdb atcroft davido tr hdb 9827/s -- -35% -71% -100% atcroft 15077/s 53% -- -55% -99% davido 33822/s 244% 124% -- -99% tr 2679052/s 27163% 17669% 7821% --

    (I know you guys know this; but does the OP??? And if they do, why are they asking for this?)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I tried to alert the OP to this issue in my post:

      I would expect this to be tremendously slower than using tr/// on large data sets.

      And your benchmark confirms it; there's a tremendous difference, in favor of using tr/// to do what it was designed to do. I think your tr/// method missed reversing the string afterwards, but that won't change the fact that the tr/// approach is the way to go.


      Dave

        I tried to alert the OP to this issue in my post:

        Sorry. I missed that.

        your tr/// method missed reversing the string afterwards, but that won't change the fact that the tr/// approach is the way to go.

        Right on both counts. And it make's more of a difference than I was expecting, but still 80x faster than the next best:

        tr => q[ ( my $revcmp = reverse $seq ) =~ tr[ACGTacgt][tgcaTGCA]; ], C:\test>junk60 Rate hdb atcroft davido tr hdb 9827/s -- -36% -71% -100% atcroft 15292/s 56% -- -55% -99% davido 33822/s 244% 121% -- -99% tr 2640400/s 26770% 17167% 7707% --

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Create the reverse complement DNA sequence without pattern matching and reverse built-in function?
by kcott (Archbishop) on Feb 19, 2014 at 16:35 UTC

    TMTOWTDI:

    #!/usr/bin/env perl -l use strict; use warnings; my %comp = qw{A T C G G C T A a t c g g c t a}; my $seq = 'ACGTacgt'; print 'Initial: ', $seq; print 'Complement: ', @comp{split '' => $seq}; print 'Reverse Complement: ', @comp{reverse split '' => $seq};

    Output:

    Initial: ACGTacgt Complement: TGCAtgca Reverse Complement: acgtACGT

    Update:

    For the OP: I hope, from other posts in this thread, that it's clear there's no benefit in not using the appropriate tools provided with the language, i.e. tr and reverse.

    However, if you do have some reason for writing your code this way, I notice that I did, in fact, use reverse. So, in keeping with the hash slice solution I used, you can get the same output by changing the last two lines to:

    my @n = split '' => $seq; print 'Complement: ', @comp{@n}; print 'Reverse Complement: ', @comp{map { $n[$#n - $_] } 0 .. $#n};

    -- Ken

Re: Create the reverse complement DNA sequence without pattern matching and reverse built-in function?
by hdb (Monsignor) on Feb 19, 2014 at 15:51 UTC

    Not recommended for production:

    use strict; use warnings; my $seq='ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; my $rev=''; my $n = length $seq; while( $n-- ){ $rev .= $_ for map chr($_&2?$_^4:$ +_^21), ord substr $seq, $n, 1 } print "$seq\n$rev\n";

    Something reminds me of Black Jack...

    Update: ...or:

    use strict; use warnings; my $seq='ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; my $rev=''; $rev = "$_$rev" for map chr($_&2?$_^4:$_^21), map ord, split//,$seq; print "$seq\n$rev\n";
Re: Create the reverse complement (OT: meaning / use of "reverse complement")
by ww (Archbishop) on Feb 19, 2014 at 16:55 UTC

    Could some kind soul explain how "reverse complement" works? Is is not the same as "reverse? -- ie, is NOT one word or the other redundant?
    Update: Has been explained! See below and the replies from BrowserUK and kcott

    And if that's a bit fuzzy around the edges (yeah, it is, IMO), code to illustrate why I'm puzzled with OP's use of the phrase "reverse complement" and its adoption by others:

    #!/usr/bin/perl use 5.016; use warnings; # 1075463 my $seq="ACGGGAGGACGGGAAAATTACTACGGCATTAGC"; say "Initial \$seq is: $seq"; say "Reverse sequence is: " . reverse($seq); my @seq = split(/\b/, $seq); my $comp; while (@seq) { $comp .= pop(@seq); } print "Complement is: $comp\n"; my $reverse = reverse($comp); print "Reverse complement is:$reverse\n"; =head OUTPUT: Initial $seq is: ACGGGAGGACGGGAAAATTACTACGGCATTAGC Reverse sequence is: CGATTACGGCATCATTAAAAGGGCAGGAGGGCA Complement is: ACGGGAGGACGGGAAAATTACTACGGCATTAGC # per OP's use, + I think Reverse complement is:CGATTACGGCATCATTAAAAGGGCAGGAGGGCA =cut

    On-line dictionaries tend to mention the mathematics use only in geometric terms. Is my understanding that the OP's phrase would be better written and less ambiguous using just one of its two terms?

    Update:

    /me headslaps! Duh! Ignored the "double helix" fundamental! Thanks, obeisances and + +s to both BrowserUK and kcott.

    Come, let us reason together: Spirit of the Monastery

      I'll attempt to describe this without using any biological references (therefore, highly oversimplified).

      DNA is made up of long strands (sequences) of just four building blocks (given the initials A, C, G and T). Here's an example sequence:

      A | C | G | T | G

      The sequence here is ACGTG; the reverse sequence would be that read from the other end, i.e. GTGCA.

      DNA strands form pairs. Each of the building blocks pairs with only one of the other building blocks: A with T and C with G (and vice versa). Knowing the sequence of one strand, the complementary sequence is easily determined:

      A==T | | C==G | | G==C | | T==A | | G==C

      So, the complement of ACGTG is TGCAC; and the reverse complement is just that read from the other end, i.e. CACGT.

      As I said, that's highly oversimplified but hopefully is enough to explain the terms.

      -- Ken

      Google shows that is a well established term in genomics.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1075463]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (7)
As of 2024-04-18 21:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found