Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Mustering Reads

by neversaint (Deacon)
on Nov 01, 2010 at 10:02 UTC ( #868716=perlquestion: print w/replies, xml ) Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have the following data and problem:
my $k =25; my %readrepo =( "readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", "readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", "readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", "readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", ); # This array is already ordered # It says the first(_1) 25 bases of readA overlap with second(_2)part +of readB, etc. my @readstoconcate = ( "readA_1", "readB_2", "readC_1", "readD_2");
What I want to do is to assemble the reads in %readrepo based on the order given by @readstoconcatenate
readA GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA readB TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC readC GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG readD TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC
Yielding the final long stretch of sequence
TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G)
How should one go about it?
Update: Corrected the answer bug above. Thanks to BrowserUK suggestion.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Mustering Reads
by Corion (Pope) on Nov 01, 2010 at 10:14 UTC

    Likely, substr is all you need, if you already know what parts overlap and what parts do not overlap. Once you have the non-overlapping parts, you can go through them character by character and if they match, you output that character, and if they don't match, you output the pair. I think there can only be two pairs anyway, so a fancy trick would be to use binary XOR on the "nonmatched" parts:

    use strict; my @nonmatched = qw(TACTCAGGAG TACTCGGGAG); my $difference = $nonmatched[0] ^ $nonmatched[ 1 ]; # Replace all zeroes (= matches) by the original string: $difference =~ s/\0/substr($nonmatched[0],pos($difference),1)/ge; # Replace the "weird" parts with the proper pairs my %pairs = ( 'C' ^ 'T' => '(C,T)', 'A' ^ 'G' => '(A,G)', ); $difference =~ s/(.)/$pairs{$1} || $1/ge; print $difference;
Re: Mustering Reads
by JavaFan (Canon) on Nov 01, 2010 at 12:16 UTC
    The following does seem to do the trick. It should work for any number of "repo"s. It does assume all are of the same length, but it wouldn't be complicated to adjust for varying lengths:
    use 5.010; use strict; use warnings; my $k = 25; my %readrepo =( readA => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", readB => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", readC => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", readD => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", ); my @readstoconcate = ( "readA_1", "readB_2", "readC_1", "readD_2", ); my @prefixes; my @postfixes; my $fixed; foreach my $read (@readstoconcate) { my ($entry, $tag) = split '_', $read; if ($tag == 1) { push @postfixes, substr $readrepo{$entry}, $k; $fixed ||= substr $readrepo{$entry}, 0, $k; } else { my $k1 = length($readrepo{$entry}) - $k; push @prefixes, substr $readrepo{$entry}, 0, $k1; $fixed ||= substr $readrepo{$entry}, $k1; } } # # Assume all entries in %readrepo are the same length. # foreach my $set (\@prefixes, \@postfixes) { next unless @$set; for (my $i = 0; $i < length $$set[0]; $i++) { my $l = "\x0"; my @c = grep {my $r = ($_ ne $l); $l = $_; $r} map {substr $_, $i, 1} @$set; local $" = ","; print @c == 1 ? $c[0] : "(@c)"; } } continue { state $flag = 0; print $fixed unless $flag++; } print "\n"; __END__ TACTC(A,G)GGAGGCTGAGGCAGGAGAATTGCTTGAAC(C,T)T(G,A)GG(A,G)G(G,A)(C,T)(A +,G)
Re: Mustering Reads
by BrowserUk (Pope) on Nov 01, 2010 at 11:12 UTC

    This produces a slightly different output for your sample than you provided--but I think mine is right :)

    (You'll have to explain to me why it isn't, if it isn't.)

    This could probably be made quite a bit more efficient with further thought, but I wanted to check if it is correct first.

    #! perl -slw use strict; my $k = 25; my %repo = ( "readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", "readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", "readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", "readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", ); my @order = ( "readA_1", "readB_2", "readC_1", "readD_2"); my( @heads, @tails, $common ); while( @order ) { my( $s1, $p1, $s2, $p2 ) = map split( '_', shift @order ), 1 .. 2; ( $s1, $s2 ) = ( $s2, $s1 ) if $p1 > $p2; push @heads, substr $repo{ $s2 }, 0, length( $repo{ $s2 } ) - $k; push @tails, substr $repo{ $s1 }, $k; $common = substr $repo{ $s1 }, -$k unless $common; } my $head = ''; for my $p ( 0 .. length( $heads[0] )-1 ) { my %uniq; ++$uniq{ substr $heads[ $_ ], $p, 1 } for 0 .. $#heads; if( keys %uniq > 1 ) { $head .= '(' . join( ',', keys %uniq ) . ')'; } else { $head .= each %uniq; } } my $tail = ''; for my $p ( 0 .. length( $tails[0] )-1 ) { my %uniq; ++$uniq{ substr $tails[ $_ ], $p, 1 } for 0 .. $#tails; if( keys %uniq > 1 ) { $tail .= '(' . join( ',', keys %uniq ) . ')'; } else { $tail .= each %uniq; } } print $head, $common, $tail; __END__ c:\test>868716 TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G)

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear BrowserUK,
      I was trying the following input to your code above:
      my $k = 15; my %repo = ( "readA" => "AAACATGAAAAGAAATGATGAAACAGA", "readB" => "AAATCATGAAACAGAGCCTCATCTCCC ", ); my @order = ( "readB_1","readA_2");
      Why it gives this output
      AAACATGAAAAGGAGCCTCATCTCCC GCCTCATCTCCC
      Instead of:
      AAACATGAAAAGAAAT(G,C)ATGAAACAGAGCCTCATCTCCC
      Following this alignment:
      readA AAACATGAAAAGAAATGATGAAACAGA readB AAATCATGAAACAGAGCCTCATCTCCC
      How can I fix it?

      ---
      neversaint and everlastingly indebted.......
        How can I fix it?

        By giving better examples :) I had assumed on the basis of the sample supplied, that the overlapping sections were identical as in both cases supplied. I think javafans make that same assumption.

        You'd need to accumulate the overlaps in a third array in the while, and then add another for loop to process them is the same way as @heads & @tails. (There obviously scope there for making the for loop a subroutine.

        I'm done for today, but I'll be back tonight.

        As I suggested:

        #! perl -slw use strict; use Data::Dump qw[ pp ]; my $k = 15; my %repo = ( "readA" => "AAACATGAAAAGAAATGATGAAACAGA", "readB" => "AAATCATGAAACAGAGCCTCATCTCCC ", ); my @order = ( "readB_1","readA_2"); #my $k = 25; #my %repo = ( # "readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", # "readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", # "readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", # "readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", #); #my @order = ( "readA_1", "readB_2", "readC_1", "readD_2"); sub muster { local $^W; my( $bits ) = @_; my $muster = ''; for my $p ( 0 .. length( $bits->[0] )-1 ) { my %uniq; ++$uniq{ substr $bits->[ $_ ], $p, 1 } for 0 .. $#$bits; if( keys %uniq > 1 ) { $muster .= '(' . join( ',', keys %uniq ) . ')'; } else { $muster .= each %uniq; } } return $muster; } my( @heads, @bodys, @tails ); while( @order ) { my( $s1, $p1, $s2, $p2 ) = map split( '_', shift @order ), 1 .. 2; ( $s1, $s2 ) = ( $s2, $s1 ) if $p1 > $p2; push @heads, substr $repo{ $s2 }, 0, length( $repo{ $s2 } ) - $k; push @bodys, substr $repo{ $s2 }, -$k; push @tails, substr $repo{ $s1 }, $k; push @bodys, substr $repo{ $s1 }, 0, $k; } my $head = muster( \@heads ); my $body = muster( \@bodys ); my $tail = muster( \@tails ); print $head, $body, $tail; __END__ TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G) TACTC(A,G)GGAGGCTGAGGCAGGAGAATTGCTTGAAC(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G) AAACATGAAAAGAAAT(C,G)ATGAAACAGAGCCTCATCTCCC AAACATGAAAAGAAAT(G,C)ATGAAACAGAGCCTCATCTCCC

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Mustering Reads (definition?)
by LanX (Archbishop) on Nov 01, 2010 at 11:11 UTC
    Sorry, could you please explain the meaning of "mustering" or provide a link with a definition?

    I tried to google it, but without much success and I'm not proficient in the arts of bioinformatics.

    Cheers Rolf

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://868716]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (6)
As of 2020-02-25 19:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What numbers are you going to focus on primarily in 2020?










    Results (113 votes). Check out past polls.

    Notices?