PerlMonks  

Removing nucleotide from sequence

by bingalee (Acolyte)
on Jun 06, 2013 at 14:38 UTC
bingalee has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm a beginner at Perl programming. I need to write a script that removes nucleotides from many sequences. My data looks something like this:

@HWI-.blah blah.......................:TGACCA

GTAGGGGCTGCGCGAACGCAAACCCCCGCTGCCACAAATGATCGTCGGACTGTAGAA

CTCTGAACGTGTAGATCTCGGTGGCCGCCGTATCATTAAAAAAA

+

?1=.....blah blah......................................>(:@@CB+8(9>@:@CCBB289(259@B9B8?A:@C@>CC@B

This is one set; there are many sets like this in the file. If I want to remove the last 5 "A"s from the sequence along with the corresponding quality characters (>CC@B), and do this for all the sequences, how do I go about it? At first I thought I should split the file into an array on the '+', but then I would have to remove the last five characters of each element, join the elements, and split them again differently so that I could remove the last 5 quality characters from each element. I'm sure there's a less complicated procedure. Can anyone help me out here, please?

@HWI-ST1023:184:C1V8LACXX:7:1101:1142:2247 2:N:0:TGACCA
GTAGGGGCTGCGCGAACGCAAACCCCCGCTGCCACAAATGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGCCGCCGTATCATTAAAAAAA
+
?1=DBB@DCFFFFIGIIII6DGHHIII6@=AEEDDEEC;@C>@?(;;B;@B?9BCDAA3>(:@@CB+8(9>@:@CCBB289(259@B9B8?A:@C@>CC@B
@HWI-ST1023:184:C1V8LACXX:7:1101:1450:2022 2:N:0:TGACCA
ACGTGCCCTCGGCCAGAAGGCTTGGGGCGCAACTTGCGTTCAAAGACTCGATGGTTCACGGGATTCTGCAATTCACACCAAGTATCGCATTTCGCTACGTT
+
?@@DDDFFADFFHIJIIFG>FHIJJJJJGIIBH=DHGHHDDFFF;AEAC?=>CD-:@CDBDBDBDD>CDDD:ACDCDDDDD?(4>CBBD?@DDDDDDDD8?
@HWI-ST1023:184:C1V8LACXX:7:1101:1457:2047 2:N:0:TGACCA
GCGTCGCCAGCACAGAGGCCATGCGATCCGTCGAGTTATCATGAATCATCAGAGCAACGGGCAGAGCCCGCGTCGACCTTTTATCTAATAAATGCGTCCCT
+
@CCDFFFFGHHHHJIIIJJIJJJJIIJJJJFHIBFBFHIGJJIGI@GHGGEHHHHHHFFDDABDDDDDDDDDDDBDBBBDCCCCCDDDDCDDEECB8<@DD

Sorry if I framed my question wrongly.

So I need to remove the last 5 nucleotides from each sequence, irrespective of whether they are "A"s or not. Sorry if I said otherwise.

I also need to remove the corresponding qualities of those nucleotides, which are the symbol-like characters. For example, in the first sequence, if I'm removing "AAAAA" I also need to remove ">CC@B".

Is it doable? :(

Re: Removing nucleotide from sequence
by space_monk (Chaplain) on Jun 06, 2013 at 14:42 UTC
    You probably want something more refined than this, but something similar to:
    while (<>) {
        # $ denotes end of string
        s/AAAAA$//;
        s/>CC\@B$//;
        print;
    }
    I'm not quite sure whether your sequence is on one line or multiple lines, so the above will need some tweaking. If you need some more help, please amend your question, putting a complete sequence in code blocks.
    If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
      Hey, thanks for that, but the quality and the last five nucleotides aren't the same in every sequence. The last five can also be something like AAACT in another sequence. :(

        Please can you put a complete sequence or two in the comment, with a short explanation of what needs removing. Many of us aren't chemists (I only got a low grade at Chem A level 30 years ago :-)

      local $/ = "";    # paragraph mode: records must be separated by blank lines
      while (<>) {
          # should have a whole record
          # split up 4 line record
          my (@line) = split( "\n");
          # change if necessary to check we have a valid record
          if ($line[0] =~ /^@/) {
              $line[1] =~ s/([ACGT]{5})$//;
              # quality characters include punctuation, so match any character
              $line[3] =~ s/(.{5})$//;
              print join("\n", @line), "\n";
          }
      }
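A small check of which patterns fit the quality line, offered as a sketch rather than a definitive answer. It assumes quality characters can be any printable symbol, as in the ">CC@B" tail quoted in the question; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Tail of the quality string from the first record in the question.
my $qual = 'A:@C@>CC@B';

# \w matches only letters, digits and underscore, so it cannot span
# punctuation such as '>' and '@'.
my $word_chars = $qual =~ /\w{5}$/ ? 'matches' : 'does not match';

# '.' matches any character except a newline, so it fits any quality tail.
( my $trimmed = $qual ) =~ s/.{5}$//;

print "\\w{5}\$ $word_chars\n";
print "trimmed: $trimmed\n";
```

For the sequence line a character class such as [ACGT] is fine as long as the reads contain only those letters; `.{5}` works for both lines without that assumption.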
Re: Removing nucleotide from sequence
by hdb (Parson) on Jun 06, 2013 at 14:46 UTC

    Not sure what you want, really, but to remove the last 5 chars from a $string, you can do

    substr( $string, -5, 5 ) = '';
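As a minimal sketch of the same idea, here is the four-argument form of substr applied to the first sequence from the question. It does the same job as the assignment above and also returns the removed characters. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Sequence line of the first record in the question.
my $seq = 'GTAGGGGCTGCGCGAACGCAAACCCCCGCTGCCACAAATGATCGTCGGACTGTAGAA'
        . 'CTCTGAACGTGTAGATCTCGGTGGCCGCCGTATCATTAAAAAAA';

my $trimmed = $seq;

# Four-argument substr replaces the last 5 characters with the empty
# string and returns the characters that were replaced.
my $removed = substr( $trimmed, -5, 5, '' );

print "removed: $removed\n";
print "kept ", length($trimmed), " of ", length($seq), " characters\n";
```

Applying the same call to the quality line keeps the two lines the same length, which is what downstream tools generally expect.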

    UPDATE: I am still not sure what you want but it seems that there are groups of 4 lines and you want to shorten the second and fourth line of each group by 5 chars?

    use strict;
    use warnings;
    use Data::Dumper;

    my @lines = <DATA>;    # assume groups of four lines
    chomp( @lines );
    my $n = 5;             # to be cut off
    for( my $i=0; $i<@lines; $i+=4 ) {
        substr( $lines[$i+1], -$n, $n ) = '';
        substr( $lines[$i+3], -$n, $n ) = '';
    }
    print Dumper( \@lines );
    __DATA__
    @HWI-ST1023:184:C1V8LACXX:7:1101:1142:2247 2:N:0:TGACCA
    GTAGGGGCTGCGCGAACGCAAACCCCCGCTGCCACAAATGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGCCGCCGTATCATTAAAAAAA
    +
    ?1=DBB@DCFFFFIGIIII6DGHHIII6@=AEEDDEEC;@C>@?(;;B;@B?9BCDAA3>(:@@CB+8(9>@:@CCBB289(259@B9B8?A:@C@>CC@B
    @HWI-ST1023:184:C1V8LACXX:7:1101:1450:2022 2:N:0:TGACCA
    ACGTGCCCTCGGCCAGAAGGCTTGGGGCGCAACTTGCGTTCAAAGACTCGATGGTTCACGGGATTCTGCAATTCACACCAAGTATCGCATTTCGCTACGTT
    +
    ?@@DDDFFADFFHIJIIFG>FHIJJJJJGIIBH=DHGHHDDFFF;AEAC?=>CD-:@CDBDBDBDD>CDDD:ACDCDDDDD?(4>CBBD?@DDDDDDDD8?
    @HWI-ST1023:184:C1V8LACXX:7:1101:1457:2047 2:N:0:TGACCA
    GCGTCGCCAGCACAGAGGCCATGCGATCCGTCGAGTTATCATGAATCATCAGAGCAACGGGCAGAGCCCGCGTCGACCTTTTATCTAATAAATGCGTCCCT
    +
    @CCDFFFFGHHHHJIIIJJIJJJJIIJJJJFHIBFBFHIGJJIGI@GHGGEHHHHHHFFDDABDDDDDDDDDDDBDBBBDCCCCCDDDDCDDEECB8<@DD
      Yes, I'm trying to shorten the 2nd and 4th lines of each group. But there are about 1000 groups. Thanks for your code, I'll try it :)
        Er... sorry, the above reply was by me. Didn't log in :P
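With around 1000 groups (or many more) it may be easier to process the file one record at a time instead of loading every line into an array. The following is a sketch only, assuming each record is exactly four lines as discussed above. The helper name trim_fastq and the sample record are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy four-line records from $in to $out, dropping
# the last $n characters of the sequence line and of the quality line.
sub trim_fastq {
    my ( $in, $out, $n ) = @_;
    while ( defined( my $header = <$in> ) ) {
        my $seq  = <$in>;
        my $plus = <$in>;
        my $qual = <$in>;
        die "Truncated record at input line $.\n" unless defined $qual;
        chomp( $seq, $qual );
        die "Sequence and quality lengths differ near input line $.\n"
            unless length($seq) == length($qual);
        # Reads shorter than $n are left empty instead of calling substr
        # outside the string.
        my $keep = length($seq) > $n ? length($seq) - $n : 0;
        print {$out} $header, substr( $seq,  0, $keep ), "\n",
                     $plus,   substr( $qual, 0, $keep ), "\n";
    }
    return;
}

# Demonstration with in-memory filehandles and a made-up record.
my $input  = "\@read1\nACGTACGTAAAAA\n+\nIIIIIIII>CC\@B\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read from string: $!";
open my $out, '>', \$output or die "Cannot write to string: $!";
trim_fastq( $in, $out, 5 );
close $out;
print $output;
```

For a real file the same helper could be called as trim_fastq( \*STDIN, \*STDOUT, 5 ) and the script run with shell redirection for input and output. Only one record is held in memory at a time, so the number of groups does not matter.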

Node Type: perlquestion [id://1037465]
Front-paged by Arunbear