Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

How to match the sequences with headers

by anonym (Acolyte)
on Oct 23, 2011 at 12:57 UTC ( [id://933202]=perlquestion: print w/replies, xml ) Need Help??

anonym has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a multifasta file that has mix of protein amino acid fasta sequences and protein secondary structures fasta sequences.I have made four arrays to put the header lines for amino acid sequences into one array, the header lines of secondary structures into another array, the amino acid sequences into another array, the secondary structures into one array. My code is:

my @headerseq; my @headersec; my @sequence; my @secseq; while (<IFILE>) { chomp; if ($_ =~ /sequence/){ @headerseq = split('\n',$_); # print "@headerseq\n"; } elsif ($_ =~ /secstr/){ @headersec = split ('\n',$_); # print "@headersec\n"; } elsif ($_ =~ /^(\w*)$/) { @sequence = $_; # print "@sequence\n"; } else { @secseq = split ('\n',$_); # print "@secseq\n"; } foreach my $headerseq(@headerseq){ foreach my $sequence(@sequence) { print OFILE1 "$headerseq\n$sequence\n"; } } }

However, this foreach loop is not matching the header lines of protein sequences in @headerseq array with the sequences in @sequence array properly.How should I proceed with this? I want to match the headers of protein seqs with protein sequences in outfile and in another outfile,I wish to match the secondary structures header lines with the Sec str sequences.Thanks..

Replies are listed 'Best First'.
Re: How to match the sequences with headers
by AnomalousMonk (Archbishop) on Oct 23, 2011 at 16:52 UTC
    while (<IFILE>) { ... elsif ($_ =~ /^(\w*)$/) { @sequence = $_; } ... foreach my $sequence(@sequence) { print OFILE1 "$headerseq\n$sequence\n"; } ... }

    I'm not familiar with processing FASTA sequence data, but a couple of things strike me in the code fragment originally posted by anonym:

    • The statement
          @sequence = $_;
      will only assign a single element to the  @sequence array which, by the operation of the posted code alone, will never contain other than zero or one elements. Then
          foreach my $sequence(@sequence) {
          print OFILE1 "$headerseq\n$sequence\n";
          }
      loops over this zero or one element array to output to a file. This seems suspicious.
    • I'm assuming that other code sets up  $/ to read multi-line FASTA records. The regex  /^(\w*)$/ in the  elsif condition statement above will only match records that have no whitespace (except that  $ could match a newline at the very end of the string), but all the records in the sample input shown by anonym have embedded whitespace.

      Thanks for your replies.I did get the two output files now ,one having the headers and associated protein sequences and the other with headers of Sec structures and associated structures.

      if ($_ =~ /sequence/){ @headerseq = split('\n',$_); print OFILE1 "@headerseq\n"; } elsif ($_ =~ /^(\w*)$/){ @sequence = $_; print OFILE1 "@sequence\n"; } elsif ($_ =~ /secstr/) { @headersec = split('\n',$_); print OFILE2 "@headersec\n"; } else { @secseq = $_; print OFILE2 "@secseq\n"; }
Re: How to match the sequences with headers
by Anonymous Monk on Oct 23, 2011 at 13:47 UTC

      The sample input is:

      >101M:A:sequence MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVL +TALGA ILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAA +KYKEL GYQG >101M:A:secstr HHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH GGGGGG TTTTT SHHHHHH HHHHHHHHHHH +HHHHH HHTTTT HHHHHHHHHHHHHTS HHHHHHHHHHHHHHHHHH GGG SHHHHHHHHHHHHHHHHHHHH +HHHHT T >102L:A:sequence MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQ +DVDAA VRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPN +RAKRV ITTFRTGTWDAYKNL >102L:A:secstr HHHHHHHHH EEEEEE TTS EEEETTEEEESSS TTTHHHHHHHHHHTS TTB HHHHHHHHHH +HHHHH HHHHHH TTHHHHHHHS HHHHHHHHHHHHHHHHHHHHT HHHHHHHHTT HHHHHHHHHSSHHHHHSHH +HHHHH HHHHHHSSSGGG

      The first output should be all protein fasta sequences like:

      >101M:A:sequence MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVL +TALGA ILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAA +KYKEL GYQG >102L:A:sequence MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQ +DVDAA VRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPN +RAKRV ITTFRTGTWDAYKNL

      The second Output file should hav secondary structures:

      >101M:A:secstr HHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH GGGGGG TTTTT SHHHHHH HHHHHHHHHHH +HHHHH HHTTTT HHHHHHHHHHHHHTS HHHHHHHHHHHHHHHHHH GGG SHHHHHHHHHHHHHHHHHHHH +HHHHT T >102L:A:secstr HHHHHHHHH EEEEEE TTS EEEETTEEEESSS TTTHHHHHHHHHHTS TTB HHHHHHHHHH +HHHHH HHHHHH TTHHHHHHHS HHHHHHHHHHHHHHHHHHHHT HHHHHHHHTT HHHHHHHHHSSHHHHHSHH +HHHHH HHHHHHSSSGGG
      Thanks..

        At the beginning of your program put in $/="\n\n";

        Then inside the while loop if $_ matches 'sequence' print $_ to out1; if it matches 'secstr' print $_ to out2.

Re: How to match the sequences with headers
by Lotus1 (Vicar) on Oct 23, 2011 at 14:57 UTC
    while (<IFILE>) { chomp; if ($_ =~ /sequence/){ @headerseq = split('\n',$_);

    You don't show where you have set $/ to some kind of record seperator so you are reading in one line at a time. That means split('\n',$_) will return one element, the whole line. Each time you are setting @headerseq to the line that matches /sequence/.

    The same applies to the other arrays. You are not actually using them as arrays since they only ever hold one element in this setup.

Re: How to match the sequences with headers
by Anonymous Monk on Oct 23, 2011 at 14:34 UTC

    You are sequentially inputting your data into four discrete arrays. Instead, you should input into a complex data structure, such as an array of arrays (perllol, more generally perldsc), to keep coherency.

      I am sorry but I dint understand how you put all the four arrays into array of arrays without going with regex.Thanks..

Re: How to match the sequences with headers
by Cristoforo (Curate) on Jun 14, 2012 at 04:21 UTC
    Note how simple the BioPerl solution is.

    #!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -file => "933212.txt" , -format => 'fasta'); my $out_seq = Bio::SeqIO->new( -file => '>sequence.dat', -format => 'fasta'); my $out_secstr = Bio::SeqIO->new( -file => '>secstr.dat', -format => 'fasta'); while ( my $seq = $in->next_seq() ) { if ($seq->id =~ /sequence$/) { $out_seq->write_seq($seq); } elsif ($seq->id =~ /secstr$/) { $out_secstr->write_seq($seq); } }

    Chris

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://933202]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (4)
As of 2024-03-28 20:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found