Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Print the data following few specific lines in perl

by ad23 (Acolyte)
on Jul 01, 2010 at 16:22 UTC ( [id://847547]=perlquestion: print w/replies, xml ) Need Help??

ad23 has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I am trying to parse a fasta file using perl. The following is the input file:

>CVSF43565.d1 bg|346278 CAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGG ATCCGGGTAAGCAACGGGTCCACATATCTCCACAATCTCATAAGGGGCCAACATAGCGGGGGAGCTAACT TGCCTTTGATTCCAAACCGTTGCACTCCTTTGGTCGGGGAAACTCGAAGGTACACATGATCACCAAGGTC GAACTGCAGGGGTCTTCTCCGCTGGTCGGAGTAGCTCTTCTGTCAAGATTGGGCGGCCTTGAGATGTGCT TGAATTACCTTCACTTGCTCTTCGGCTTCTGCCACTTAAGTCAGGGCCATAGACCTGTCTCTCCCCTGGG CAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGG ATCCGGGTAAGC >CVSF43566.d1 bg|346279 CAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGG ATCCGGGTAAGCCAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGT TCTGGGAGTTGGAATGCTAGTCGATCGCCAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATG TTGGCAGACAACTGTTCTGGGAGTTGGATCCGGGTAAGCCAGACACACTTCTTTTAGTTGAGACACATGG AAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGGATCCGGGTAAGC >CVSF43567.d1 bg|346280 CGTAGCTGATGCTGTGCTGTTGTGTCGGGGGGATATATATATATATATGGGGTCGTAGTCGTAGCGCTAG TATGCTAGCAGCGTAGATGCTGATCGATGCTGATGCTGATCGTAGTCGTAGGCTAGTGCGATCGTAGTCG TAGTCGATGCTGATGCGTAGCTGATGTGCTGCTGATGCTAGTCGTCGTAGCTGATGCATGCTGATCGTAG TGCTCGATGCTAGTCGTAGTCGTAGTCGTAGCGACTGATGCGATCGTAGTCGGATGCTAGCACGTAGCTG GCTCGATGCTGATGCTGAT >CVSF10000.x1 bg|356789 pair:789860 ATGCGTAGCTGATGTGCTGCTGATGCTAGTCGTCGTAGCTGATGCATGCTGATCGTAGTGCTCGATGCTA GTCGTAGTCGTAGTCGTAGCGACTGATGCGATCGTAGTCGGATGATGCTGACTGATGCTGATCTGTACGT CGTAGCTGATGCATGCGCTAGTAGCT >CVSF10000.y1 bg|356790 pair:789859 GCTAGTCGATGCTGATGCTGTAGCTAGCGTAGTCGTACGCGCGCGCGCGCGTTTTTTGTGACGTCGTAGT CCGTAGCTGATGCGATGCTAGTGCTGTGTCAGCTGATGTCGTGTGTAGCTGATGCTGATCGTTCGTGTGT CGATGCTGATGCTAGTCGTAGTGTAT >CVSF10001.x1 bg|356791 pair:789862 AGTCGTAGTCGTAGCTGTAGCTGATGCTGTGTACGATGCTGATGCGATGCGTAGCGTAGCATCGATGCTA CGACTAGTCGTAGTCGTC >CVSF10001.y1 bg|356792 pair:789861 CGTAGCTGATGCTGATCGTAGTCGTAGTCGATGCGATGCTAGTCGTAGCTGTAGCTGATGCTGCGTGCTG CAGTCGATGCTAGTCGATGCTGATCGTCTAGCAT

I want to write the lines(and the data that follows) with "pairs" field in one file and the lines without "pairs" field in another.

However, with the following code I am only able to write the header lines. But I also want the data following the header line(ATGCTAGCTG....) to be included in the output files.

Any inputs??

#!/usr/bin/perl my $in = $ARGV[0]; my $p = $ARGV[1]; my $s = $ARGV[2]; open IN, "<$in" or die $!; open P_OUT, ">$p" or die $!; open S_OUT, ">$s" or die $!; while(<IN>){ chomp; if(/^>/){ my @header = split / /; if($header[2] ne ''){ print P_OUT "$header[0]"." "."$header[1]"." "."$header[2]\n"; } else{ print S_OUT "$header[0]"." "."$header[1]\n"; } } #unless(/^>/){ #print OUT "$_\n"; #next; #} } close(IN); close(P_OUT); close(S_OUT);

Thanks!!!

Replies are listed 'Best First'.
Re: Print the data following few specific lines in perl
by BioLion (Curate) on Jul 01, 2010 at 16:36 UTC

    A different approach might be to use the $/ variable ( see perlvar for details ) to read in each record and then test whether it has the 'pairs' field:

    use warnings; use strict; { ## modify input record separator to read ## each FASTA record separately local $/ = '>'; ## test each record, printing out if it has ## pair field while (<DATA>){ print $_ if ($_ =~ /pair:\d+/); } } __DATA__ >CVSF43567.d1 bg|346280 CGTAGCTGATGCTGTGCTGTTGTGTCGGGGGGATATATATATATATATGGGGTCGTAGTCGTAGCGCTAG TATGCTAGCAGCGTAGATGCTGATCGATGCTGATGCTGATCGTAGTCGTAGGCTAGTGCGATCGTAGTCG TAGTCGATGCTGATGCGTAGCTGATGTGCTGCTGATGCTAGTCGTCGTAGCTGATGCATGCTGATCGTAG TGCTCGATGCTAGTCGTAGTCGTAGTCGTAGCGACTGATGCGATCGTAGTCGGATGCTAGCACGTAGCTG GCTCGATGCTGATGCTGAT >CVSF10000.x1 bg|356789 pair:789860 ATGCGTAGCTGATGTGCTGCTGATGCTAGTCGTCGTAGCTGATGCATGCTGATCGTAGTGCTCGATGCTA GTCGTAGTCGTAGTCGTAGCGACTGATGCGATCGTAGTCGGATGATGCTGACTGATGCTGATCTGTACGT CGTAGCTGATGCATGCGCTAGTAGCT >CVSF10000.y1 bg|356790 pair:789859 GCTAGTCGATGCTGATGCTGTAGCTAGCGTAGTCGTACGCGCGCGCGCGCGTTTTTTGTGACGTCGTAGT CCGTAGCTGATGCGATGCTAGTGCTGTGTCAGCTGATGTCGTGTGTAGCTGATGCTGATCGTTCGTGTGT CGATGCTGATGCTAGTCGTAGTGTAT

    In this case I am simply testing for the presence of the 'pair' field using a regex (see perlre). Hope this helps!

    Just a something something...

      Thank you for your help.

      The code which I am trying to run - what is going wrong in there?? I am a beginner, and it would really help if I know how to modify my code to get the desired output??

      Cheers!

        It isn't really a case of right or wrong - you code is doing exactly what you are telling it to - sorting the header lines based on the descriptor field. As you will see all over the place, TIMTOWTDI (there is more than one way to do it) is a core part of Perl programming, so while your method could be made to work, it would be much more complicated than needed (you would need to buffer the squence associated with each header and print them out together, based on testing the header) and ultimately much harder to maintain etc...

        Just try to work out (on paper if needs be) what steps you need to achieve, than code each one of those steps - shortcuts/better/different solutions may occur to you as you work, but the important thing is that the code is clear and well documented.

        Maybe this doesn't answer your question, but really it is all about practice and experience, so the fact that you have made a good attempt shows that in time you'll be laughing at these sort of problems. Keep plugging away and if you get stuck again, come to the Monastery!

        Just a something something...
Re: Print the data following few specific lines in perl
by toolic (Bishop) on Jul 01, 2010 at 18:06 UTC
    You can use select to choose between your 2 output files:
    use strict; use warnings; my $in = $ARGV[0]; my $p = $ARGV[1]; my $s = $ARGV[2]; open IN , "<$in" or die $!; open P_OUT, ">$p" or die $!; open S_OUT, ">$s" or die $!; while (<IN>) { if (/^>/) { if (/pair:/) { select P_OUT; } else{ select S_OUT; } } print; } close(IN); close(P_OUT); close(S_OUT);

      Thanks for your help.

      But all the above codes are printing just the header lines. I still am not able to print the data after the header lines for each record.

      Am I am not seeing the obvious???

        It seems to work for me.

        Download my code and save it as a file named 847578.pl. Then download your data and save it in a file named fasta.txt. Then execute this at your command line:

        perl ./847578.pl fasta.txt p.txt s.txt

        I get 2 output files. This is the contents of file p.txt:

        >CVSF10000.x1 bg|356789 pair:789860 ATGCGTAGCTGATGTGCTGCTGATGCTAGTCGTCGTAGCTGATGCATGCTGATCGTAGTGCTCGATGCTA GTCGTAGTCGTAGTCGTAGCGACTGATGCGATCGTAGTCGGATGATGCTGACTGATGCTGATCTGTACGT CGTAGCTGATGCATGCGCTAGTAGCT >CVSF10000.y1 bg|356790 pair:789859 GCTAGTCGATGCTGATGCTGTAGCTAGCGTAGTCGTACGCGCGCGCGCGCGTTTTTTGTGACGTCGTAGT CCGTAGCTGATGCGATGCTAGTGCTGTGTCAGCTGATGTCGTGTGTAGCTGATGCTGATCGTTCGTGTGT CGATGCTGATGCTAGTCGTAGTGTAT >CVSF10001.x1 bg|356791 pair:789862 AGTCGTAGTCGTAGCTGTAGCTGATGCTGTGTACGATGCTGATGCGATGCGTAGCGTAGCATCGATGCTA CGACTAGTCGTAGTCGTC >CVSF10001.y1 bg|356792 pair:789861 CGTAGCTGATGCTGATCGTAGTCGTAGTCGATGCGATGCTAGTCGTAGCTGTAGCTGATGCTGCGTGCTG CAGTCGATGCTAGTCGATGCTGATCGTCTAGCAT

        This is the contents of file s.txt:

        >CVSF43565.d1 bg|346278 CAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGG ATCCGGGTAAGCAACGGGTCCACATATCTCCACAATCTCATAAGGGGCCAACATAGCGGGGGAGCTAACT TGCCTTTGATTCCAAACCGTTGCACTCCTTTGGTCGGGGAAACTCGAAGGTACACATGATCACCAAGGTC GAACTGCAGGGGTCTTCTCCGCTGGTCGGAGTAGCTCTTCTGTCAAGATTGGGCGGCCTTGAGATGTGCT TGAATTACCTTCACTTGCTCTTCGGCTTCTGCCACTTAAGTCAGGGCCATAGACCTGTCTCTCCCCTGGG CAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGG ATCCGGGTAAGC >CVSF43566.d1 bg|346279 CAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGG ATCCGGGTAAGCCAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATGTATGGCAGACAACTGT TCTGGGAGTTGGAATGCTAGTCGATCGCCAGACACACTTCTTTTAGTTGAGACACATGGAAAACATCATG TTGGCAGACAACTGTTCTGGGAGTTGGATCCGGGTAAGCCAGACACACTTCTTTTAGTTGAGACACATGG AAAACATCATGTATGGCAGACAACTGTTCTGGGAGTTGGATCCGGGTAAGC >CVSF43567.d1 bg|346280 CGTAGCTGATGCTGTGCTGTTGTGTCGGGGGGATATATATATATATATGGGGTCGTAGTCGTAGCGCTAG TATGCTAGCAGCGTAGATGCTGATCGATGCTGATGCTGATCGTAGTCGTAGGCTAGTGCGATCGTAGTCG TAGTCGATGCTGATGCGTAGCTGATGTGCTGCTGATGCTAGTCGTCGTAGCTGATGCATGCTGATCGTAG TGCTCGATGCTAGTCGTAGTCGTAGTCGTAGCGACTGATGCGATCGTAGTCGGATGCTAGCACGTAGCTG GCTCGATGCTGATGCTGAT
Re: Print the data following few specific lines in perl
by biohisham (Priest) on Jul 01, 2010 at 23:37 UTC
    A simpler and more robust approach is to let BioPerl handle that for you since it can understand the FastA format quite well and that it has good measures for IO operations, check Perl and Bioinformatics for a general introductory.

    The code below reads into two hashes the provided sequences depending on whether they have pair info or not at the descriptor portion of their corresponding records, writing each one of these hashes into a file is left as an exercise for the OP..

    use strict; use warnings; use Bio::SeqIO; my $file = "FastA.txt"; my $seq_obj = Bio::SeqIO->new(-file =>"<$file",-format=>'fasta',); my %seqPair; my %seq_NoPair; while(my $seq = $seq_obj->next_seq){ #iterate through records if($seq->desc =~ /.*?pair.*?/i){ $seqPair{$seq->id}=$seq->seq; }else{ $seq_NoPair{$seq->id}=$seq->seq; } } #use Data::Dump qw(pp); #print pp \%seqPair; #print pp \%seq_NoPair;


    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://847547]
Approved by BioLion
Front-paged by biohisham
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2025-03-18 13:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    When you first encountered Perl, which feature amazed you the most?










    Results (57 votes). Check out past polls.