PerlMonks  

splitting fasta file into individual fasta files

by lomSpace (Scribe)
on May 06, 2009 at 21:18 UTC

lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I am using BioPerl to parse a file with two fasta sequences.
The separator is '>'.
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
use Data::Dumper;

my $file = "C:/Documents and Settings/mydir/fasta.contigs";
my $seqIO = Bio::SeqIO->new({-file=>'$file', -format=>'fasta'});
while (my $seq = $seqIO->next_seq()) {
    my $out = Bio::SeqIO->new( -file => ">$file"."_fasta", -format => 'Fasta');
    $out->write_seq($seq);
}
__DATA__
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

Any ideas?

Replies are listed 'Best First'.
Re: splitting fasta file into individual fasta files
by BrowserUk (Patriarch) on May 06, 2009 at 21:39 UTC
Re: splitting fasta file into individual fasta files
by moritz (Cardinal) on May 06, 2009 at 21:32 UTC
    Any ideas?

    On what? If you asked a question, we could have answered it, maybe.

      How can I split a file
      __DATA__
      >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
      TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
      GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
      GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
      GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
      AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
      ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
      >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
      TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
      TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
      GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
      TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
      AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

      based on the delimiter '>' and then print the lines
      in between '>' and the next '>' then continue until I have
      separate files like this
      file1:
      >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
      TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
      GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
      GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
      GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
      AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
      ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
      And file2:
      >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
      TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
      TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
      GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
      TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
      AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA
      I am running into some trouble with this code:
      #!/usr/bin/perl
      use warnings;
      use strict;
      use Data::Dumper;
      #
      # split a fasta file into separate sequence files
      #
      open( my $seqs, "C:/Documents and Settings/mydir/13063_fasta.contigs");
      open(my $seq_out,">C:/Documents and Settings/mydir/contig.fa" );
      $/ = '\777';   # entire input to be read in one slurp
      $seqs = <>;    # read input, assigning to single string
      while (<$seqs>){
          if($seqs =~ m/^(>[^>]+)/mg) {   # match indiv. sequences by '>'s
              push(my @seqs,$1);          # and store in array
          }
          for (my @seqs) {
              # only allow characters A-Z,a-z,0-9,'_','-', and '.' in names;
              # change if you're more liberal
              /^> *([\w\-\.]+)/ && (my $seq_name = $1);
              if ($seq_name) {
                  open($seq_out,">$seq_name");
                  print $seq_out "$_";
              } else {
                  warn "couldn't recognise the sequence name in \n$_";
              }
          }
      }
      close($seqs);
      close($seq_out);
      Sorry for not being specific enough :-)
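      For reference, a few things in the script above likely account for the trouble: '\777' in single quotes is the literal four-character string \777 rather than a slurp setting; $seqs = <> reads from STDIN/@ARGV and overwrites the filehandle; push(my @seqs, ...) and for (my @seqs) each declare a fresh, empty array; and for these headers the name pattern captures only "i", because ':' is not in the allowed character set. Below is a minimal sketch of the same slurp-and-match idea with those points addressed. The helper names fasta_records and record_name, the shortened sample data, and the choice to name files after the last path component of the header are all assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Sample input standing in for the real contigs file (sequences shortened).
my $sample = <<'END';
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
TTTCTCCGGCCCCCTCC
GCAGGGGTTTTTTGGTT
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
TAAACCCAAGGCCCCCC
END

# Hypothetical helper: slurp a filehandle and return one string per record.
sub fasta_records {
    my ($fh) = @_;
    my $text = do { local $/; <$fh> };   # undefined $/ reads the whole input
    return $text =~ m/^(>[^>]+)/mg;      # list context + /g returns every capture
}

# Hypothetical helper: derive a file name from the last path component of
# the header line, allowing only A-Z, a-z, 0-9, '_', '-' and '.'.
sub record_name {
    my ($record) = @_;
    my ($header) = split /\n/, $record;
    return $header =~ m{([\w.\-]+)\s*\z} ? $1 : undef;
}

# An in-memory filehandle keeps the sketch self-contained.
open my $in, '<', \$sample or die "Can't read sample: $!";
my @records = fasta_records($in);
close $in;

for my $record (@records) {
    my $name = record_name($record);
    if (defined $name) {
        # A real script would open "$name.fa" for writing here and print $record to it.
        print "would write ", length($record), " bytes to $name.fa\n";
    }
    else {
        warn "couldn't recognise the sequence name in\n$record";
    }
}
```

      Note that the [^>]+ pattern assumes '>' never appears inside a header or sequence line.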

        Actually I was interested in what kinds of trouble you ran into, but anyway...

        There's a trick in Perl that makes that quite easy: you can set the input record separator ($/) to '>', so that the <DATA> iterator gives you chunks split at each '>':

        use strict;
        use warnings;

        local $/ = '>';
        my $c = 0;
        while (<DATA>) {
            chomp;
            next unless length;
            my $fn = "file" . ++$c;
            open my $handle, '>', $fn
                or die "Can't open `$fn' for writing: $!";
            print $handle '>', $_;
            close $handle or warn $!;
        }
        __DATA__
        >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
        TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
        GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
        GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
        GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
        AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
        ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
        >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
        TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
        TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
        GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
        TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
        AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA
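        The record-separator approach also works on a file on disk rather than the DATA section. The following is a rough sketch under stated assumptions: the sub name split_fasta, the contigN.fa naming scheme, and the temporary demo directory with shortened sequences are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: split the multi-FASTA file $in_path into one file per
# record inside $out_dir, and return the list of files written.
sub split_fasta {
    my ($in_path, $out_dir) = @_;
    open my $in, '<', $in_path
        or die "Can't open '$in_path' for reading: $!";

    local $/ = '>';                  # each read now ends at the next '>'
    my @written;
    while (my $chunk = <$in>) {
        chomp $chunk;                # chomp strips the trailing $/, i.e. '>'
        next unless $chunk =~ /\S/;  # skip the empty chunk before the first '>'

        my $name     = 'contig' . (@written + 1) . '.fa';
        my $out_path = File::Spec->catfile($out_dir, $name);
        open my $out, '>', $out_path
            or die "Can't open '$out_path' for writing: $!";
        print {$out} '>', $chunk;    # put back the '>' that chomp removed
        close $out or warn "Can't close '$out_path': $!";
        push @written, $out_path;
    }
    close $in;
    return @written;
}

# Demo in a throwaway directory that is deleted when the script exits.
my $dir   = tempdir(CLEANUP => 1);
my $input = File::Spec->catfile($dir, 'fasta.contigs');
open my $fh, '>', $input or die "Can't create '$input': $!";
print {$fh} ">Contig1\nTTTCTCC\nGCAGGGG\n>Contig2\nTAAACCC\n";
close $fh;

my @files = split_fasta($input, $dir);
print "$_\n" for @files;
```

        Because $/ is set with local inside the sub, the change does not leak into the rest of the program. As with the DATA version, a '>' inside a header line would be treated as a record boundary.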

        You can use Tie::File::AnyData::Bio::Fasta to facilitate this task. This script reads a multi-fasta file and splits it into multiple single-fasta files:

        use strict;
        use warnings;
        use Tie::File::AnyData::Bio::Fasta;

        tie my @in, 'Tie::File::AnyData::Bio::Fasta', shift @ARGV or die $!;
        my $n = 0;
        for my $seq (@in){
            tie my @out, 'Tie::File::AnyData::Bio::Fasta', "$n.fa" or die $!;
            $n++;
            @out = ($seq);
            untie @out;
        }
        untie @in;

        citromatik

Re: splitting fasta file into individual fasta files
by bichonfrise74 (Vicar) on May 07, 2009 at 00:53 UTC
    Try something in this manner...
    #!/usr/bin/perl
    use strict;

    my @file = do { local $/ = ">"; <DATA> };
    print map { "$_\n" } @file;
    __DATA__
    >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
    TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
    GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
    GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
    GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
    AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
    ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
    >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
    TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
    TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
    GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
    TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
    AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

      Beware that, with your code, the first element of the array is a single ">" character, and every other element except the last has the ">" at the end rather than at the start.

      citromatik
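      One way to tidy those chunks up, sketched here with made-up, shortened sample data read through an in-memory filehandle instead of DATA: chomp the list while $/ is still ">", drop the empty first chunk, and put the ">" back at the front of each record.

```perl
use strict;
use warnings;

# Shortened stand-in for the contigs data.
my $sample = ">Contig1\nTTTCTCC\n>Contig2\nTAAACCC\n";
open my $fh, '<', \$sample or die "Can't read sample: $!";

my @records = do {
    local $/ = '>';
    my @chunks = <$fh>;   # ('>', "Contig1\nTTTCTCC\n>", "Contig2\nTAAACCC\n")
    chomp @chunks;        # removes the trailing '>' because $/ is still '>'
    # Drop the now-empty first chunk and restore the leading '>'.
    map { ">$_" } grep { length } @chunks;
};
close $fh;

print for @records;
```

      The chomp has to happen inside the block: once the local $/ goes out of scope, chomp would strip newlines again instead of ">".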

      Thank you bichonfrise74 for sharing your wisdom with a neophyte!
