Beefy Boxes and Bandwidth Generously Provided by pair Networks vroom
Perl Monk, Perl Meditation
 
PerlMonks  

splitting fasta file into individual fasta files

by lomSpace (Scribe)
on May 06, 2009 at 21:18 UTC ( #762397=perlquestion: print w/ replies, xml ) Need Help??
lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I am using BioPerl to parse a file with two fasta sequences.
The separator is '>'.
#!/usr/bin/perl use warnings; use strict; use Bio::SeqIO; use Data::Dumper; my $file = "C:/Documents and Settings/mydir/fasta.contigs"; my $seqIO = Bio::SeqIO->new({-file=>'$file', -format=>'fasta'}); while (my $seq = $seqIO->next_seq()) { my $out = Bio::SeqIO->new( -file => ">$file"."_fasta", -format => 'Fasta'); $out->write_seq($seq); } __DATA__ >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1 TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2 TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

Any ideas?

Comment on splitting fasta file into individual fasta files
Download Code
Re: splitting fasta file into individual fasta files
by moritz (Cardinal) on May 06, 2009 at 21:32 UTC
    Any ideas?

    On what? If you asked a question, we could have answered it, maybe.

      How can I split a file
      __DATA__ >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1 TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2 TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

      based on the delimiter '>' and then print the lines
      in between '>' and the next '>' then continue until I have
      separate files like this
      file1:
      >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1 TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
      And file2:
      >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2 TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA
      I am running into some trouble with this code:
      #!/usr/bin/perl use warnings; use strict; use Data::Dumper; # # split a fasta file into separate sequence files # open( my $seqs, "C:/Documents and Settings/mydir/13063_fasta.contigs") +; open(my $seq_out,">C:/Documents and Settings/mydir/contig.fa" ); $/ = '\777'; # entire input to be read in one slurp $seqs = <>; # read input, assigning to single string while (<$seqs>){ if($seqs =~ m/^(>[^>]+)/mg) { # match indiv. sequences by '>'s push(my @seqs,$1); # and store in array } for (my @seqs) { # only allow characters A-Z,a-z,0-9,'_','-', and '.' in names; # change if you're more liberal /^> *([\w\-\.]+)/ && (my $seq_name = $1); if ($seq_name) { open($seq_out,">$seq_name"); print $seq_out "$_"; } else { warn "couldn't recognise the sequence name in \n$_"; } } } close($seqs); close($seq_out);
      Sorry for not being specific enough :-)

        Actually I was interested in what kinds of trouble you ran into, but anyway...

        There's a trick in Perl that makes that quite easy: you can set the input record separator to '>', so that the <DATA> iterator gives you chunks splitted by '>':

        use strict; use warnings; local $/ = '>'; my $c = 0; while (<DATA>) { chomp; next unless length; my $fn = "file" . ++$c; open my $handle, '>', $fn or die "Can't open `$fn' for writing: $! +"; print $handle '>', $_; close $handle or warn $!; } __DATA__ >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1 TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2 TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

        You can use Tie::File::AnyData::Bio::Fasta to facilitate this task. This script reads a multi-fasta file and splits it in multiple single-fasta files:

        use strict; use warnings; use Tie::File::AnyData::Bio::Fasta; tie my @in,'Tie::File::AnyData::Bio::Fasta', shift @ARGV, or die $!; my $n = 0; for my $seq (@in){ tie my @out, 'Tie::File::AnyData::Bio::Fasta', "$n.fa" or die $!; $n++; @out = ($seq); untie @out; } untie @in;

        citromatik

Re: splitting fasta file into individual fasta files
by BrowserUk (Pope) on May 06, 2009 at 21:39 UTC
Re: splitting fasta file into individual fasta files
by bichonfrise74 (Vicar) on May 07, 2009 at 00:53 UTC
    Try something in this manner...
    #!/usr/bin/perl use strict; my @file = do { local $/ = ">"; <DATA> }; print map { "$_\n" } @file; __DATA__ >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1 TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2 TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

      Beware that, with your code, the first element of the array is a single ">" character, and the rest of elements (but the last) will have the ">" at the end

      citromatik

      Thank you bichonfrise74 for sharing your wisdom with a the neophyte!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://762397]
Approved by ikegami
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (4)
As of 2014-04-21 04:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (490 votes), past polls