PerlMonks  

Re^2: splitting fasta file into individual fasta files

by lomSpace (Scribe)
on May 06, 2009 at 21:57 UTC (#762415)


in reply to Re: splitting fasta file into individual fasta files
in thread splitting fasta file into individual fasta files

How can I split a file

__DATA__
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

based on the delimiter '>', printing the lines from each '>'
up to the next '>' into their own file, and continuing until I have
separate files like this:
file1:
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT

And file2:
>i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

I am running into some trouble with this code:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
#
# split a fasta file into separate sequence files
#
open( my $seqs, "C:/Documents and Settings/mydir/13063_fasta.contigs");
open(my $seq_out,">C:/Documents and Settings/mydir/contig.fa" );
$/ = '\777'; # entire input to be read in one slurp
$seqs = <>; # read input, assigning to single string
while (<$seqs>){
    if($seqs =~ m/^(>[^>]+)/mg) { # match indiv. sequences by '>'s
        push(my @seqs,$1); # and store in array
    }
    for (my @seqs) {
        # only allow characters A-Z,a-z,0-9,'_','-', and '.' in names;
        # change if you're more liberal
        /^> *([\w\-\.]+)/ && (my $seq_name = $1);
        if ($seq_name) {
            open($seq_out,">$seq_name");
            print $seq_out "$_";
        } else {
            warn "couldn't recognise the sequence name in \n$_";
        }
    }
}
close($seqs);
close($seq_out);

Sorry for not being specific enough :-)


Re^3: splitting fasta file into individual fasta files
by moritz (Cardinal) on May 06, 2009 at 22:07 UTC

    Actually I was interested in what kinds of trouble you ran into, but anyway...

    There's a trick in Perl that makes this quite easy: you can set the input record separator ($/) to '>', so that the <DATA> iterator returns chunks split on '>':

    use strict;
    use warnings;

    local $/ = '>';
    my $c = 0;
    while (<DATA>) {
        chomp;
        next unless length;
        my $fn = "file" . ++$c;
        open my $handle, '>', $fn
            or die "Can't open `$fn' for writing: $!";
        print $handle '>', $_;
        close $handle or warn $!;
    }
    __DATA__
    >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig1
    TTTCTCCGGCCCCCTCCTCCCGCGGGGGAAAAAACCCGGGGAGCAGTCGG
    GCAGGGGTTTTTTGGTTTTTTCAAAATAAAAAGGGGTGCCCGTTGGGGGGcd
    GGGGGGGTGCAGGTTTCAACCCCCCCCCCCAAAGAAAAAAAAATTTTGGG
    GAATTTTTGGGGGGCTCCACCAGTTTTCGGGGTTTTTGGGCCTTTTCAGA
    AGGTAGGTTGGACGCGGATTGGGCAATAAACCACCCCGCTTCATCGGATA
    ATTTTCCCCGGCCGAAAAGGGCCGCGGGGCCGGTGGGCGGCCTTGGGTTT
    >i:/13414 Ccl9 (88)/sequencing/13414_fasta.Contig2
    TAAACCCAAGGCCCCCCAGGTAAAAAAAAAACCGGCCAGGGGGGGGGGGG
    TAAAAAAAACCAAGTGTCACCCAGGGTGGAGATCCCCGGAAAAGGAAAAG
    GGGGGTTTTTTATTCGAAACGGGGAAAACTTTCACAAAATTTTGGAAGAA
    TCCCCTTTAATGTTTTCTTTTCAAAAGGGGGTAAAAAAACCACCTTTAAA
    AAGAAGTCTACCTTGGGAAAAAATAATTTTTGGGAAAATTTAAAAATTGA

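The loop above reads from the DATA section and names its output file1, file2, and so on. As a rough sketch only, the same record-separator trick can be pointed at a real file, with each output file named after its header. The helper name `split_fasta`, the `.fa` suffix, and the rule "take the last word of the header line" are all assumptions made for illustration, not something the thread specifies. (The name pattern in the original question, `/^> *([\w\-\.]+)/`, would stop at the ':' in these headers and capture only "i", which is why the sketch matches at the end of the header instead.)

```perl
use strict;
use warnings;

# Hypothetical helper: split the multi-FASTA file at $path into one file per
# record inside $out_dir, and return the list of files written.
sub split_fasta {
    my ($path, $out_dir) = @_;
    open my $in, '<', $path or die "Can't open '$path' for reading: $!";
    local $/ = '>';                    # each read now ends at the next '>'
    my @written;
    while (my $record = <$in>) {
        chomp $record;                 # strips the trailing '>' (the current $/)
        next unless $record =~ /\S/;   # skip the empty chunk before the first '>'

        # Assumed naming rule: last word of the header line,
        # e.g. "13414_fasta.Contig1"; fall back to a counter otherwise.
        my ($header) = $record =~ /\A([^\r\n]*)/;
        my ($name)   = $header =~ /([\w.\-]+)\s*\z/;
        $name = 'record' . (@written + 1) unless defined $name;

        my $fn = "$out_dir/$name.fa";
        open my $out, '>', $fn or die "Can't open '$fn' for writing: $!";
        print {$out} '>', $record;     # put back the '>' that chomp removed
        close $out or warn "Can't close '$fn': $!";
        push @written, $fn;
    }
    close $in;
    return @written;
}

# Usage: perl split_fasta.pl input.fasta [output_dir]
if (@ARGV) {
    my ($path, $out_dir) = @ARGV;
    print "$_\n" for split_fasta($path, defined $out_dir ? $out_dir : '.');
}
```

Two caveats with this sketch: records whose headers end in the same word overwrite each other, and a '>' appearing inside a header line would be treated as a record boundary.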
Re^3: splitting fasta file into individual fasta files
by citromatik (Curate) on May 07, 2009 at 09:13 UTC

    You can use Tie::File::AnyData::Bio::Fasta to facilitate this task. This script reads a multi-FASTA file and splits it into multiple single-FASTA files:

    use strict;
    use warnings;
    use Tie::File::AnyData::Bio::Fasta;

    tie my @in, 'Tie::File::AnyData::Bio::Fasta', shift @ARGV or die $!;
    my $n = 0;
    for my $seq (@in) {
        tie my @out, 'Tie::File::AnyData::Bio::Fasta', "$n.fa" or die $!;
        $n++;
        @out = ($seq);
        untie @out;
    }
    untie @in;

    citromatik

      citromatik,
      Thanks. I have installed the files into 'C:\Perl\site\lib' and
      'C:\Perl\lib' on my Windows system, but I keep getting the error
      'Can't locate Tie/File/AnyData/Bio/Fasta.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .)'
      What could I be doing wrong?
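The error message says that perl looked for the relative path Tie/File/AnyData/Bio/Fasta.pm under each directory in @INC and found nothing. One possible cause, which cannot be confirmed from the post, is that the .pm files were copied directly into the lib directory instead of into the nested Tie/File/AnyData/Bio/ subdirectories. The small diagnostic below is only a sketch: `candidate_paths` is a hypothetical helper that prints where the file would have to be for `use` to find it.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: for a module name such as Foo::Bar, return the path
# of Foo/Bar.pm under each of the given library directories.
sub candidate_paths {
    my ($module, @dirs) = @_;
    my @parts = split /::/, $module;
    $parts[-1] .= '.pm';
    return map { File::Spec->catfile($_, @parts) } @dirs;
}

# Module to check: first command-line argument, or the one from this thread.
my $module = @ARGV ? $ARGV[0] : 'Tie::File::AnyData::Bio::Fasta';

# @INC may contain code references (hooks); only plain directories are checked.
for my $path (candidate_paths($module, grep { !ref } @INC)) {
    printf "%-8s %s\n", (-e $path ? 'found' : 'missing'), $path;
}
```

If every line says "missing", installing the module with a CPAN client, or moving the files into the directory layout shown, would be the next thing to try.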
