PerlMonks  

put every sequence of a file in a different output file

by bingalee (Acolyte)
on Jun 13, 2013 at 14:59 UTC ( [id://1038764]=perlquestion )

bingalee has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge file with many sequences. I need to separate them into different files. How do I go about writing a Perl script for it? Thanks in advance.

Here's what my file looks like,

    XLOC_000039
    >chr1:983051-985037
    CATGACTTTGTCGGAATTATGTTACTGCTCATTATCAATTCCACCATTAGCTTCATAGAG
    GAAAACAATGCCGGGAATGCTGCTGCTGCGCTTATGGCCCGCCTCGCACCAAAATCCAAG
    GTAAGCCCCACACCCTACTTACCACTCCTTTTTCTTCTCAATACTGCTTTTCATCATGTT
    ACACTCATTTTCTAGGTTTTACGTGATGGAACCTGGAGTGAAATGGACGCATCTTTGTTG
    GTGCCCGGTGACATAATCAGCATTAAACTTGGAGACATCATTCCGGCAGATGCGCGTCTT
    CTCGAGGGAGATCCGCTGAAAATTGACCAGGTCTTTCTTGTGTCTCAATCATAGTGTTCT
    TGGTAGAGCGGAAAAAAAAATATTCTGATATGAAAATTACATGAGACACTAAAACACATA
    XLOC_000456
    >chr1:12600284-12601781
    CAACAATCTCTGATGATGCGGCAGGGCCTTGCTCGCGGGGCGTGGTGCTACCTCGAGGAT
    GAGTTCCTTGGCCAAAGGGAATCCCGGGCGCTTCTACTTGAGACAAAATTCCGCAACTTC
    CGCCAAGAGTCCTTGAGCATCACTGACTACTGCCGCCAGCTTGAGTCAATGGCGGCATCC
    CTTGCCGGTTTCGGCGATCCCATCGGCGATAGGCAGATGGTGCTCACGCTCCTTCGTGGC
    CTCGGCGGCAAGTTCCGTCACATGGTGTCCATCCTCAAGATGCACCAGCCGTTCCCCACG
    TTCGCAGAGGCTCGTGCGCACCTGCTGTTGGAGGAGCTGGAAATCGACGCACGACCTCCA
    TCACCGCCATCGGCACTTGTTGCTGCAGCGCCGCGGCATGCGACTCCGGGGGCCCCAGTA

So I thought I could split each of them into arrays, but I don't really know what to do after that.

This is what I've got so far:

    #usr/bin/perl -w
    open(IN,"/home/datasets/maize/extracted_sequences.gtf");
    mkdir sequence1;
    while($seq=<IN>)
    {
        @file=split(/\r/,$seq);
        open(OUT, ">sequences1/$file[0].txt");
        print OUT ">$file[0]";
    }
    close(IN);
    close(OUT);

Replies are listed 'Best First'.
Re: put every sequence of a file in a different output file
by kennethk (Abbot) on Jun 13, 2013 at 15:09 UTC
    How do you define a 'sequence'? How can you tell when you've moved from one to another? Are the sequences listed in series, or are they mixed together in some way? And while I'm at it, what have you tried and what didn't work? See How do I post a question effectively?.

    In terms of transforming one file to a series of files, your code might look something like:

        my $fh;
        my $i = 0;
        while (my $line = <$in>) {
            if (new_sequence_test($line)) {
                $i++;
                open $fh, '>', "seq_$i" or die "Open fail seq_$i: $!\n";
            }
            print $fh $line;
        }
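
    The code above leaves new_sequence_test undefined on purpose. As a rough sketch only, and assuming every record starts with a line that holds nothing but an XLOC_ identifier (a guess based on the sample in the question, not something the code above specifies), it might look like this:

```perl
use strict;
use warnings;

# Hypothetical implementation of new_sequence_test: returns 1 when a line
# looks like a record header such as "XLOC_000039", and 0 otherwise.
# Assumption: a header line contains only the XLOC_ identifier.
sub new_sequence_test {
    my ($line) = @_;
    return $line =~ /^XLOC_\d+\s*$/ ? 1 : 0;
}
```

    If the real file marks record boundaries differently, only the regex in this sub would need to change; the loop stays the same.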

    Update: Now that you've added input and code, I can say a little more. First, while not strictly necessary, strict will help you catch a lot of potential issues and will make variable scoping intent more obvious. See Use strict warnings and diagnostics or die.
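
    A minimal sketch of what that looks like in practice: with strict enabled, undeclared variables such as $seq and @file become compile-time errors, and a checked three-argument open reports why a file could not be opened. The helper name open_for_read and the path in the usage comment are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading with a lexical filehandle,
# and die with the OS error message if the open fails.
sub open_for_read {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!\n";
    return $in;
}

# Usage (the path is an assumption for illustration):
# my $in = open_for_read('extracted_sequences.gtf');
# while (my $line = <$in>) { ... }
```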

    When you say while($seq=<IN>), you read in one line of your file. Your split assumes you are slurping your whole file. And especially if the file is 'huge', you probably don't want to store it all in memory. Assuming your sequences are split by empty lines, you could modify my already posted code to accomplish your task.

        my $fh;
        my $i = 0;
        while (my $line = <$in>) {
            if (!$fh or $line !~ /\S/) {
                $i++;
                open $fh, '>', "seq_$i" or die "Open fail seq_$i: $!\n";
            }
            print $fh $line;
        }

    You could also do this very simply by modifying $/, but I suspect that solution would be unclear to you.
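
    For the curious, here is a rough sketch of the $/ approach. Setting $/ to the empty string puts Perl in "paragraph mode", where each read returns everything up to the next run of empty lines. This only works if the records really are separated by empty lines, which is an assumption in this thread rather than something the posted sample confirms. The sub name split_records and its arguments are hypothetical.

```perl
use strict;
use warnings;

# Sketch: write each blank-line-separated record to its own file,
# "$out_dir/seq_1", "$out_dir/seq_2", and so on.
# Assumption: records are separated by one or more empty lines.
sub split_records {
    my ($in_path, $out_dir) = @_;
    open my $in, '<', $in_path or die "Cannot open $in_path: $!\n";

    local $/ = "";    # paragraph mode: one read returns one whole record
    my $i = 0;
    while (my $record = <$in>) {
        $i++;
        my $out_path = "$out_dir/seq_$i";
        open my $out, '>', $out_path or die "Open fail $out_path: $!\n";
        print $out $record;
        close $out or die "Close fail $out_path: $!\n";
    }
    close $in;
    return $i;    # number of records written
}
```

    Note that in paragraph mode a separator line must be truly empty; a line containing only spaces or tabs does not count as a record boundary.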


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Thank you, I'll try your code. Mine didn't work: it put every line of the sequence into a new file. I guess I went wrong at the split. Can you tell me what I should have done instead, just so that I know what I did wrong?

Re: put every sequence of a file in a different output file
by davido (Cardinal) on Jun 13, 2013 at 15:13 UTC

    There's no generalized tool for separating a huge file with many sequences into different files. ...well, actually there is; it's Perl. But you have to think through your problem and use programming to craft a solution.

    So let's get into specifics; if you were to do this with pencil and paper, how would you identify where one sequence ends and another begins? Have you started a solution yet? What can you show us?


    Dave

Re: put every sequence of a file in a different output file
by 2teez (Vicar) on Jun 13, 2013 at 16:54 UTC

    Hi bingalee,
    My strong guess is that you want to save each "sequence", starting from the XLOC_... line up to the blank line.
    If so, the following can give you a head start: it saves each "sequence" into a differently named file. (I am using only the OP's data here.)

        use warnings;
        use strict;

        my $fh;
        while (<DATA>) {
            chomp;
            if (/(XLOC_\d+$)/) {
                open $fh, '>', "$1.txt" or die $!;
                print $fh $_, $/;
            }
            else {
                print $fh $_, $/;
            }
        }
        __DATA__
        XLOC_000039
        >chr1:983051-985037
        CATGACTTTGTCGGAATTATGTTACTGCTCATTATCAATTCCACCATTAGCTTCATAGAG
        GAAAACAATGCCGGGAATGCTGCTGCTGCGCTTATGGCCCGCCTCGCACCAAAATCCAAG
        GTAAGCCCCACACCCTACTTACCACTCCTTTTTCTTCTCAATACTGCTTTTCATCATGTT
        ACACTCATTTTCTAGGTTTTACGTGATGGAACCTGGAGTGAAATGGACGCATCTTTGTTG
        GTGCCCGGTGACATAATCAGCATTAAACTTGGAGACATCATTCCGGCAGATGCGCGTCTT
        CTCGAGGGAGATCCGCTGAAAATTGACCAGGTCTTTCTTGTGTCTCAATCATAGTGTTCT
        TGGTAGAGCGGAAAAAAAAATATTCTGATATGAAAATTACATGAGACACTAAAACACATA
        XLOC_000456
        >chr1:12600284-12601781
        CAACAATCTCTGATGATGCGGCAGGGCCTTGCTCGCGGGGCGTGGTGCTACCTCGAGGAT
        GAGTTCCTTGGCCAAAGGGAATCCCGGGCGCTTCTACTTGAGACAAAATTCCGCAACTTC
        CGCCAAGAGTCCTTGAGCATCACTGACTACTGCCGCCAGCTTGAGTCAATGGCGGCATCC
        CTTGCCGGTTTCGGCGATCCCATCGGCGATAGGCAGATGGTGCTCACGCTCCTTCGTGGC
        CTCGGCGGCAAGTTCCGTCACATGGTGTCCATCCTCAAGATGCACCAGCCGTTCCCCACG
        TTCGCAGAGGCTCGTGCGCACCTGCTGTTGGAGGAGCTGGAAATCGACGCACGACCTCCA
        TCACCGCCATCGGCACTTGTTGCTGCAGCGCCGCGGCATGCGACTCCGGGGGCCCCAGTA
    The above is as much as I can make of your question.
    I'm sorry if I got you wrong.
    Please pay close attention to the comments before this.
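
    To point the same idea at a real input file and an output directory, a sketch along these lines might do. It assumes each record begins with a line holding only an XLOC_ identifier; the sub name split_by_id, its arguments, and the directory layout are all hypothetical.

```perl
use strict;
use warnings;

# Sketch: write each record to "$out_dir/<XLOC id>.txt" and return the ids.
# Assumption: a record starts with a line containing only an XLOC_ id.
sub split_by_id {
    my ($in_path, $out_dir) = @_;
    -d $out_dir or mkdir $out_dir or die "Cannot create $out_dir: $!\n";
    open my $in, '<', $in_path or die "Cannot open $in_path: $!\n";

    my ($out, @ids);
    while (my $line = <$in>) {
        if ($line =~ /^(XLOC_\d+)\s*$/) {
            # A header line starts a new record and a new output file.
            open $out, '>', "$out_dir/$1.txt"
                or die "Cannot open $out_dir/$1.txt: $!\n";
            push @ids, $1;
        }
        # Lines seen before the first header have nowhere to go; skip them.
        print {$out} $line if $out;
    }
    close $in;
    return @ids;
}
```

    Reading line by line keeps memory use flat even for a huge file, since only one line is held at a time.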

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
