Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

simple perl script to trim a FASTA file

by RabidMortal (Initiate)
on Apr 08, 2010 at 22:02 UTC ( #833644=perlquestion: print w/ replies, xml ) Need Help??
RabidMortal has asked for the wisdom of the Perl Monks concerning the following question:

hello,

i need help finding/writing a simple script to edit a very large FASTA (text) file.

the text format of the FASTA file is simple:

>HWI-EAS158_40_3_1_46_535
GTGAATGCGTGATACAGGAATGTTCGTTGTGACCAT
>HWI-EAS158_40_3_1_47_579
AAAGTGAATGCGTGATACAGGAATGTTCGTTGTGAC
>HWI-EAS158_40_3_1_46_731
GTGTCATGCGTGATACAGGAATGTTCGTTGTGAAAA

each file has 6000000 lines, all with the exact same format. i need a script that will go through and trim off x nucleotides from the beginning, and y nucleotides from the end, of each and every sequence in the file. so, the script should not touch anything on the lines beginning with ">"

i would really really appreciate any help.

thank you.

Comment on simple perl script to trim a FASTA file
Re: simple perl script to trim a FASTA file
by almut (Canon) on Apr 08, 2010 at 22:10 UTC
    trim off x nucleotides from the beginning, and y nucleotides from the end
    #!/usr/bin/perl while (<DATA>) { s/^(?!>)...(.*)..../$1/; # ^^^ ^^^^ # x=3 y=4 print; } __DATA__ >HWI-EAS158_40_3_1_46_535 GTGAATGCGTGATACAGGAATGTTCGTTGTGACCAT >HWI-EAS158_40_3_1_47_579 AAAGTGAATGCGTGATACAGGAATGTTCGTTGTGAC >HWI-EAS158_40_3_1_46_731 GTGTCATGCGTGATACAGGAATGTTCGTTGTGAAAA

    For longer stretches you can also write .{N} (with N being the number of nucleotides to trim).

Re: simple perl script to trim a FASTA file
by BrowserUk (Pope) on Apr 08, 2010 at 23:01 UTC

    Generally, FASTA files contain multiple sequence lines (wrapped at n? chars), per header line, which means your example could be misleading. If that is the case, then something like this may be needed:

    #! perl -slw use strict; my( $x, $y ) = ( 3, 5 ); while( <DATA> ) { chomp; print ">$_"; local $/ = '>'; $_ = <DATA>; chomp; tr[\n][]d; print for unpack '(A36)*', substr $_, $x, length() - $x - $y; } __DATA__ >HWI-EAS158_40_3_1_46_535 GTGAATGCGTGATACAGGAATGTTCGTTGTGACCAT >HWI-EAS158_40_3_1_47_579 AAAGTGAATGCGTGATACAGGAATGTTCGTTGTGAC AAAGTGAATGCGTGATACAGGAATGTTCGTTGTGAC AAAGTGAATGCGTGATACAGGAATGTTCGTTGTGAC AAAGTGAATGCGTGATACAGGAATGTTCGTTGTGAC >HWI-EAS158_40_3_1_46_731 GTGTCATGCGTGATACAGGAATGTTCGTTGTGAAAA GTGTCATGCGTGATACAGGAATGTTCGTTGTGAAAA GTGTCATGCGTGATACAGGAATGTTCGTTGTGAAAA

    Produces:

    c:\test>junk46 >>HWI-EAS158_40_3_1_46_535 AATGCGTGATACAGGAATGTTCGTTGTG >HWI-EAS158_40_3_1_47_579 GTGAATGCGTGATACAGGAATGTTCGTTGTGACAAA GTGAATGCGTGATACAGGAATGTTCGTTGTGACAAA GTGAATGCGTGATACAGGAATGTTCGTTGTGACAAA GTGAATGCGTGATACAGGAATGTTCGTT >HWI-EAS158_40_3_1_46_731 TCATGCGTGATACAGGAATGTTCGTTGTGAAAAGTG TCATGCGTGATACAGGAATGTTCGTTGTGAAAAGTG TCATGCGTGATACAGGAATGTTCGTTGT

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: simple perl script to trim a FASTA file
by RabidMortal (Initiate) on Apr 09, 2010 at 14:59 UTC
    Thank you thank you.

    So far, I have not been able to get either script to run. I keep getting a missing semicolon error--I am not sure what that means but that seems like something I can figure out on my own.

    What I cannot figure out is if there an easy way I can get these scripts to run on a specified FASTA file? Again, my files are very very large (gigabytes) and I'd prefer not to edit them manually...

    Again, I really appreciate the help.

      You should indeed not edit the sequence files.

      The previous answers (by almut and BrowserUK) just use __DATA__ because that makes it easy to show as compact example. And to use the examples they'll need a little editing.

      For instance, in almut's script, replace the line:

      while( <DATA> ) {
      with
      while( <> ) {

      and invoke the script as:

      perl almuts_answer.pl sequences.fa > sequences_cleaned.fa

      Replacing <DATA> with <> means it will not read the __DATA__ section of the program-file itself, but instead read from STDIN (aka standard input).

      hth

Re: simple perl script to trim a FASTA file
by RabidMortal (Initiate) on Apr 09, 2010 at 16:18 UTC
    Thank you thank you thank you!!!

    I really appreciate the help and both scripts work great.

    This is a great way to end the week!

    :)
Re: simple perl script to trim a FASTA file
by silver_steve (Initiate) on Jan 26, 2012 at 20:15 UTC
    Could someone please explain what the (A36) does in BrowserUk's code? Thank you.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://833644]
Approved by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (9)
As of 2014-11-29 06:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (203 votes), past polls