Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

G'day Sofie,

I normally handle fasta files by making the record terminator the '>'; note that the first record is a zero-length string which you'll typically want to discard. In your particular case here, you can split that record on a newline: the sequence will be the second element. Look at the code below and then some explanatory notes at the end.

#!/usr/bin/env perl use strict; use warnings; use constant MIN_LENGTH => 10; my $long_seqs = 0; { local $/ = '>'; while (<DATA>) { next if $. == 1; my $seq = (split /\n/)[1]; next if MIN_LENGTH > length $seq; ++$long_seqs; print $seq, "\n"; } } if (not $long_seqs) { print 'No sequence is over ', MIN_LENGTH-1, "\n"; } __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC > sequence length = 9 -- should be discarded ABCDEFGHI > sequence length = 10 -- should be kept ABCDEFGHIJ > sequence length = 11 -- should be kept ABCDEFGHIJK

Sample run:

$ ./pm_11113575_fasta_extract.pl GGAGGTCTTTAGCTTTAGGGAAACCC ABCDEFGHIJ ABCDEFGHIJK

Changing the MIN_LENGTH constant to 100:

$ ./pm_11113575_fasta_extract.pl No sequence is over 99

Notes:

  • See "perlvar: Variables related to filehandles" for an explanation of '$/', '$.', and why I've put the local and while in an anonymous block.
  • See "local" for more about that function; that page has links to further information.
  • Purely for demonstration purposes, I've put the data under __DATA__ and read it with the DATA filehandle (see "perldata: Special Literals" for more about those). Your code will have something more like '(<$fh>)'.
  • See "open" for a better way to open your files: using lexical filehandles and the 3-argument form of that command.
  • Your error reporting leaves much to be desired: which file couldn't open? why couldn't it open? It might be fine for a tiny test script like this; however, that inevitably gets expanded and built upon — then you can end up with bugs that are hard to track down (exacerbated by the use of package variables for filehandles). Take a look at the "autodie" pragma: with this you don't have to (remember to) write 'or die "..."' (no point doing extra work when Perl can do it for you) and you also get much better reporting of which file had what problem.
  • Note that I extended your test data — yes, I know those aren't valid sequences but they were handier for quickly checking results than random strings of nucleotides. Off-by-one errors are extremely common and very easy to make: you actually have one in the code you posted ("... No sequence is over ...") — compare with my code to spot the problem. In this case, using test (pseudo-)sequences that are 'MIN_LENGTH-1', 'MIN_LENGTH' and 'MIN_LENGTH+1', you can quickly catch problems like that.
  • Don't use the '-w' switch: see "perlrun: -w" for the reason. Always use the strict and warnings pragmata, as I've shown in my code: see "perlintro: Safety net" for the reason.

— Ken


In reply to Re: Size of sequences in fastafile by kcott
in thread Size of sequences in fastafile by Sofie

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (3)
As of 2024-04-16 05:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found