Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer

Base sequence length in fasta format file

by lolly (Novice)
on Apr 25, 2002 at 14:19 UTC ( #161961=perlquestion: print w/replies, xml ) Need Help??

lolly has asked for the wisdom of the Perl Monks concerning the following question:

Hi, i am stuck. i have a file which contains various stretches of DNA in fasta format e.g
in which each record is separated by a newline. I am trying to write code that will return all sequences containing more than 250 bases. But i think i am getting confused with nested loops etc because although my program runs fine it still returns everything. can anyone help?? this is my code:
#! /usr/local/bin/perl -w use strict; open (INPUT, $ARGV[0]) or die "unable to open file"; my $sequence; my $count =0; my $base = ("A"|"C"|"G"|"T"); my @array; my $ideal = 250; while (<INPUT>) { $sequence = $_; @array = (); @array = split (/\n/, $sequence); foreach $base ($sequence) { $count++; if ($count > $ideal) { print "$base\n"; } } }
i am new to perl and would appreciate any help. lolly

Replies are listed 'Best First'.
Re: simple but stuck
by Fletch (Bishop) on Apr 25, 2002 at 14:32 UTC
    my $base = ("A"|"C"|"G"|"T");

    Erm, this is giving you the bitwise or of the four characters. This probably isn't a useful value. Of course you never actually use it anyhow.

    while (<INPUT>) { $sequence = $_ @array = () @array = split (/\n/, $sequence );

    If you want to use another name for the input line, declare it in the while itself (while( defined( my $sequence = <INPUT> ) )). Also it makes little sense to empty @array right before you assign to it. As for splitting on newlines, that's not really going to work since you haven't changed $/ and you'll only be getting back a line at a time. You probably want to do $/ = "" to get paragraph mode; see perldoc perlvar.

Re: simple but stuck
by demerphq (Chancellor) on Apr 25, 2002 at 15:00 UTC
    Hi, hopefully you are actually indenting your work and that it got unindented by not using code tags.
    #! /usr/local/bin/perl -w use strict; open (INPUT, $ARGV[0]) or die "unable to open file";
    This is precisely what the perl idiom while (<>){} is for
    my $sequence; my $count =0; my $base = ("A"|"C"|"G"|"T");
    This probably doesnt do what you want. $base ends up being the binary or of the string A C G and T. Which happens to be "W". :-)
    my @array; my $ideal = 250; while (<INPUT>) {
    As i said before the while loop can be rewritten much cleaner. Also you are reading in a line at a time because you havent messed with $/
    $sequence = $_;
    No need to save $_ unless intend to do something that will smash it. Which afact you arent.
    @array = (); @array = split (/\n/, $sequence);
    You read in a line, and then you split by \n. That wont help as there will only be 1 (or 0) but not more. So everything from there isnt going to go well. :-)

    Maybe this does what you want, or least points you in the right direction?

    while (<>) { chomp; # lose potential newlines at the end. (only needed for portab +ility) print $_,"\n" if length $_ > 250; # print out any sequence longer t +han 250chars }
    One of the cool things about the while (<>) idiom is that it will allow you to do this:
    # find every line over 250 chars from three different files, print the +m to STDOUT over_250 file.1 file.2 file.3 # find lines over 250 chars from STDIN over_250 <file.1

    Yves / DeMerphq
    Writing a good benchmark isnt as easy as it might look.

Re: simple but stuck
by perlplexer (Hermit) on Apr 25, 2002 at 14:30 UTC
    Perhaps you could explain what a "base" is?
    The code makes little sense...
    my $base = ("A"|"C"|"G"|"T"); # what is this? $base is now 'W' #... foreach $base ($sequence){ # $sequence is a scalar, not an array #... }
Re: simple but stuck
by VSarkiss (Monsignor) on Apr 25, 2002 at 14:54 UTC

    As others have pointed out, this: my $base = ("A"|"C"|"G"|"T");doesn't do what you probably intended.

    I'm not completely clear on what you're trying to do, but I think it's something like this: "if a line contains more than 250 bases, print it out" (where 250 can be changed easily). You don't need to count every single base: you can check that there's only A, T, C, or G in the line, and that it's longer than 250 characters. Something like this should do the trick (Note, this is untested):

    while (<INPUT>) { print if /^[ATCG]+$/ && length >= $ideal; }
    If the number of bases you're looking for is fixed, you can put it in the regular expression:     print if /^[ATCG]{250,}$/;But the first example I gave is probably clearer anyway.

    If that's not the problem you're trying to solve, please provide more information.


Re: simple but stuck
by stephen (Priest) on Apr 25, 2002 at 15:27 UTC

    If I'm reading what you're trying to do correctly, you're trying to read sequences from a FastA format file, count the length of each sequence, and print out the ones that are more than 250 bases long.

    If possible, I'd suggest using the bioperl module. That does all of the work of reading fasta format, and so will make your work easier. If you can't install modules yourself, perhaps you can get your system administrator to install it for you.

    Here's an example of what I think you're trying to do, using the BioPerl modules.

    #! /usr/local/bin/perl -w ### NOT TESTED CODE-- I'm not a bio guy, and don't have BioPerl... use strict; use constant IDEAL => 250; use Bio::SeqIO; MAIN: { my $seq_file = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' ) or die "Couldn't open $ARGV[0]: $!; stopped"; while ( my $seq = $seq_file->next_seq() ) { if ( $seq->length() > IDEAL ) { print $seq->seq(), "\n"; } } }


    Update: Removed redundant length() call.

Re: simple but stuck
by andye (Curate) on Apr 25, 2002 at 14:45 UTC
    Hi lolly,

    I'm guessing here, but it seems from your code like maybe what you're trying to do is return all lines which are longer than 250 characters? If so, try something like this:

    while (<INPUT>) { chomp; print if length($_) > 250; }
    If that's not what you want, then could you explain again what the problem is? (particularly, what's a base?)

    Anyway, hope that helps.



      You probably don't want to chomp, or all the printed bases will come out on the same line...

        RMGir, you have a good point.
        while (<INPUT>) { print if length($_) > 251; }

        lolly, note that the number has changed to 251 (250 for the number of bases, plus one for the newline).


        Update: Or of course:

        while (<INPUT>) { chomp; print $_."\n" if length($_) > 250; }
        Actually if portability is a concern than the chomp is vital, but no matter what hes going to have to factor the length of the /n into his length calculation.

        You are correct however in that he needs to put the newline back on once its been removed.

        Yves / DeMerphq
        Writing a good benchmark isnt as easy as it might look.

      sorry, DNA is composed of four bases A, C, T, G. i will try what you have suggested. lolly
        Can the number on the line before the string of bases ever be longer than 250 digits? If yes, then you need something more sophisticated than my suggestion. Come back to us if it doesn't work. Andy.
Re: simple but stuck
by Putzfrau (Beadle) on Apr 25, 2002 at 15:09 UTC
    By base i assume you mean each occurence of the four letters representing A is for adenine, G is for guanine, C is for cytosine, T is for thymine, and so you would want to count each of those occurences? so you could
    while(<input>) { $sequence = $_; chomp($sequence); $count=length($sequence); if ($count >= $ideal) { print "$sequence\n"; # or whatever you want here } }

    Hope this is what you had in mind and that it helps
    "when a new client is created, we have to kill all the children..." --Sams Teach yourself Perl 5
Re: simple but stuck
by YuckFoo (Abbot) on Apr 25, 2002 at 16:29 UTC
    The translate operator with the same search and replace list will return a count of the number of matched characters. Read all about it in the perlop manpage.

    while ($line = <INPUT>) { if ($line =~ tr/ACGT/ACGT/ > 250) { print $line; } }

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://161961]
Approved by VSarkiss
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2022-05-23 11:57 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (82 votes). Check out past polls.