Base sequence length in fasta format file

lolly has asked for the wisdom of the Perl Monks concerning the following question:

Hi, i am stuck. i have a file which contains various stretches of DNA in fasta format e.g

>12343
ACTAGCTAGCTGATGTCGTCCTACACAT

>4567465
ACTAGCGGCGTGCCTAGTACGTACTTCTTG
[download]

in which each record is separated by a newline. I am trying to write code that will return all sequences containing more than 250 bases. But i think i am getting confused with nested loops etc because although my program runs fine it still returns everything. can anyone help?? this is my code:


#! /usr/local/bin/perl -w

use strict;

open (INPUT, $ARGV[0]) or die "unable to open file";

my $sequence;
my $count =0;
my $base = ("A"|"C"|"G"|"T");
my @array;
my $ideal = 250;

while (<INPUT>)
{
$sequence = $_;
@array = ();
@array = split (/\n/, $sequence);

foreach $base ($sequence)
{
$count++;

if ($count > $ideal)
{
print "$base\n";
}
}
}
[download]

i am new to perl and would appreciate any help. lolly

Comment on Base sequence length in fasta format file Select or Download Code

Replies are listed 'Best First'.
Re: simple but stuck by Fletch (Bishop) on Apr 25, 2002 at 14:32 UTC
`my $base = ("A"\|"C"\|"G"\|"T");` [download] Erm, this is giving you the bitwise or of the four characters. This probably isn't a useful value. Of course you never actually use it anyhow. `while (<INPUT>) { $sequence = $_ @array = () @array = split (/\n/, $sequence );` [download] If you want to use another name for the input line, declare it in the while itself (`while( defined( my $sequence = <INPUT> ) )`). Also it makes little sense to empty `@array` right before you assign to it. As for splitting on newlines, that's not really going to work since you haven't changed `$/` and you'll only be getting back a line at a time. You probably want to do `$/ = ""` to get paragraph mode; see `perldoc perlvar`.	[reply] [d/l] [select]
Re: simple but stuck by demerphq (Chancellor) on Apr 25, 2002 at 15:00 UTC
Hi, hopefully you are actually indenting your work and that it got unindented by not using code tags. `#! /usr/local/bin/perl -w use strict; open (INPUT, $ARGV[0]) or die "unable to open file";` [download] This is precisely what the perl idiom `while (<>){}` is for `my $sequence; my $count =0; my $base = ("A"\|"C"\|"G"\|"T");` [download] This probably doesnt do what you want. $base ends up being the binary or of the string A C G and T. Which happens to be "W". :-) `my @array; my $ideal = 250; while (<INPUT>) {` [download] As i said before the while loop can be rewritten much cleaner. Also you are reading in a line at a time because you havent messed with $/ `$sequence = $_;` [download] No need to save $_ unless intend to do something that will smash it. Which afact you arent. `@array = (); @array = split (/\n/, $sequence);` [download] You read in a line, and then you split by \n. That wont help as there will only be 1 (or 0) but not more. So everything from there isnt going to go well. :-) Maybe this does what you want, or least points you in the right direction? `while (<>) { chomp; # lose potential newlines at the end. (only needed for portab +ility) print $_,"\n" if length $_ > 250; # print out any sequence longer t +han 250chars }` [download] One of the cool things about the while (<>) idiom is that it will allow you to do this: `# find every line over 250 chars from three different files, print the +m to STDOUT over_250 file.1 file.2 file.3 # find lines over 250 chars from STDIN over_250 <file.1` [download] HTH Yves / DeMerphq --- Writing a good benchmark isnt as easy as it might look.	[reply] [d/l] [select]
Re: simple but stuck by perlplexer (Hermit) on Apr 25, 2002 at 14:30 UTC
Perhaps you could explain what a "base" is? The code makes little sense... `my $base = ("A"\|"C"\|"G"\|"T"); # what is this? $base is now 'W' #... foreach $base ($sequence){ # $sequence is a scalar, not an array #... }` [download] --perlplexer	[reply] [d/l]
Re: simple but stuck by VSarkiss (Monsignor) on Apr 25, 2002 at 14:54 UTC
As others have pointed out, this: `my $base = ("A"\|"C"\|"G"\|"T");`doesn't do what you probably intended. I'm not completely clear on what you're trying to do, but I think it's something like this: "if a line contains more than 250 bases, print it out" (where 250 can be changed easily). You don't need to count every single base: you can check that there's only A, T, C, or G in the line, and that it's longer than 250 characters. Something like this should do the trick (Note, this is untested): `while (<INPUT>) { print if /^[ATCG]+$/ && length >= $ideal; }` [download] If the number of bases you're looking for is fixed, you can put it in the regular expression: `print if /^[ATCG]{250,}$/;`But the first example I gave is probably clearer anyway. If that's not the problem you're trying to solve, please provide more information. HTH.	[reply] [d/l] [select]
Re: simple but stuck by stephen (Priest) on Apr 25, 2002 at 15:27 UTC
If I'm reading what you're trying to do correctly, you're trying to read sequences from a FastA format file, count the length of each sequence, and print out the ones that are more than 250 bases long. If possible, I'd suggest using the bioperl module. That does all of the work of reading fasta format, and so will make your work easier. If you can't install modules yourself, perhaps you can get your system administrator to install it for you. Here's an example of what I think you're trying to do, using the BioPerl modules. `#! /usr/local/bin/perl -w ### NOT TESTED CODE-- I'm not a bio guy, and don't have BioPerl... use strict; use constant IDEAL => 250; use Bio::SeqIO; MAIN: { my $seq_file = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' ) or die "Couldn't open $ARGV[0]: $!; stopped"; while ( my $seq = $seq_file->next_seq() ) { if ( $seq->length() > IDEAL ) { print $seq->seq(), "\n"; } } }` [download] stephen Update: Removed redundant length() call.	[reply] [d/l]
Re: simple but stuck by andye (Curate) on Apr 25, 2002 at 14:45 UTC
Hi lolly, I'm guessing here, but it seems from your code like maybe what you're trying to do is return all lines which are longer than 250 characters? If so, try something like this: `while (<INPUT>) { chomp; print if length($_) > 250; }` [download] If that's not what you want, then could you explain again what the problem is? (particularly, what's a base?) Anyway, hope that helps. andy.	[reply] [d/l]
Skip chomp? by RMGir (Prior) on Apr 25, 2002 at 14:50 UTC
`s/chomp;//;` You probably don't want to chomp, or all the printed bases will come out on the same line... -- Mike	[reply] [d/l]
Re: Skip chomp? by andye (Curate) on Apr 25, 2002 at 15:05 UTC
RMGir, you have a good point. `while (<INPUT>) { print if length($_) > 251; }` [download] lolly, note that the number has changed to 251 (250 for the number of bases, plus one for the newline). Andy. Update: Or of course: `while (<INPUT>) { chomp; print $_."\n" if length($_) > 250; }` [download]	[reply] [d/l] [select]
Re: Re: Skip chomp? by demerphq (Chancellor) on Apr 25, 2002 at 15:12 UTC
Re: Skip chomp? by demerphq (Chancellor) on Apr 25, 2002 at 15:06 UTC
Actually if portability is a concern than the chomp is vital, but no matter what hes going to have to factor the length of the /n into his length calculation. You are correct however in that he needs to put the newline back on once its been removed. Yves / DeMerphq --- Writing a good benchmark isnt as easy as it might look.	[reply]
NL len not an issue, at least for Win32 by RMGir (Prior) on Apr 25, 2002 at 15:15 UTC
Re: NL len not an issue, at least for Win32 by demerphq (Chancellor) on Apr 25, 2002 at 15:20 UTC
Re: Re: simple but stuck by lolly (Novice) on Apr 25, 2002 at 14:54 UTC
sorry, DNA is composed of four bases A, C, T, G. i will try what you have suggested. lolly	[reply]
Re: Re: Re: simple but stuck by andye (Curate) on Apr 25, 2002 at 15:08 UTC
Can the number on the line before the string of bases ever be longer than 250 digits? If yes, then you need something more sophisticated than my suggestion. Come back to us if it doesn't work. Andy.	[reply]
Re: simple but stuck by Putzfrau (Beadle) on Apr 25, 2002 at 15:09 UTC
By base i assume you mean each occurence of the four letters representing A is for adenine, G is for guanine, C is for cytosine, T is for thymine, and so you would want to count each of those occurences? so you could `while(<input>) { $sequence = $_; chomp($sequence); $count=length($sequence); if ($count >= $ideal) { print "$sequence\n"; # or whatever you want here } }` [download] Hope this is what you had in mind and that it helps "when a new client is created, we have to kill all the children..." --Sams Teach yourself Perl 5	[reply] [d/l]
Re: simple but stuck by YuckFoo (Abbot) on Apr 25, 2002 at 16:29 UTC
The translate operator with the same search and replace list will return a count of the number of matched characters. Read all about it in the perlop manpage. `while ($line = <INPUT>) { if ($line =~ tr/ACGT/ACGT/ > 250) { print $line; } }` [download]	[reply] [d/l]

Back to Seekers of Perl Wisdom