Hello reebee3,
Here’s one way to approach this task:
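#! perl
use strict;
use warnings;

my (%seqs, $id, $dna);

while (my $line = <>)
{
    chomp $line;

    if ($line =~ / ^ > (.+) /x)             # header line: a new record begins
    {
        $seqs{$id} = $dna if defined $id;   # store the previous record, if any
        $id  = $1;
        $dna = '';
    }
    else
    {
        $dna .= $line;                      # a sequence may span several lines
    }
}

$seqs{$id} = $dna if defined $id;           # store the final record

# Print each ID with its sequence length, shortest sequence first
for my $key (sort { length $seqs{$a} <=> length $seqs{$b} } keys %seqs)
{
    printf "%s:%d\n", $key, length $seqs{$key};
}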
Output:
15:55 >perl 1406_SoPW.pl data.fas
SequenceID|9876_Gene2:15
SequenceID|1234_Gene1:16
15:55 >
Notes:
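- The above code contains no error checking! In particular, it doesn’t check that the input is valid FASTA: sequence data appearing before the first header line is silently discarded, and a repeated ID silently overwrites the earlier record. You say “I do not want to use BioPerl”, but a dedicated module is usually better and safer than hand-written code.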
- The special filehandle <> reads from the file(s) specified on the command line (or from standard input if no files are specified). For other approaches, see perlopentut#Opening-Text-Files-for-Reading.
- You say you want to sort the data by length, but you don’t specify the sort order. I have assumed increasing order. If you want decreasing order instead, swap $a and $b in the sort block: sort { length $seqs{$b} <=> length $seqs{$a} }
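For what it’s worth, here is a minimal sketch of how the three notes above could be combined: an explicit open with an error check instead of <>, a couple of basic sanity checks on the input, and a decreasing sort. The subroutine names read_fasta and ids_by_length_desc are made up for illustration, and the checks are only the two obvious ones (data before the first header, duplicate IDs). It does not validate the sequence alphabet, so treat it as a starting point and not as a proper FASTA parser.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: read one FASTA file and return a reference to a
# hash mapping each ID to its (concatenated) sequence.
sub read_fasta
{
    my ($file) = @_;

    open(my $fh, '<', $file)
        or die "Cannot open '$file' for reading: $!\n";

    my (%seqs, $id);

    while (my $line = <$fh>)
    {
        chomp $line;
        next if $line =~ /^\s*$/;           # skip blank lines

        if ($line =~ / ^ > (.+) /x)         # header line
        {
            $id = $1;
            die "Duplicate ID '$id' at line $.\n" if exists $seqs{$id};
            $seqs{$id} = '';
        }
        else                                # sequence line
        {
            die "Sequence data before first header at line $.\n"
                unless defined $id;
            $seqs{$id} .= $line;
        }
    }

    close($fh) or die "Cannot close '$file': $!\n";

    return \%seqs;
}

# Hypothetical helper: IDs ordered by decreasing sequence length.
# Ties are broken by ID so that the output order is repeatable.
sub ids_by_length_desc
{
    my ($seqs) = @_;

    return sort { length $seqs->{$b} <=> length $seqs->{$a}
                      or $a cmp $b } keys %$seqs;
}

# Only run when a file name is given on the command line
if (@ARGV)
{
    my $seqs = read_fasta($ARGV[0]);

    printf "%s:%d\n", $_, length $seqs->{$_}
        for ids_by_length_desc($seqs);
}
```

Run it the same way as the first script (perl script.pl data.fas). The tie-break on ID is my own addition: without it, sequences of equal length come out in whatever order keys happens to return them.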
Hope that helps,