This is a fasta-formatted sequence file, right? You're best bet for doing this is might be the BioPerl library. They have both a regular fasta-parser as well as a special indexed parser for genome-sized files. The latter works really well if you need to get subsequences from a file. For instance, I use the indexed-version when I want to extract all mRNA sequences from a single chromosome.
Here is the code for using the indexed-version. It processes every mRNA on every chromosome in a typical eukaryotic genome in about 30 minutes on a P4.
# set up chromosome sequence object
my $file = $chromosome.'.fa';
my $seqio = Bio::SeqIO->new(
-format => 'largefasta',
-file => $file
);
# set up range parameters
my $strand = '+';
my $seq_start = 1_000_000;
my $seq_end = 2_000_000;
my $downstream= 10_000;
# handle sequences on both strands
if ($strand eq '+') {
$seq_start = $tx_start - $upstream;
$seq_end = $tx_start + $downstream;
if ($seq_start < 0) { next(); }
if ($seq_end > $length) { next(); }
$seq_mRNA = $seq_chr->trunc($seq_start,$seq_end);
}
elsif ($strand eq '-') {
$seq_start = $tx_end + $upstream;
$seq_end = $tx_end - $downstream;
if ($seq_end < 0) { next(); }
if ($seq_start > $length) { next(); }
$seq_mRNA = $seq_chr->trunc($seq_end,$seq_start)->revcom();
}
else {
warn "No strand: $strand!\n";
next();
}
# write seq to a file
my $outfile = $mRNA.'.fasta';
open(OUT, '>', $outfile);
print OUT ">$mRNA\n", $seq_mRNA->seq();
close(OUT);
And there are always the usual caveats with BioPerl -- great user support group, fairly big initial learning curve, highly object-oriented library....
-Tats