note
abualiga
<p>Hi,
out of curiosity, as I work with fasta files a bit, why not use the Bio::DB::Fasta module from BioPerl for indexed access to any number of fasta files. Your example is precisely what this module is for, unless you're exercising OO programming. I used your sample data for the code below, if interested.
</p>
<p> update: Initially, creating the db may take a moment depending on size of input files, but it works pretty quick every time it's accessed after that.
</p>
<code>
#!/usr/bin/perl
use Modern::Perl '2011';
use autodie;
use Bio::DB::Fasta;
die "Usage: $0 <fastaFile> <queryFile>\n" unless my( $fastaFile, $queryFile ) = @ARGV;
open my $fh, '<', $queryFile;
chomp( my @queries = <$fh> );
close $fh;
# trim header
@queries = map{ local $_ = $_; s/^>(\S+)\s+\d+/$1/; $_ } @queries;
# create indexed database from fasta file
my $db = Bio::DB::Fasta->new( $fastaFile );
open my $fh2, '>', 'querySeqs.fa';
for my $query ( @queries ) {
my $seqheader = $db->header( $query );
my $seqdna = $db->seq( $seqheader );
say {$fh2} ">$seqheader";
print {$fh2} format_sequence( $seqdna, 60 );
}
close $fh2;
sub format_sequence {
my($sequence, $length) = @_; # sequence and length of line
open my $string_fh, '>', \ my $string; # print sequence to string
for( my $pos = 0; $pos < length( $sequence ); $pos += $length ) {
print {$string_fh} substr( $sequence, $pos, $length ), "\n";
}
$string;
}
output in file querySeqs.fa
>Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent=AT1G11900.len361.i1.1_111
GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTT
TTCTGATTGTTACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGA
AATTTCGAACCATTGCTGTTCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACG
CTTCAGCTAGTGTTTGATTACGAAGTTTTCTGATTAAATGTGTGAGTTTTTTTGTAGTCA
TCTCGAAAACTGAGAAATCCATTTTTATAGATTTACATTGTTCATAGTTATATGTGGAAG
TTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGTTTATTGGCATTGCA
G
>Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent=AT1G76660.len490.i1.2_394
GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCT
CTGTCTCAGTTCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTT
TTTTCCTACTTAGATCTGACTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTAT
GTTGAATCTCTCCTTTGATCATTATGTTATTGTAAAAATCCCAGCTTTATGCTAAATCGA
GCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTACATTTTTCTGAATTGCCATTCAAT
TAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTCGATGCGTGTAAGCT
ATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAATTATC
TTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGT
CCTTTTGCAG
>Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent=AT1G31420.len415.i1.13_217
GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGG
AAAGAGTGGGGTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAA
TCTTCAGTTGTTATTGTGTTATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTT
AAGCTAGAGCTGTTTTTGAGCTCTTTTGTTGATGATGATTTTGAGATTGTTGGCCGAATT
TAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGTGTATTGGTTCTTGTGAGGCAATT
TCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTTGTGTGGAGACTGAG
CTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG
</code>
1050420
1050420