drhicks has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have two files.
File 1 is a FASTA file with a header line beginning with ">" and DNA sequence on the subsequent lines until the next ">" header.
File 2 contains one line per entry, formatted the same as the header lines in file 1 but with a tab and a score value appended. File 2 contains only a fraction of the header lines found in file 1.
File 1 contains roughly 900,000 header lines plus the corresponding DNA sequence for each; file 2 contains roughly 60,000 header lines.
I want a script to find matching header lines from file 2 and print the header along with the corresponding DNA sequence for that header from file1.
I wrote a script that does what I want, but it is far too slow to be usable: after about an hour it had returned only ~700 correct entries out of the ~60,000 total.
Small examples of file 1 and file 2 are shown below.
file1
>Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent=AT1G11900.len361.i1.1_111
GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTTTTCTGATTGTTACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGAAATTTCGAACCATTGCTGTTCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACGCTTCAGCTAGTGTTTGATTACGAAGTTTTCTGATTAAATGTGTGAGTTTTTTTGTAGTCATCTCGAAAACTGAGAAATCCATTTTTATAGATTTACATTGTTCATAGTTATATGTGGAAGTTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGTTTATTGGCATTGCAG
>Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent=AT1G76660.len490.i1.2_394
GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCTCTGTCTCAGTTCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTTTTTTCCTACTTAGATCTGACTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTATGTTGAATCTCTCCTTTGATCATTATGTTATTGTAAAAATCCCAGCTTTATGCTAAATCGAGCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTACATTTTTCTGAATTGCCATTCAATTAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTCGATGCGTGTAAGCTATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAATTATCTTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGTCCTTTTGCAG
>Chr1.ID=PAC:19657550;Name=AT1G53750.1;pacid=19657550;longest=1;Parent=AT1G53750.len344.i1.9_229
GTAAGCCTTCCCCTTTTAGAACCCTAAGTTTTATTGGGGTTTTCGATTTTTACTCTTCTGATTCATCGGAGAATTCGGATCTACACTAGATTTTAGTTACTCGAATGTGAGGGTTTCGTCTCTTTGCAAACCAATTTGATGTTTCCTCCTGAGCTAGATATGTTCTTGATGAGCTTGATTTTTCTACTTGGTTCAGTTTTTTTTGCTAATTACTACTTATATGAGTGAATCTGCCTCACTGTTTGAATTTATTCCAAGTGGAAATATTCATAGTCATGCTTTGTTGTATCTGTTATCTCTCCATATGTTGTGTTGCTGACCTTGTTAAATCTCATTCTCTGCAG
>Chr1.ID=PAC:19650963;Name=AT1G47740.1;pacid=19650963;longest=1;Parent=AT1G47740.len1091.i1.4_277
GTAAGTTAATCTTCTCTTCTGAAAATTGAATTTGGTGTATCAATTCTTACATTATCTTGAAGATTCATCTCTGAATTTCTCAAATTTATGGGGGTTTTTTGTTTGTCGGAATTGCCGGAGAAATTGGAAAAAACGAGATCTTTGAGTAAAGGGTTTGTTTAATCTTTAGTCTTTATTGCTTTCCTTAAGCTAATTTTGGCAGATCTGGAACATAAACCCTAGAACAAGACCAAATCAGTGTCTCCTTACTCTTAGGGATTTTAGTCTCTGTGATACCTTAAATGTGTTTATAAATTGACTGTGCTTAATGGGTCACATTTGATTTGCAATAAAAGTTTCACAATTCTTCCATTTTCAGTAATGTAAGCTCCAGTTTTCAAGATTTACTATTTTGGGGAACTAGTTAGGTTTGTAGGTTTATTAGATCTTAGAAACACTAATGTAATGCTGTTTGTTTGGTACTCTTTAATATTCACTATTCATCTTAATGTGGAATCAAATTTGCTTTTTTTGGTCAGACTCAATTAGTTGGAATGAGTTGTTGAACCATTGACTTGTCTCCTAAGCTCTTACTTAATCTATCTGTATCATCTTCTCCTGTCTATTTCTCTTTATTTAATACATAACTCGTTCATGATTGCATCTGACATGGTAGCAACTCTTTGTGAACAAGACTTGATTTTATAACACTGTAACATGACCACACTTTTTCCTTTTGATCTCTGTATATTTGGGGAAGTAAGGATTTGGTTTAACAATGATACAAAATCACAGTTTAGATGACTCAGTCTTGTCTTTTATCATTAAGATGGAAAAAATGAGACCAATCTTGTCTTGCTTTATTCTAAGATGCATGCTCTATTATATTTTATGCTTTTGTATATATAACTGATTGAAGCCGGCCAATGGAGATTGGTGCTGACTTTTTAAATGAGCTGTTGTTGTATTTCAGCCAACTAGCATAAGATAAAATAAAATAGGAAAATTTTTCATTTCAGTCTCTTAAGTTCAATGAAAACATAATGTGGCACAGAGGTCTTTGCTTTTGTACCTTTCAGAATCTTTTATTGATTGATGATGTTTACATACAG
>Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent=AT1G31420.len415.i1.13_217
GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGGAAAGAGTGGGGTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAATCTTCAGTTGTTATTGTGTTATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTTAAGCTAGAGCTGTTTTTGAGCTCTTTTGTTGATGATGATTTTGAGATTGTTGGCCGAATTTAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGTGTATTGGTTCTTGTGAGGCAATTTCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTTGTGTGGAGACTGAGCTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG
file2
>Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent=AT1G11900.len361.i1.1_111 100
>Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent=AT1G76660.len490.i1.2_394 34
>Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent=AT1G31420.len415.i1.13_217 76
Here is the main script that I wrote:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use FAlite;

die "usage: $0 <fasta> <list>\n" unless @ARGV == 2;

open(FASTA, $ARGV[0]) or die;
my $fasta = new FAlite(\*FASTA);
while (my $entry = $fasta->nextEntry) {
    my $gseq = $entry->seq;
    my ($chrom) = $entry->def =~ /(^\S+)/;
    my @a1 = $chrom;
    my %hash;
    $hash{$_}++ for @a1;
    my $header;
    open(LIST, $ARGV[1]) or die;
    while (<LIST>) {
        next if /^#/;
        my ($header, $score) = split;
        my @a2 = $header;
        for my $item (@a2) {
            if ($hash{$item}) { print "$chrom\n$gseq\n" }
        }
    }
}
close FASTA;
close LIST;
__END__
Here is the Perl module that the above code uses:
package FAlite;
use strict;

sub new {
    my ($class, $fh) = @_;
    if (ref $fh !~ /GLOB/) {
        die ref $fh, "\n", "FAlite ERROR: expect a GLOB reference\n";
    }
    my $this = bless {};
    $this->{FH} = $fh;
    while (<$fh>) { last if $_ =~ /\S/ }    # not supposed to have blanks, but...
    my $firstline = $_;
    if (not defined $firstline) { warn "FAlite: Empty\n"; return $this }
    if ($firstline !~ /^>/) { warn "FAlite: Not FASTA formatted\n"; return $this }
    $this->{LASTLINE} = $firstline;
    chomp $this->{LASTLINE};
    return $this;
}

sub nextEntry {
    my ($this) = @_;
    return 0 if not defined $this->{LASTLINE};
    my $fh = $this->{FH};
    my $def = $this->{LASTLINE};
    my @seq;
    my $lines_read = 0;
    while (<$fh>) {
        $lines_read++;
        if ($_ =~ /^>/) {
            $this->{LASTLINE} = $_;
            chomp $this->{LASTLINE};
            last;
        }
        push @seq, $_;
    }
    return 0 if $lines_read == 0;
    chomp @seq;
    my $entry = FAlite::Entry::new($def, \@seq);
    return $entry;
}

package FAlite::Entry;
use overload '""' => 'all';

sub new {
    my ($def, $seqarry) = @_;
    my $this = bless {};
    $this->{DEF} = $def;
    $this->{SEQ} = join("", @$seqarry);
    $this->{SEQ} =~ s/\s//g;    # just in case there are more spaces
    return $this;
}

sub def { shift->{DEF} }
sub seq { shift->{SEQ} }
sub all { my $e = shift; return $e->{DEF} . "\n" . $e->{SEQ} . "\n" }

1;
While my script works, I feel that there should be a more efficient way to search my array against my hash and print the associated header and DNA sequence.
Thanks for any help,
Derrick
Re: Searching array against hash
by BrowserUk (Patriarch) on Aug 21, 2013 at 21:37 UTC
I feel that there should be a more efficient way
There is. Your current process is roughly O(900,000 × 60,000), about 54 billion comparisons.
If you reverse the logic of your lookups by reading the list file into a hash first, you can then read the FASTA file entry by entry and look each header up in the hash, which reduces it to an O(900,000) process.
Try this (untested) version which should run close to 60,000 times faster:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use FAlite;

die "usage: $0 <fasta> <list>\n" unless @ARGV == 2;

my %list;
open(LIST, $ARGV[1]) or die;
while (<LIST>) {
    next if /^#/;
    my ($header, $score) = split;
    $list{ $header } = $score;
}
close LIST;

open(FASTA, $ARGV[0]) or die;
my $fasta = new FAlite(\*FASTA);
while (my $entry = $fasta->nextEntry) {
    if ( exists $list{ $entry->def } ) {
        print $entry->def, "\n";
        print $entry->seq, "\n";
    }
}
close FASTA;
__END__
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Wow thanks, only takes a couple seconds to finish! I had attempted to do the same thing, but could never get it working, and wasn't sure if/how it would actually increase the speed.
Thanks again
The "if/how" is this:
Your first solution had an outer loop that runs 900,000 times. Inside that outer loop, there's an inner loop that runs 60,000 times. 60k * 900k is 54 billion total iterations inside the inner loop.
The proposed solution created a hash of 60000 elements. Then your 900000 line file is read line by line. Inside of that loop that iterates 900000 times, there's a hash lookup, which is almost free. There are a total of 60000 iterations needed to build the hash, and 900000 iterations needed to test each line of the FASTA file. The amount of work being done is, therefore, 960000 iterations.
Think of loops inside of loops as doing n*m amount of work, whereas loops followed by loops (no nesting) do n+m amount of work. Anytime you have the choice of an algorithm where the order of growth is the mathematical product of two large numbers, or an algorithm where the growth rate is the mathematical sum of the same two numbers, your sense of economic utility should be telling you that the latter will scale up better.
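The difference is easy to see in a scaled-down sketch (the header names below are made up stand-ins; in the real data they would be the FASTA defline strings):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-ins for the real data (names are illustrative only)
my @headers = map { "hdr$_" } 1 .. 900;   # think: 900,000 FASTA headers
my @wanted  = map { "hdr$_" } 1 .. 60;    # think: 60,000 list headers

# Nested loops: n*m comparisons (900 * 60 = 54,000 here)
my $nested = 0;
for my $h (@headers) {
    for my $w (@wanted) { $nested++ if $h eq $w }
}

# Hash lookup: n+m steps (60 to build the hash, 900 to probe it)
my %want = map { $_ => 1 } @wanted;
my $hashed = grep { $want{$_} } @headers;   # grep in scalar context counts matches

print "nested: $nested, hashed: $hashed\n";   # both find 60 matches
```

Both approaches find the same matches; only the amount of work differs, and the gap widens rapidly as the inputs grow.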
Re: Searching array against hash
by abualiga (Scribe) on Aug 22, 2013 at 01:07 UTC
Hi,
Out of curiosity, as I work with FASTA files a bit: why not use the Bio::DB::Fasta module from BioPerl for indexed access to any number of FASTA files? Your example is precisely the task this module is for, unless your goal is to practice OO programming. I used your sample data for the code below, if you're interested.
Update: creating the database may initially take a moment, depending on the size of the input files, but every access after that is quick.
#!/usr/bin/perl
use Modern::Perl '2011';
use autodie;
use Bio::DB::Fasta;

die "Usage: $0 <fastaFile> <queryFile>\n" unless @ARGV == 2;
my( $fastaFile, $queryFile ) = @ARGV;

open my $fh, '<', $queryFile;
chomp( my @queries = <$fh> );
close $fh;

# trim header: keep the part after '>' and drop the score
@queries = map { local $_ = $_; s/^>(\S+)\s+\d+/$1/; $_ } @queries;

# create indexed database from fasta file
my $db = Bio::DB::Fasta->new( $fastaFile );

open my $fh2, '>', 'querySeqs.fa';
for my $query ( @queries ) {
    my $seqheader = $db->header( $query );
    my $seqdna    = $db->seq( $seqheader );
    say {$fh2} ">$seqheader";
    print {$fh2} format_sequence( $seqdna, 60 );
}
close $fh2;

sub format_sequence {
    my( $sequence, $length ) = @_;          # sequence and length of line
    open my $string_fh, '>', \my $string;   # print sequence to an in-memory string
    for ( my $pos = 0; $pos < length( $sequence ); $pos += $length ) {
        print {$string_fh} substr( $sequence, $pos, $length ), "\n";
    }
    $string;
}
Output in file querySeqs.fa:
>Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent=AT1G11900.len361.i1.1_111
GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTT
TTCTGATTGTTACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGA
AATTTCGAACCATTGCTGTTCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACG
CTTCAGCTAGTGTTTGATTACGAAGTTTTCTGATTAAATGTGTGAGTTTTTTTGTAGTCA
TCTCGAAAACTGAGAAATCCATTTTTATAGATTTACATTGTTCATAGTTATATGTGGAAG
TTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGTTTATTGGCATTGCA
G
>Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent=AT1G76660.len490.i1.2_394
GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCT
CTGTCTCAGTTCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTT
TTTTCCTACTTAGATCTGACTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTAT
GTTGAATCTCTCCTTTGATCATTATGTTATTGTAAAAATCCCAGCTTTATGCTAAATCGA
GCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTACATTTTTCTGAATTGCCATTCAAT
TAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTCGATGCGTGTAAGCT
ATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAATTATC
TTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGT
CCTTTTGCAG
>Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent=AT1G31420.len415.i1.13_217
GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGG
AAAGAGTGGGGTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAA
TCTTCAGTTGTTATTGTGTTATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTT
AAGCTAGAGCTGTTTTTGAGCTCTTTTGTTGATGATGATTTTGAGATTGTTGGCCGAATT
TAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGTGTATTGGTTCTTGTGAGGCAATT
TCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTTGTGTGGAGACTGAG
CTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG
Without prejudicing the OP's response in any way, can I ask what you think using Bio::DB::Fasta would offer over the OP's 50-line module?
Given that to get it, he would have to try to install it and its 897 co-dependents (not to mention AnyDBM_File and at least one of DB_File, GDBM_File, NDBM_File, or SDBM_File, all of which are a known problem on my platform), and somehow resolve the 95 "cannot resolve"s (whatever they are?):
all to install an indexing module that he doesn't need, and won't benefit from, in order to complete a task that already finishes in "a couple seconds".
Without prejudicing the OP's response in any way, can I ask what you think using Bio::DB::Fasta would offer over the OP's 50-line module?
Not having a CS background, I cannot analyze an algorithm for efficiency. However, Bio::DB::Fasta is an existing and, among biologists, widely used solution to the common task the OP presented. So, like you, I exercise the privilege of offering my solution.
Given that to get it, he would have to try to install it and its 897 co-dependents (not to mention AnyDBM_File and at least one of DB_File, GDBM_File, NDBM_File, or SDBM_File, all of which are a known problem on my platform), and somehow resolve the 95 "cannot resolve"s (whatever they are?):
So far, I've installed BioPerl on Ubuntu, Red Hat, and Mac OS X without problems. If the OP happens to be in a biomedical field, it may be worthwhile to install and use BioPerl, as it solves many genomic tasks.
all to install an indexing module that he doesn't need, and won't benefit from, in order to complete his task in "a couple seconds".
"Without prejudicing the OPs response in any way"
It will help if you are looking to retrieve a subsequence from the human genome, whose FASTA file is about 5 GB; when parsing a smaller FASTA file with just a few entries, it makes less of a difference.
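For that use case, a minimal sketch of indexed subsequence access might look like this (the file name and coordinates are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::Fasta;

# Builds an on-disk index the first time, then reuses it on later runs
my $db = Bio::DB::Fasta->new('genome.fa');   # hypothetical genome file

# Pull a 61 bp slice without scanning the whole multi-gigabyte file
my $subseq = $db->seq('Chr1', 1_000 => 1_060);
print "$subseq\n";
```

The point of the index is exactly this random access: only the requested slice is read, regardless of how large the FASTA file is.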