Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
There's more than one way to do things
 
PerlMonks  

Searching array against hash

by drhicks (Novice)
on Aug 21, 2013 at 21:01 UTC ( #1050420=perlquestion: print w/ replies, xml ) Need Help??
drhicks has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have two files. File 1 is a FASTA file with a header line beginning with ">" and DNA sequence on the subsequent lines until the next ">" header. File 2 contains line after line formatted the same as the header line in file 1 except with a tab and score value added. There are only a fraction of the header lines in file2 as there are in file 1. File1 contains roughly 900,000 header lines and the corresponding DNA sequence for each. FIle2 contains roughly 60,000 header lines. I want a script to find matching header lines from file 2 and print the header along with the corresponding DNA sequence for that header from file1. I wrote a script that is successful in doing what I want, but it is far too slow to be used. After ~1 hr, it had returned only ~700 correct entries out of ~60,000 total. A small example of file1 and file 2 are shown below.

file1 >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTTTTCTGATTGT +TACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGAAATTTCGAACCATTGCTGT +TCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACGCTTCAGCTAGTGTTTGATTACGAAGTTT +TCTGATTAAATGTGTGAGTTTTTTTGTAGTCATCTCGAAAACTGAGAAATCCATTTTTATAGATTTACA +TTGTTCATAGTTATATGTGGAAGTTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGT +TTATTGGCATTGCAG >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCTCTGTCTCAGT +TCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTTTTTTCCTACTTAGATCTGA +CTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTATGTTGAATCTCTCCTTTGATCATTATGTT +ATTGTAAAAATCCCAGCTTTATGCTAAATCGAGCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTAC +ATTTTTCTGAATTGCCATTCAATTAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTC +GATGCGTGTAAGCTATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAA +TTATCTTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGTCCTT +TTGCAG >Chr1.ID=PAC:19657550;Name=AT1G53750.1;pacid=19657550;longest=1;Parent +=AT1G53750.len344.i1.9_229 GTAAGCCTTCCCCTTTTAGAACCCTAAGTTTTATTGGGGTTTTCGATTTTTACTCTTCTGATTCATCGGA +GAATTCGGATCTACACTAGATTTTAGTTACTCGAATGTGAGGGTTTCGTCTCTTTGCAAACCAATTTGA +TGTTTCCTCCTGAGCTAGATATGTTCTTGATGAGCTTGATTTTTCTACTTGGTTCAGTTTTTTTTGCTA +ATTACTACTTATATGAGTGAATCTGCCTCACTGTTTGAATTTATTCCAAGTGGAAATATTCATAGTCAT +GCTTTGTTGTATCTGTTATCTCTCCATATGTTGTGTTGCTGACCTTGTTAAATCTCATTCTCTGCAG >Chr1.ID=PAC:19650963;Name=AT1G47740.1;pacid=19650963;longest=1;Parent +=AT1G47740.len1091.i1.4_277 GTAAGTTAATCTTCTCTTCTGAAAATTGAATTTGGTGTATCAATTCTTACATTATCTTGAAGATTCATCT +CTGAATTTCTCAAATTTATGGGGGTTTTTTGTTTGTCGGAATTGCCGGAGAAATTGGAAAAAACGAGAT +CTTTGAGTAAAGGGTTTGTTTAATCTTTAGTCTTTATTGCTTTCCTTAAGCTAATTTTGGCAGATCTGG +AACATAAACCCTAGAACAAGACCAAATCAGTGTCTCCTTACTCTTAGGGATTTTAGTCTCTGTGATACC +TTAAATGTGTTTATAAATTGACTGTGCTTAATGGGTCACATTTGATTTGCAATAAAAGTTTCACAATTC +TTCCATTTTCAGTAATGTAAGCTCCAGTTTTCAAGATTTACTATTTTGGGGAACTAGTTAGGTTTGTAG +GTTTATTAGATCTTAGAAACACTAATGTAATGCTGTTTGTTTGGTACTCTTTAATATTCACTATTCATC +TTAATGTGGAATCAAATTTGCTTTTTTTGGTCAGACTCAATTAGTTGGAATGAGTTGTTGAACCATTGA +CTTGTCTCCTAAGCTCTTACTTAATCTATCTGTATCATCTTCTCCTGTCTATTTCTCTTTATTTAATAC +ATAACTCGTTCATGATTGCATCTGACATGGTAGCAACTCTTTGTGAACAAGACTTGATTTTATAACACT +GTAACATGACCACACTTTTTCCTTTTGATCTCTGTATATTTGGGGAAGTAAGGATTTGGTTTAACAATG +ATACAAAATCACAGTTTAGATGACTCAGTCTTGTCTTTTATCATTAAGATGGAAAAAATGAGACCAATC +TTGTCTTGCTTTATTCTAAGATGCATGCTCTATTATATTTTATGCTTTTGTATATATAACTGATTGAAG +CCGGCCAATGGAGATTGGTGCTGACTTTTTAAATGAGCTGTTGTTGTATTTCAGCCAACTAGCATAAGA +TAAAATAAAATAGGAAAATTTTTCATTTCAGTCTCTTAAGTTCAATGAAAACATAATGTGGCACAGAGG +TCTTTGCTTTTGTACCTTTCAGAATCTTTTATTGATTGATGATGTTTACATACAG >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGGAAAGAGTGGG +GTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAATCTTCAGTTGTTATTGTGT +TATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTTAAGCTAGAGCTGTTTTTGAGCTCTTTTG +TTGATGATGATTTTGAGATTGTTGGCCGAATTTAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGT +GTATTGGTTCTTGTGAGGCAATTTCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTT +GTGTGGAGACTGAGCTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG
file2 >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 100 >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 34 >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 76
Here is the main script that I wrote:
#!/usr/bin/perl use strict; use warnings; use diagnostics; use FAlite; die "usage: $0 <fasta> <list>\n" unless @ARGV == 2; open(FASTA, $ARGV[0]) or die; my $fasta = new FAlite(\*FASTA); while (my $entry = $fasta->nextEntry) { my $gseq = $entry->seq; my ($chrom) = $entry->def =~ /(^\S+)/; my @a1 = $chrom; my %hash; $hash{$_}++ for @a1; my $header; open(LIST, $ARGV[1]) or die; while (<LIST>) { next if /^#/; my ($header, $score) = split; my @a2 = $header; for my $item (@a2) { if ($hash{$item}) {print "$chrom\n$gseq\n"}; } } } close FASTA; close LIST; __END__
Here is the perl module that the above code uses:
package FAlite; use strict; sub new { my ($class, $fh) = @_; if (ref $fh !~ /GLOB/) {die ref $fh, "\n", "FAlite ERROR: expect a GLOB reference\n"} my $this = bless {}; $this->{FH} = $fh; while(<$fh>) {last if $_ =~ /\S/} # not supposed to have blanks, b +ut... my $firstline = $_; if (not defined $firstline) {warn "FAlite: Empty\n"; return $this} if ($firstline !~ /^>/) {warn "FAlite: Not FASTA formatted\n"; ret +urn $this} $this->{LASTLINE} = $firstline; chomp $this->{LASTLINE}; return $this; } sub nextEntry { my ($this) = @_; return 0 if not defined $this->{LASTLINE}; my $fh = $this->{FH}; my $def = $this->{LASTLINE}; my @seq; my $lines_read = 0; while(<$fh>) { $lines_read++; if ($_ =~ /^>/) { $this->{LASTLINE} = $_; chomp $this->{LASTLINE}; last; } push @seq, $_; } return 0 if $lines_read == 0; chomp @seq; my $entry = FAlite::Entry::new($def, \@seq); return $entry; } package FAlite::Entry; use overload '""' => 'all'; sub new { my ($def, $seqarry) = @_; my $this = bless {}; $this->{DEF} = $def; $this->{SEQ} = join("", @$seqarry); $this->{SEQ} =~ s/\s//g; # just in case more spaces return $this; } sub def {shift->{DEF}} sub seq {shift->{SEQ}} sub all {my $e = shift; return $e->{DEF}."\n".$e->{SEQ}."\n"} 1;
While my script is working, I feel that there should be a more efficient way to search my array against my hash and print the associated header and DNA sequence. Thanks for any help, Derrick

Comment on Searching array against hash
Select or Download Code
Re: Searching array against hash
by BrowserUk (Pope) on Aug 21, 2013 at 21:37 UTC
    I feel that there should be a more efficient way

    There is. Your current process is ~O(900,000*60,000) = 54 billion.

    If you reverse the logic of your lookups, by reading the list file into a hash first; you can then read the fasta file line by line and lookup the headers in the hash, which results in an O( 900,000 ) process.

    Try this (untested) version which should run close to 60,000 times faster:

    #!/usr/bin/perl use strict; use warnings; use diagnostics; use FAlite; die "usage: $0 <fasta> <list>\n" unless @ARGV == 2; my %list; open(LIST, $ARGV[1]) or die; while (<LIST>) { next if /^#/; my ($header, $score) = split; $list{ $header } = $score; } close LIST; open(FASTA, $ARGV[0]) or die; my $fasta = new FAlite(\*FASTA); while (my $entry = $fasta->nextEntry) { if( exists $list{ $entry->def } ) { print $entry->def; print $entry->seq; } } close FASTA; __END__

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Wow thanks, only takes a couple seconds to finish! I had attempted to do the same thing, but could never get it working, and wasn't sure if/how it would actually increase the speed. Thanks again

        The "if/how" is this:

        Your first solution had an outer loop that runs 900,000 times. Inside that outer loop, there's an inner loop that runs 60,000 times. 60k * 900k is 54 billion total iterations inside the inner loop.

        The proposed solution created a hash of 60000 elements. Then your 900000 line file is read line by line. Inside of that loop that iterates 900000 times, there's a hash lookup, which is almost free. There are a total of 60000 iterations needed to build the hash, and 900000 iterations needed to test each line of the FASTA file. The amount of work being done is, therefore, 960000 iterations.

        Think of loops inside of loops as doing n*m amount of work, whereas loops followed by loops (no nesting) do n+m amount of work. Anytime you have the choice of an algorithm where the order of growth is the mathematical product of two large numbers, or an algorithm where the growth rate is the mathematical sum of the same two numbers, your sense of economic utility should be telling you that the latter will scale up better.


        Dave

Re: Searching array against hash
by abualiga (Scribe) on Aug 22, 2013 at 01:07 UTC

    Hi, out of curiosity, as I work with fasta files a bit, why not use the Bio::DB::Fasta module from BioPerl for indexed access to any number of fasta files. Your example is precisely what this module is for, unless you're exercising OO programming. I used your sample data for the code below, if interested.

    update: Initially, creating the db may take a moment depending on size of input files, but it works pretty quick every time it's accessed after that.

    #!/usr/bin/perl use Modern::Perl '2011'; use autodie; use Bio::DB::Fasta; die "Usage: $0 <fastaFile> <queryFile>\n" unless my( $fastaFile, $quer +yFile ) = @ARGV; open my $fh, '<', $queryFile; chomp( my @queries = <$fh> ); close $fh; # trim header @queries = map{ local $_ = $_; s/^>(\S+)\s+\d+/$1/; $_ } @queries; # create indexed database from fasta file my $db = Bio::DB::Fasta->new( $fastaFile ); open my $fh2, '>', 'querySeqs.fa'; for my $query ( @queries ) { my $seqheader = $db->header( $query ); my $seqdna = $db->seq( $seqheader ); say {$fh2} ">$seqheader"; print {$fh2} format_sequence( $seqdna, 60 ); } close $fh2; sub format_sequence { my($sequence, $length) = @_; # sequence and length of line open my $string_fh, '>', \ my $string; # print sequence to string for( my $pos = 0; $pos < length( $sequence ); $pos += $length ) { print {$string_fh} substr( $sequence, $pos, $length ), "\n"; } $string; } output in file querySeqs.fa >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTT TTCTGATTGTTACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGA AATTTCGAACCATTGCTGTTCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACG CTTCAGCTAGTGTTTGATTACGAAGTTTTCTGATTAAATGTGTGAGTTTTTTTGTAGTCA TCTCGAAAACTGAGAAATCCATTTTTATAGATTTACATTGTTCATAGTTATATGTGGAAG TTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGTTTATTGGCATTGCA G >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCT CTGTCTCAGTTCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTT TTTTCCTACTTAGATCTGACTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTAT GTTGAATCTCTCCTTTGATCATTATGTTATTGTAAAAATCCCAGCTTTATGCTAAATCGA GCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTACATTTTTCTGAATTGCCATTCAAT TAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTCGATGCGTGTAAGCT ATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAATTATC TTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGT CCTTTTGCAG >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGG AAAGAGTGGGGTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAA TCTTCAGTTGTTATTGTGTTATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTT AAGCTAGAGCTGTTTTTGAGCTCTTTTGTTGATGATGATTTTGAGATTGTTGGCCGAATT TAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGTGTATTGGTTCTTGTGAGGCAATT TCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTTGTGTGGAGACTGAG CTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG

      Without prejudicing the OPs response in any way, can I ask you, what you think using Bio::DB::Fasta would have over the OPs 50 line module?

      Given that to get it, he would have to try and install it and its 897 codependants -- not to mention AnyDBM_File and at least one of DB_File GDBM_File NDBM_File SDBM_File; all of which are a known problem on my platform -- and somehow resolve the 95 "cannot resolve"s (whatever they are?):

      all to install an indexing module that he doesn't need, and won't benefit from, in order to complete his task in "a couple seconds".


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        It will help if you are looking to retrieve a subsequence from the human genome, the FASTA file of which is about 5 Gb; in the case of parsing a smaller FASTA file with just a few entries, it makes less of a difference.

        Bioinformatics
        Without prejudicing the OPs response in any way, can I ask you, what you think using Bio::DB::Fasta would have over the OPs 50 line module?

        Not having a CS background, I cannot analyze an algorithm for efficiency. However, Bio::DB::Fasta is an exisiting and, among biologists, widely used solution to a common task, presented by the OP. So, like you, I exercise the privilege to offer my solution.

        Given that to get it, he would have to try and install it and its 897 codependants -- not to mention AnyDBM_File and at least one of DB_File GDBM_File NDBM_File SDBM_File; all of which are a known problem on my platform -- and somehow resolve the 95 "cannot resolve"s (whatever they are?):

        So far, I've installed BioPerl on Ubuntu, RedHat, and Mac OS X without problems. If the OP happens to be in a biomedical field, it may be worthwhile to install and use BioPerl as it does solve many genomic tasks.

        all to install an indexing module that he doesn't need, and won't benefit from, in order to complete his task in "a couple seconds".

        "Without prejudicing the OPs response in any way"

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1050420]
Approved by davido
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (8)
As of 2014-04-24 10:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (565 votes), past polls