
search and extract from a large hash

by reubs85 (Acolyte)
on Jun 07, 2011 at 10:54 UTC (#908447)
reubs85 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, may I request your consummate wisdom on a wee question I have?

I have a hash, where the key is a unique ID tag and the value is the genetic data for the gene corresponding to that ID. In the kinds of analyses I do, I am generally dealing with relatively large hashes, maybe ~15000 key/value pairs kinda thing. I want to extract the sequence information for a given subset of genes for which I have the IDs stored in an array. I can do it with the following code, which works pretty well (it uses a bit of BioPerl...):

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

my $uniqueFile = $ARGV[0];
my $goodProteinsFile = $ARGV[1];

## imports a bunch of gene IDs
open (FILE, $uniqueFile);
my @data_in = <FILE>;
close FILE;

## create lookup hash; key = ID, val = sequence
my %goodProteins_hash;
my $in = Bio::SeqIO->new(-file=>$goodProteinsFile, -format=>'Fasta');
while (my $seq = $in -> next_seq() ) {
    my $id = $seq -> display_id();
    my $seq_string = $seq -> seq();
    $goodProteins_hash{$id} = $seq_string;
}

my $file_out = "strainSpecific_seqData.protein.fasta";

## iterate thru @data_in; if $id eq $_ then get at the value in
## %goodProteins hash and store it in %strSpec... takes a while!
my %strSpec;
foreach (@data_in) {
    chomp ($_);
    while (my ($id, $seq) = each %goodProteins_hash) {
        if ($_ =~ /($id)$/) {
            $strSpec_protein_hash{$id} = $seq;
        }
    }
}

open (OUT, ">strainSpecific_seqData.protein.fasta");
while (my ($k, $v) = each %strSpec) {
    ## print to file
    print OUT "\>$k\n$v\n";
}
close OUT;

print "- Finished\n";

So I am getting to the sequence data for the IDs I want by looping through the '@data_in' array, then using a 'while each' on the %goodProteins_hash followed by an if... Perhaps not surprisingly this takes quite a long time per input ID, and if I want to get out a lot of sequences it takes ages!

So my question is: is there a quicker and more efficient way of doing something like this?? I tried playing around with grep and exists etc but I couldn't get it to do what I wanted...

Your responses, as always, are very much appreciated!

Thanks :-)

Replies are listed 'Best First'.
Re: search and extract from a large hash
by jethro (Monsignor) on Jun 07, 2011 at 11:05 UTC

    If I read your code correctly, you only need this:

    foreach (@data_in) {
        chomp;
        $strSpec_protein_hash{$_} = $goodProteins_hash{$_};
    }

    This is exactly what hashes are good at and your loop is in a way removing that advantage again ;-)
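
A minimal sketch of the same direct lookup with an `exists` guard, so that IDs missing from the lookup hash are reported rather than stored with an undefined value. It assumes each line of `@data_in` is exactly one ID; the original loop matched with `/($id)$/`, which also accepts lines that merely end in the ID. The gene IDs, sequences and the `@missing` array are hypothetical stand-ins, not data from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the lookup hash and the ID list in the question.
my %goodProteins_hash = (
    gene_001 => 'MKVLAA',
    gene_002 => 'MSTNPK',
    gene_003 => 'MAGWLL',
);
my @data_in = ( "gene_001\n", "gene_003\n", "gene_999\n" );

my ( %strSpec_protein_hash, @missing );
foreach my $id (@data_in) {
    chomp $id;
    if ( exists $goodProteins_hash{$id} ) {
        # One hash lookup per ID instead of a scan over every key.
        $strSpec_protein_hash{$id} = $goodProteins_hash{$id};
    }
    else {
        push @missing, $id;    # ID is not a key of the lookup hash
    }
}

# Print the extracted records in FASTA style, sorted by ID.
print ">$_\n$strSpec_protein_hash{$_}\n" for sort keys %strSpec_protein_hash;
warn "No sequence found for: @missing\n" if @missing;
```

Each ID now costs a single lookup, so the run time grows with the number of IDs rather than with the number of IDs times the size of the hash.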

      Holy crap, you have no idea how much time that will save me!! I suspected all along that I was doing something daft, but I'm still getting to grips with Perl so once I'd found a way that worked I was reluctant to fiddle...

      Thanks again man


      Or just this:

      chomp @data_in;
      @strSpec_protein_hash{ @data_in } = @goodProteins_hash{ @data_in };
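
One thing worth knowing about the slice form: an ID that is not a key of the lookup hash still ends up in the result, with an undefined value. A possible way around that, sketched here with hypothetical IDs and sequences, is to filter the list with `grep` and `exists` before slicing. The `@found` array is an illustrative name, not something from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: gene_404 is deliberately absent from the lookup hash.
my %goodProteins_hash = ( gene_001 => 'MKVLAA', gene_002 => 'MSTNPK' );
my @data_in = ( "gene_002\n", "gene_404\n" );

chomp @data_in;

# Keep only the IDs that really are keys of the lookup hash.
my @found = grep { exists $goodProteins_hash{$_} } @data_in;

# Hash slice over the filtered list: no undefined values are copied.
my %strSpec_protein_hash;
@strSpec_protein_hash{@found} = @goodProteins_hash{@found};

print join( ',', sort keys %strSpec_protein_hash ), "\n";
```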
Re: search and extract from a large hash
by johngg (Abbot) on Jun 07, 2011 at 16:16 UTC

    A few further points to note:-

    • Use the three-argument form of open with lexical file handles and check for success, giving the o/s error on failure, e.g.

      open my $uniqueFH, '<', $uniqueFile or die "open: < $uniqueFile: $!\n";

    • You can do your chomping in one fell swoop

      chomp( my @data_in = <$uniqueFH> );

      rather than piecemeal.

    • You could consider using a hash slice rather than looping although this might not be practicable with very large data sets

      my %strSpec_protein_hash;
      @strSpec_protein_hash{ @data_in } = @goodProteins_hash{ @data_in };

    • I wonder if you have a cut'n'paste error in your post: my %strSpec; is declared, but the loop assigns to $strSpec_protein_hash{$id} = $seq;

    I hope these points are helpful.
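
The points above could be combined roughly as follows. This is only a sketch under stated assumptions: the ID file is a temporary file created on the spot so the example runs on its own, and the lookup hash is filled with hypothetical sequences instead of being read through Bio::SeqIO as in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical ID file, written to a temp file so the sketch is self-contained.
my ( $tmpFH, $uniqueFile ) = tempfile( UNLINK => 1 );
print {$tmpFH} "gene_001\ngene_003\n";
close $tmpFH or die "close: $uniqueFile: $!\n";

# Hypothetical lookup hash; the question builds it with Bio::SeqIO.
my %goodProteins_hash = (
    gene_001 => 'MKVLAA',
    gene_002 => 'MSTNPK',
    gene_003 => 'MAGWLL',
);

# Three-argument open with a lexical file handle, checked for success.
open my $uniqueFH, '<', $uniqueFile or die "open: < $uniqueFile: $!\n";
chomp( my @data_in = <$uniqueFH> );    # read and chomp in one step
close $uniqueFH or die "close: $uniqueFile: $!\n";

# Hash slice: copy the wanted key/value pairs in one statement.
my %strSpec_protein_hash;
@strSpec_protein_hash{@data_in} = @goodProteins_hash{@data_in};

# Print FASTA-style records, sorted by ID.
for my $id ( sort keys %strSpec_protein_hash ) {
    print ">$id\n$strSpec_protein_hash{$id}\n";
}
```

Declaring `%strSpec_protein_hash` with `my` under `use strict` also removes the mismatch between the declared and the assigned hash name noted in the last bullet.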


