PerlMonks  

search and extract from a large hash

by reubs85 (Acolyte)
on Jun 07, 2011 at 10:54 UTC (#908447=perlquestion)
reubs85 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, may I request your consummate wisdom on a wee question I have?

I have a hash, where the key is a unique ID tag and the value is the genetic data for the gene corresponding to that ID. In the kinds of analyses I do, I am generally dealing with relatively large hashes, maybe ~15000 key/value pairs kinda thing. I want to extract the sequence information for a given subset of genes for which I have the IDs stored in an array. I can do it with the following code, which works pretty well (it uses a bit of BioPerl...):

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

my $uniqueFile = $ARGV[0];
my $goodProteinsFile = $ARGV[1];

## imports a bunch of gene IDs
open (FILE, $uniqueFile);
my @data_in = <FILE>;
close FILE;

## create lookup hash; key = ID, val = sequence
my %goodProteins_hash;
my $in = Bio::SeqIO->new(-file=>$goodProteinsFile, -format=>'Fasta');
while (my $seq = $in -> next_seq() ) {
    my $id = $seq -> display_id();
    my $seq_string = $seq -> seq();
    $goodProteins_hash{$id} = $seq_string;
}

my $file_out = "strainSpecific_seqData.protein.fasta";

## iterate thru @data_in; if $id eq $_ then get at the value in
## %goodProteins hash and store it in %strSpec... takes a while!
my %strSpec;
foreach (@data_in) {
    chomp ($_);
    while (my ($id, $seq) = each %goodProteins_hash) {
        if ($_ =~ /($id)$/) {
            $strSpec_protein_hash{$id} = $seq;
        }
    }
}

open (OUT, ">strainSpecific_seqData.protein.fasta");
while (my ($k, $v) = each %strSpec) {
    ## print to file
    print OUT "\>$k\n$v\n";
}
close OUT;

print "- Finished\n";

So I am getting to the sequence data for the IDs I want by looping through the '@data_in' array, then using a 'while each' on the %goodProteins_hash followed by an if... Perhaps not surprisingly this takes quite a long time per input ID, and if I want to get out a lot of sequences it takes ages!

So my question is: is there a quicker and more efficient way of doing something like this?? I tried playing around with grep and exists etc but I couldn't get it to do what I wanted...

Your responses, as always, are very much appreciated!

Thanks :-)

Re: search and extract from a large hash
by jethro (Monsignor) on Jun 07, 2011 at 11:05 UTC

    If I read your code correctly, you only need this:

    foreach (@data_in) {
        chomp;
        $strSpec_protein_hash{$_} = $goodProteins_hash{$_};
    }

    This is exactly what hashes are good at and your loop is in a way removing that advantage again ;-)
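One thing the direct lookup does differently from the original loop: an ID that is not in the lookup hash still gets an entry, with an undefined value. A minimal sketch of guarding against that with exists follows. The sample IDs and sequences are made up for illustration, and the @missing array is a hypothetical addition, not part of the original script. It also assumes each input line holds just the ID; the original regex only required the line to end with the ID, so lines with leading text would need the ID extracted first.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the data the real script builds from its input files.
my %goodProteins_hash = (
    geneA => 'MKTAYIAK',
    geneB => 'MSLLTEVE',
    geneC => 'MAHHHHHH',
);
my @data_in = ("geneA\n", "geneX\n", "geneC\n");

my %strSpec_protein_hash;
my @missing;    # IDs with no sequence in the lookup hash (illustrative extra)

foreach my $id (@data_in) {
    chomp $id;
    if ( exists $goodProteins_hash{$id} ) {
        # One hash lookup per ID, instead of scanning every key
        $strSpec_protein_hash{$id} = $goodProteins_hash{$id};
    }
    else {
        push @missing, $id;
    }
}

warn "No sequence found for: @missing\n" if @missing;
print ">$_\n$strSpec_protein_hash{$_}\n" for sort keys %strSpec_protein_hash;
```

Each ID now costs a single lookup rather than a pass over all ~15000 keys, which is where the speed-up comes from.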

      Holy crap, you have no idea how much time that will save me!! I suspected all along that I was doing something daft, but I'm still getting to grips with Perl so once I'd found a way that worked I was reluctant to fiddle...

      Thanks again man

      Reuben

      Or just this:

      chomp @data_in;
      @strSpec_protein_hash{ @data_in } = @goodProteins_hash{ @data_in };
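The slice copies every requested ID, including ones that are absent from the lookup hash, which end up with undefined values. A possible variation, sketched here with made-up sample data, filters the ID list with grep and exists before slicing. The @found array is a hypothetical name introduced for this example.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real lookup hash and ID list.
my %goodProteins_hash = (
    geneA => 'MKTAYIAK',
    geneB => 'MSLLTEVE',
    geneC => 'MAHHHHHH',
);
my @data_in = ("geneA\n", "geneX\n", "geneC\n");

chomp @data_in;

# Keep only the IDs that actually have a sequence
my @found = grep { exists $goodProteins_hash{$_} } @data_in;

# Hash slice: assign all matching key/value pairs in one statement
my %strSpec_protein_hash;
@strSpec_protein_hash{@found} = @goodProteins_hash{@found};

print ">$_\n$strSpec_protein_hash{$_}\n" for sort keys %strSpec_protein_hash;
```

This is probably the grep and exists combination the question was reaching for: grep selects the IDs, and the slice does the extraction.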
Re: search and extract from a large hash
by johngg (Abbot) on Jun 07, 2011 at 16:16 UTC

    A few further points to note:-

    • Use the three-argument form of open with lexical file handles and check for success, giving the o/s error on failure, e.g.

      open my $uniqueFH, '<', $uniqueFile or die "open: < $uniqueFile: $!\n";

    • You can do your chomping in one fell swoop

      chomp( my @data_in = <$uniqueFH> );

      rather than piecemeal.

    • You could consider using a hash slice rather than looping although this might not be practicable with very large data sets

      my %strSpec_protein_hash;
      @strSpec_protein_hash{ @data_in } = @goodProteins_hash{ @data_in };

    • I wonder if you have a cut'n'paste error in your post, my %strSpec; versus $strSpec_protein_hash{$id} = $seq;
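The points above can be put together roughly as follows. This is only a sketch: to keep it self-contained it reads the ID list from, and writes the FASTA output to, in-memory strings (an assumption made for illustration), and the lookup hash is filled with made-up sequences instead of being built with Bio::SeqIO. In the real script the two open calls would take $uniqueFile and $file_out.

```perl
use strict;
use warnings;

# Made-up data; the real script builds this hash with Bio::SeqIO.
my %goodProteins_hash = (
    geneA => 'MKTAYIAK',
    geneB => 'MSLLTEVE',
    geneC => 'MAHHHHHH',
);

# In-memory stand-in for the contents of $uniqueFile (illustration only)
my $uniqueFile_contents = "geneA\ngeneC\n";

# Three-argument open, lexical file handle, checked for success
open my $uniqueFH, '<', \$uniqueFile_contents
    or die "open: < in-memory ID list: $!\n";
chomp( my @data_in = <$uniqueFH> );    # read and chomp in one go
close $uniqueFH or die "close: $!\n";

# Hash slice, using one consistently named hash throughout
my %strSpec_protein_hash;
@strSpec_protein_hash{@data_in} = @goodProteins_hash{@data_in};

# In-memory stand-in for the output file (illustration only)
my $fasta = '';
open my $outFH, '>', \$fasta
    or die "open: > in-memory output: $!\n";
for my $id ( sort keys %strSpec_protein_hash ) {
    print {$outFH} ">$id\n$strSpec_protein_hash{$id}\n";
}
close $outFH or die "close: $!\n";

print $fasta;
```

Declaring and filling the same hash name also avoids the %strSpec versus %strSpec_protein_hash mismatch, which use strict would otherwise report at compile time.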

    I hope these points are helpful.

    Cheers,

    JohnGG

Node Type: perlquestion [id://908447]
Approved by Corion
Front-paged by Corion