PerlMonks  

More efficient way to lookup with 2 AoA's.

by BioGeek (Hermit)
on Jul 27, 2004 at 20:39 UTC (#377857, perlquestion)
BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

Hey All, I have 2 ArrayOfArrays, one with a gene name and its score, like this:

    @gene_score = (
        [ "gene_name_0",   "score_0" ],
        [ "gene_name_1",   "score_1" ],
        ...
        [ "gene_name_400", "score_400" ]
    );

and one with a gene name and its start and stop positions on a chromosome:

    @gene_start_stop_chr = (
        [ "gene_name_0",     "start_0",     "stop_0",     "chr_0" ],
        [ "gene_name_1",     "start_1",     "stop_1",     "chr_1" ],
        ...
        [ "gene_name_30000", "start_30000", "stop_30000", "chr_30000" ]
    );

And of course I want to match the scores with the positions, using the gene names, so that I end with an array:

    @results = (
        [ "gene_name_0",   "score_0",   "start_0",   "stop_0",   "chr_0" ],
        [ "gene_name_1",   "score_1",   "start_1",   "stop_1",   "chr_1" ],
        ...
        [ "gene_name_400", "score_400", "start_400", "stop_400", "chr_400" ],
    );

The code I've written so far is (@gene_start_stop_chr abbreviated to @gssc):

    for (my $a = 0; $a < scalar @gene_score; $a++) {
        for (my $b = 0; $b < scalar @gssc; $b++) {
            if ("$gene_score[$a][0]" eq "$gssc[$b][0]") {
                print "$gene_score[$a][0]\t$gene_score[$a][1]\t$gssc[$b][1]\t$gssc[$b][2]\t$gssc[$b][3]\n";
            }
        }
    }

Which works, but is very slow, as I am comparing each of the 400 gene names in my first array with each of the 30000 gene names in the second array. So I was wondering if there are changes I could make to speed things up.
Thanks in advance.
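
To put rough numbers on the slowdown described in the question, here is a small sketch that counts the work done by each approach. The sizes (400 and 30000) come from the question. The variable names are illustrative only, and the hash figure assumes one insert per positioned gene plus one lookup per scored gene.

```perl
use strict;
use warnings;

# Sizes taken from the question; variable names are illustrative.
my $n_scores    = 400;
my $n_positions = 30_000;

# Nested loops: every scored gene is compared with every positioned gene
# (the inner loop never exits early, so this is the exact count).
my $nested_comparisons = $n_scores * $n_positions;

# Hash approach: one insert per positioned gene, then one lookup per scored gene.
my $hash_operations = $n_positions + $n_scores;

printf "nested loops: %d string comparisons\n", $nested_comparisons;
printf "hash lookup:  %d hash operations\n",    $hash_operations;
```

That is 12,000,000 string comparisons against 30,400 hash operations, which is why the replies all point towards a hash.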

Re: More efficient way to lookup with 2 AoA's.
by Zaxo (Archbishop) on Jul 27, 2004 at 20:50 UTC

    Use a hash with the gene names as keys. You can then put all the data in one structure,

        my %gene = (
            gene_name_1 => { start => 'start_1', stop => 'stop_1', chr => 'chr_1' },
            # ...
        );

    You could add the score data there, too, unless it's more dynamic than that and is generated elsewhere.

    After Compline,
    Zaxo

Re: More efficient way to lookup with 2 AoA's.
by rir (Vicar) on Jul 27, 2004 at 21:05 UTC
    Use a hash for your smaller array. Something like:
        $, = " ";

        # just playing with the =>'s
        my @gn_score = (
            [ name_0 => score_0 => ],
            [ name_1 => score_1 => ],
            [ name_2 => score_2 => ],
        );

        my @gn_start_stop_chr = (
            [ name_0 => b_0  => e_0 => ],
            [ name_1 => b_1  => e_1 => ],
            [ name_2 => b_2  => e_2 => ],
            [ name_0 => b_30 => e_3 => ],
            [ name_2 => b_42 => e_4 => ],
            [ name_1 => b_51 => e_5 => ],
        );

        my %score;
        $score{$_->[0]} = $_->[1] for (@gn_score);

        for (@gn_start_stop_chr) {
            my ( $name => $begin => $end => ) = @$_;
            die unless exists $score{$name};
            print $name,    # or stash your data somewhere
                $score{$name}, $begin, $end, $/;
        };

Re: More efficient way to lookup with 2 AoA's.
by CountZero (Bishop) on Jul 27, 2004 at 21:08 UTC
    Dump both AoA's into a database (each in its own table) and do a SELECT on both tables joined on the key field gene_name. You will have to persist the AoA's somehow, or are you going to input them by hand each time (or perhaps read them from a flat file)? What better way, then, than to put them in a database from the start?
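
A minimal sketch of the database route, assuming the DBI and DBD::SQLite modules are installed (neither ships with core Perl). The table names (gene_score, gene_position), the column names, and the in-memory database are illustrative choices, not something specified in the thread.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical sample data in the same shape as the question's arrays.
my @gene_score = ( [ 'gene_name_0', 'score_0' ], [ 'gene_name_1', 'score_1' ] );
my @gene_start_stop_chr = (
    [ 'gene_name_0', 'start_0', 'stop_0', 'chr_0' ],
    [ 'gene_name_1', 'start_1', 'stop_1', 'chr_1' ],
    [ 'gene_name_2', 'start_2', 'stop_2', 'chr_2' ],
);

# In-memory SQLite database (assumption: DBD::SQLite is available).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# One table per AoA, keyed on the gene name.
$dbh->do('CREATE TABLE gene_score (gene_name TEXT PRIMARY KEY, score TEXT)');
$dbh->do('CREATE TABLE gene_position '
       . '(gene_name TEXT PRIMARY KEY, start_pos TEXT, stop_pos TEXT, chr TEXT)');

my $insert_score = $dbh->prepare('INSERT INTO gene_score VALUES (?, ?)');
$insert_score->execute(@$_) for @gene_score;

my $insert_position = $dbh->prepare('INSERT INTO gene_position VALUES (?, ?, ?, ?)');
$insert_position->execute(@$_) for @gene_start_stop_chr;

# Join the two tables on gene_name; each row comes back as an array ref.
my $results = $dbh->selectall_arrayref(<<'SQL');
SELECT s.gene_name, s.score, p.start_pos, p.stop_pos, p.chr
FROM gene_score s
JOIN gene_position p ON p.gene_name = s.gene_name
ORDER BY s.gene_name
SQL

print join( "\t", @$_ ), "\n" for @$results;
$dbh->disconnect;
```

The inner join drops gene_name_2 because it has no score, so the result has the same five-column shape as the @results array the question asks for.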

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: More efficient way to lookup with 2 AoA's.
by BrowserUk (Pope) on Jul 27, 2004 at 21:09 UTC

    Like everyone says--whenever you need to do a lookup in Perl: Think hashes,

        #! perl -slw
        use strict;
        use Data::Dumper;

        my @gene_score = (
            [ "gene_name_0",   "score_0" ],
            [ "gene_name_1",   "score_1" ],
            # ...
            [ "gene_name_400", "score_400" ]
        );

        my @gene_start_stop_chr = (
            [ "gene_name_0",     "start_0",     "stop_0",     "chr_0" ],
            [ "gene_name_1",     "start_1",     "stop_1",     "chr_1" ],
            # ...
            [ "gene_name_400",   "start_400",   "stop_400",   "chr_400" ],
            [ "gene_name_30000", "start_30000", "stop_30000", "chr_30000" ]
        );

        ## Build a hash from the lookup array
        my %gene_start_stop_chr = map{
            $_->[ 0 ] => [ @{ $_ }[ 1 .. 3 ] ]
        } @gene_start_stop_chr;

        ## Use it to map the inputs to results
        my @results = map{
            [ $_->[ 0 ], $_->[ 1 ], @{ $gene_start_stop_chr{ $_->[ 0 ] } } ]
        } @gene_score;

        print Dumper \@results;

        __END__
        P:\test>377857
        $VAR1 = [
            [ 'gene_name_0',   'score_0',   'start_0',   'stop_0',   'chr_0' ],
            [ 'gene_name_1',   'score_1',   'start_1',   'stop_1',   'chr_1' ],
            [ 'gene_name_400', 'score_400', 'start_400', 'stop_400', 'chr_400' ]
        ];
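
The map above assumes every name in @gene_score also appears in the lookup hash. As a hedged variation, the sketch below checks with exists first and collects any unmatched names instead of dereferencing a missing entry. The sample data, the %position_for name, and the @missing list are illustrative additions, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as the question's arrays.
my @gene_score = (
    [ 'gene_name_0', 'score_0' ],
    [ 'gene_name_1', 'score_1' ],
    [ 'gene_name_x', 'score_x' ],    # deliberately absent from the lookup data
);
my @gene_start_stop_chr = (
    [ 'gene_name_0', 'start_0', 'stop_0', 'chr_0' ],
    [ 'gene_name_1', 'start_1', 'stop_1', 'chr_1' ],
);

# Build the lookup hash: gene name => [ start, stop, chr ].
my %position_for = map { $_->[0] => [ @{$_}[ 1 .. 3 ] ] } @gene_start_stop_chr;

my ( @results, @missing );
for my $row (@gene_score) {
    my ( $name, $score ) = @$row;
    if ( exists $position_for{$name} ) {
        push @results, [ $name, $score, @{ $position_for{$name} } ];
    }
    else {
        push @missing, $name;    # remember it rather than die
    }
}

# Tab-separated output, one matched gene per line.
print join( "\t", @$_ ), "\n" for @results;
warn "No position data for: @missing\n" if @missing;
```

Whether an unmatched name should be skipped, reported, or treated as fatal depends on the data; this version reports it on STDERR and carries on.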

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re: More efficient way to lookup with 2 AoA's.
by bgreenlee (Friar) on Jul 27, 2004 at 21:12 UTC

    I wrote something up, but Zaxo (and now rir) beat me to the punch, so instead here's some code to convert your arrays into a single hash:

        my %gene = ();
        foreach (@gene_score) {
            $gene{$_->[0]}->{score} = $_->[1];
        }
        foreach (@gssc) {
            $gene{$_->[0]}->{start} = $_->[1];
            $gene{$_->[0]}->{stop}  = $_->[2];
            $gene{$_->[0]}->{chr}   = $_->[3];
        }

    Now %gene looks like:

        %gene = (
            gene_name_0 => { score => 'score_0', start => 'start_0', stop => 'stop_0', chr => 'chr_0' },
            gene_name_1 => { score => 'score_1',
            ...
        );
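
Because the second loop adds an entry for every gene in @gssc, the combined hash will also hold genes that have no score. A small sketch of reading it back, assuming only scored genes are wanted in the output; the three-gene sample and the @scored name are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample in the shape produced by the two loops above.
my %gene = (
    gene_name_0 => { score => 'score_0', start => 'start_0', stop => 'stop_0', chr => 'chr_0' },
    gene_name_1 => { score => 'score_1', start => 'start_1', stop => 'stop_1', chr => 'chr_1' },
    gene_name_2 => { start => 'start_2', stop => 'stop_2', chr => 'chr_2' },    # no score
);

# Keep only the genes that have a score, in a stable (sorted) order.
my @scored = sort { $a cmp $b } grep { exists $gene{$_}{score} } keys %gene;

for my $name (@scored) {
    # A hash slice pulls the four fields out in a fixed order.
    print join( "\t", $name, @{ $gene{$name} }{qw(score start stop chr)} ), "\n";
}
```

Sorting the keys is optional; it is there because hash order in Perl is not predictable.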

    Brad
