PerlMonks  

A better way to make the script run faster?

by rocketperl (Sexton)
on Jul 31, 2013 at 13:03 UTC ( [id://1047258]=perlquestion )

rocketperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I basically have two arrays (populated from an input file), and all I have to do is search for each value of one array in the other and record the number of occurrences as well. My main query array has 375 values and my search array has 73372 values, so the total number of comparisons will be 27514500 in the worst case, and the program is taking ages to finish. I also have to add more functions to this script, and I'm afraid of how much longer that is going to take. Please advise me why my program is so slow and suggest any alternative ideas that I could use.
Sample values of @gl: CD9 TBN NANOG KITL FUT4 SALL4 MYC STAT3 ESRRB AKP2 SOX2 POU5F1 KLF4
Sample of @hfr_genes: LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1
My code:

    do {
        if ( my $index_gl = grep { $hfr_genes[$_] =~ /^$gl[$index]$/ } 0 .. $#hfr_genes ) {
            print TEST "Val $gl[$index] is present $index_gl times\n";
            $index++;
        }
        else {
            $index++;
        }
    } until ( $index == ( scalar(@hfr_genes) ) );

Please help! Thanks.

Replies are listed 'Best First'.
Re: A better way to make the script run faster?
by Skeeve (Parson) on Jul 31, 2013 at 13:15 UTC

    Why do you have the strings in arrays and not in hashes?

    I'd put the array you search in into a hash, where the array elements are the keys and each value is the number of occurrences.

    Finding the number of occurrences for each of your searched keys is then simply a matter of getting the value from the hash.

    foreach my $g (@gl) {
        print "$g occurs ", ( $hfr_genes{$g} || 0 ), " time(s)\n";
    }
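    The loop above assumes %hfr_genes already holds one count per gene. A minimal sketch of building it in a single pass, assuming the search values sit in @hfr_genes as in the question; the sub name count_occurrences and the sample data are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: tally how often each value occurs in a list.
sub count_occurrences {
    my @values = @_;
    my %count;
    $count{$_}++ for @values;    # one pass; each key maps to its tally
    return %count;
}

# Illustrative sample data, not the poster's real input.
my @gl        = qw( CD9 STAT3 SOX2 );
my @hfr_genes = qw( LYPLA1 STAT3 LYPLA1 STAT3 SOX2 );

my %hfr_genes = count_occurrences(@hfr_genes);

foreach my $g (@gl) {
    print "$g occurs ", ( $hfr_genes{$g} || 0 ), " time(s)\n";
}
```

    Storing a count (rather than, say, an array index) as the hash value is what makes the lookup return the number of occurrences.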

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Thank you! I'm sorry, but my knowledge of Perl is basic, as I'm just a beginner. Hashes did the job really well. But the $hfr_genes{$g} in the print statement prints the location (minus 1) of the occurrence of the value. Is there a way I can record the number of occurrences? Thanks again!
        See tobyink's answer below.

        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: A better way to make the script run faster?
by tobyink (Canon) on Jul 31, 2013 at 13:22 UTC

    Use a frickin hash!

    use strict;
    use warnings;

    my @gl = qw(
        CD9 TBN NANOG KITL FUT4 SALL4 MYC
        STAT3 ESRRB AKP2 SOX2 POU5F1 KLF4
    );

    my @hfr_genes = qw(
        LYPLA1 LYPLA1 LYPLA1 LYPLA1 STAT3 LYPLA1 LYPLA1
        STAT3 LYPLA1 LYPLA1 LYPLA1 LYPLA1 SOX2
    );

    # First thing: convert @hfr_genes into a hash!!!
    # Looking up a hash key is much faster than grepping an array.
    my %hfr_genes;
    $hfr_genes{$_}++ for @hfr_genes;

    # Now loop through the first list
    for (@gl) {
        my $count = $hfr_genes{$_};
        print "Value $_ is present $count times\n" if $count;
    }
    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: A better way to make the script run faster?
by SuicideJunkie (Vicar) on Jul 31, 2013 at 13:27 UTC

    Doing 27 million of anything is going to take a while.

    I should note that =~/^$gl[$index]$/ is better written as eq $gl[$index] unless you have regex metacharacters in your @gl which you have not shown.
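    A small sketch of the difference, with sub names and sample strings made up for illustration: an interpolated pattern treats characters such as . as regex syntax, whereas eq compares the strings literally. If a pattern match really is wanted, \Q...\E quotes the metacharacters.

```perl
use strict;
use warnings;

# Literal string comparison; no regex engine involved.
sub matches_eq {
    my ( $candidate, $wanted ) = @_;
    return $candidate eq $wanted ? 1 : 0;
}

# Anchored regex with the wanted string interpolated unquoted,
# so any metacharacters in it are interpreted as regex syntax.
sub matches_regex {
    my ( $candidate, $wanted ) = @_;
    return $candidate =~ /^$wanted$/ ? 1 : 0;
}

# Same regex, but \Q...\E makes the interpolated string literal.
sub matches_quoted {
    my ( $candidate, $wanted ) = @_;
    return $candidate =~ /^\Q$wanted\E$/ ? 1 : 0;
}

# "A.1" as a pattern also matches "AB1", because . matches any character.
for my $test ( [ 'A.1', 'A.1' ], [ 'AB1', 'A.1' ] ) {
    my ( $candidate, $wanted ) = @$test;
    printf "%s vs %s: eq=%d regex=%d quoted=%d\n",
        $candidate, $wanted,
        matches_eq( $candidate, $wanted ),
        matches_regex( $candidate, $wanted ),
        matches_quoted( $candidate, $wanted );
}
```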

    A better way to go about this is to scan @hfr_genes and count the genes using a hash, which will end up looking like {LYPLA1 => 13245, CD9 => 42, MYC => 13, ...}. That way, you only scan through your 73372 values ONCE. Then print the tallies by looping over the 375 values once.

    73372 + 375 << 73372 * 375 ;)
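    As a rough sketch of that arithmetic with the sizes quoted in the question (the sub names are invented here, and the figures count steps, not wall-clock time, since a hash lookup and a regex match do not cost the same):

```perl
use strict;
use warnings;

# Comparisons needed when every query value is checked against every item.
sub nested_scan_steps {
    my ( $queries, $items ) = @_;
    return $queries * $items;
}

# Steps needed when the items are tallied once and each query is one lookup.
sub hash_steps {
    my ( $queries, $items ) = @_;
    return $items + $queries;
}

my ( $queries, $items ) = ( 375, 73372 );    # sizes quoted in the question

printf "nested scan: %d steps\n", nested_scan_steps( $queries, $items );
printf "hash tally:  %d steps\n", hash_steps( $queries, $items );
```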

Re: A better way to make the script run faster?
by ww (Archbishop) on Jul 31, 2013 at 13:17 UTC
    Big G and Super Search are your friends; "intersection" is probably the keyword of choice.
    If I've misconstrued your question or the logic needed to answer it, I offer my apologies to all those electrons which were inconvenienced by the creation of this post.
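    For reference, one common way to compute an intersection in Perl is with a lookup hash. This is only a sketch under the assumption that "intersection" here means the distinct values present in both lists; the sub name and sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: distinct values of the first list that also
# occur in the second list, in first-list order.
sub intersection {
    my ( $first, $second ) = @_;
    my %in_second = map { $_ => 1 } @$second;    # lookup table for list two
    my %emitted;                                 # guards against duplicates
    return grep { $in_second{$_} && !$emitted{$_}++ } @$first;
}

# Illustrative sample data, not the poster's real input.
my @gl        = qw( CD9 STAT3 SOX2 STAT3 );
my @hfr_genes = qw( LYPLA1 STAT3 LYPLA1 SOX2 );

print join( ' ', intersection( \@gl, \@hfr_genes ) ), "\n";
```

    Note that an intersection only says which values are shared; the occurrence counts asked for in the question still need a tally hash.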