PerlMonks  

A better way to make the script run faster?

by rocketperl (Acolyte)
on Jul 31, 2013 at 13:03 UTC (#1047258)
rocketperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I basically have two arrays (populated from an input file), and all I have to do is search for the values from one array in the other and record the number of occurrences as well. My main query array has 375 values and my other search array has 73372 values, so the total number of searches will be 27514500 in the worst case, and the program is taking ages to finish. I also have to add more functions to this script, and I'm afraid of how much longer that is going to take. Please advise me on why my program is so slow, and on any alternative ideas that I could use.
Sample values of @gl: CD9 TBN NANOG KITL FUT4 SALL4 MYC STAT3 ESRRB AKP2 SOX2 POU5F1 KLF4
Sample values of @hfr_genes: LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1 LYPLA1
My code:

    do {
        if ( my $index_gl = grep { $hfr_genes[$_] =~ /^$gl[$index]$/ } 0 .. $#hfr_genes ) {
            print TEST "Val $gl[$index] is present $index_gl times\n";
            $index++;
        }
        else {
            $index++;
        }
    } until ( $index == scalar(@hfr_genes) );

Please help! Thanks.

Re: A better way to make the script run faster?
by Skeeve (Vicar) on Jul 31, 2013 at 13:15 UTC

    Why do you have the strings in arrays and not in hashes?

    I'd put the array you search through into a hash, where the array elements are the keys and each value is the number of occurrences.

    Finding the number of occurrences for each of your search keys is then simply a matter of getting the value from the hash.

    foreach my $g (@gl) {
        print "$g occurs ", ($hfr_genes{$g} || 0), " time(s)\n";
    }

      Thank you! I'm sorry, but my knowledge of Perl is basic as I'm just a beginner. Hashes did the job really well. But the $hfr_genes{$g} in the print statement prints the location (offset by -1) of each value's occurrence. Is there a way I can record the number of occurrences instead? Thanks again!
        See tobyink's answer below.

Re: A better way to make the script run faster?
by ww (Bishop) on Jul 31, 2013 at 13:17 UTC
    Big G and Super Search are your friends; "intersection" is probably the keyword of choice.
    If I've misconstrued your question or the logic needed to answer it, I offer my apologies to all those electrons which were inconvenienced by the creation of this post.
Re: A better way to make the script run faster?
by tobyink (Abbot) on Jul 31, 2013 at 13:22 UTC

    Use a frickin hash!

    use strict;
    use warnings;

    my @gl = qw(
        CD9 TBN NANOG KITL FUT4 SALL4 MYC
        STAT3 ESRRB AKP2 SOX2 POU5F1 KLF4
    );
    my @hfr_genes = qw(
        LYPLA1 LYPLA1 LYPLA1 LYPLA1 STAT3 LYPLA1 LYPLA1
        STAT3 LYPLA1 LYPLA1 LYPLA1 LYPLA1 SOX2
    );

    # First thing: convert @hfr_genes into a hash!!!
    # Looking up a hash key is much faster than grepping an array.
    my %hfr_genes;
    $hfr_genes{$_}++ for @hfr_genes;

    # Now loop through the first list
    for (@gl) {
        my $count = $hfr_genes{$_};
        print "Value $_ is present $count times\n" if $count;
    }
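
    Since the original post says the arrays are populated from an input file, the counting could also happen while reading, without building @hfr_genes at all. The following is only a minimal sketch: it assumes one gene name per line (the post does not show the file format), uses an in-memory filehandle as a stand-in for the real file, and count_genes is a hypothetical helper name. The // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: tally every non-empty line read from a filehandle.
sub count_genes {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;    # skip blank lines
        $count{$line}++;
    }
    return \%count;
}

# Stand-in for the real input file (assumed: one gene name per line).
my $data = join "\n", qw( LYPLA1 LYPLA1 STAT3 LYPLA1 STAT3 SOX2 ), '';
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_genes($fh);
close $fh;

# Report every query value, including those never seen.
my @gl = qw( CD9 STAT3 SOX2 MYC );
for my $gene (@gl) {
    my $n = $count->{$gene} // 0;    # missing keys count as 0
    print "Value $gene is present $n times\n";
}
```

    Unlike the loop above, this variant also prints query values that occur zero times; drop the // 0 and add an "if" guard to skip them.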
Re: A better way to make the script run faster?
by SuicideJunkie (Priest) on Jul 31, 2013 at 13:27 UTC

    Doing 27 million of anything is going to take a while.

    I should note that =~/^$gl[$index]$/ is better written as eq $gl[$index] unless you have regex metacharacters in your @gl which you have not shown.
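
    To illustrate the difference, here is a small sketch using a made-up value that contains a regex metacharacter (the "+" is an assumption for demonstration only; none of the sample genes in the thread contain one). It compares the interpolated pattern, eq, and a pattern quoted with \Q...\E.

```perl
use strict;
use warnings;

my $wanted = 'A+B';                    # hypothetical value with a metacharacter
my @values = ( 'A+B', 'AAB', 'AB' );

# Interpolated as a pattern, "A+B" means "one or more A, then B".
my @regex_hits = grep { /^$wanted$/ } @values;

# eq compares the two strings literally.
my @eq_hits = grep { $_ eq $wanted } @values;

# \Q...\E quotes metacharacters, so the pattern matches literally.
my @quoted_hits = grep { /^\Q$wanted\E$/ } @values;

print "regex:  @regex_hits\n";
print "eq:     @eq_hits\n";
print "quoted: @quoted_hits\n";
```

    The unquoted pattern matches AAB and AB but not the literal string A+B, while eq and the quoted pattern match only A+B.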

    A better way to go about this is to scan @hfr_genes and count the genes using a hash, which will look like { LYPLA1 => 13245, CD9 => 42, MYC => 13, ... }. That way, you only scan through your 73372 values ONCE. Then print the tallies by looping over the 375 values once.

    73372 + 375 << 73372 * 375 ;)
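
    The arithmetic can be checked on a small scale. This sketch uses invented sample data, far smaller than the 375 and 73372 values in the question. It runs both approaches, confirms that they agree, and tallies how many element visits each one makes.

```perl
use strict;
use warnings;

# Invented sample data, far smaller than the lists in the question.
my @gl        = qw( CD9 STAT3 SOX2 MYC );
my @hfr_genes = qw( LYPLA1 STAT3 LYPLA1 SOX2 STAT3 LYPLA1 );

# Approach 1: rescan @hfr_genes for every query value.
my ( %nested, $nested_steps );
for my $g (@gl) {
    for my $h (@hfr_genes) {
        $nested_steps++;
        $nested{$g}++ if $h eq $g;
    }
}

# Approach 2: one pass to count, one pass to look up.
my ( %seen, %hashed, $hash_steps );
for my $h (@hfr_genes) {
    $hash_steps++;
    $seen{$h}++;
}
for my $g (@gl) {
    $hash_steps++;
    $hashed{$g} = $seen{$g} if $seen{$g};
}

# Both approaches must report the same counts.
for my $g (@gl) {
    die "Mismatch for $g\n" if ( $nested{$g} // 0 ) != ( $hashed{$g} // 0 );
}

print "nested loop: $nested_steps steps\n";    # 4 * 6
print "hash:        $hash_steps steps\n";      # 6 + 4
```

    With the sizes from the question, the same tallies would be 27514500 versus 73747.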

Re: A better way to make the script run faster?
by sundialsvc4 (Monsignor) on Aug 01, 2013 at 00:58 UTC

    Naturally, another possibility is to use SQL tables (or an SQLite file). If you have entries in table-A that represent search-keys, and you want to know how many records (if any...) exist in table-B for each one of those keys, then a very simple LEFT OUTER JOIN query with a GROUP BY clause will produce the answer, all at once, with no programming involved, Perl or otherwise. Is that a useful possibility here?
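
    For readers who want to see what such a query might look like, here is a rough sketch using DBI with an in-memory SQLite database. It assumes the DBD::SQLite driver is installed (it is not part of core Perl), and the table and column names (query_genes, hfr_genes, name) are invented for illustration. Some Perl is still needed in this sketch to load the rows and print the result, and it makes no claim about speed.

```perl
use strict;
use warnings;
use DBI;    # assumes the DBD::SQLite driver is installed

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical schema: one table per list.
$dbh->do('CREATE TABLE query_genes (name TEXT)');
$dbh->do('CREATE TABLE hfr_genes   (name TEXT)');

my $ins_q = $dbh->prepare('INSERT INTO query_genes (name) VALUES (?)');
my $ins_h = $dbh->prepare('INSERT INTO hfr_genes (name) VALUES (?)');
$ins_q->execute($_) for qw( CD9 STAT3 SOX2 );
$ins_h->execute($_) for qw( LYPLA1 STAT3 LYPLA1 STAT3 SOX2 );

# COUNT(h.name) ignores the NULLs produced by unmatched rows,
# so query genes with no match are reported with a count of 0.
my $rows = $dbh->selectall_arrayref(q{
    SELECT q.name, COUNT(h.name)
    FROM query_genes q
    LEFT OUTER JOIN hfr_genes h ON h.name = q.name
    GROUP BY q.name
    ORDER BY q.name
});

print "Value $_->[0] is present $_->[1] times\n" for @$rows;
$dbh->disconnect;
```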

      "will produce the answer, all at once, with no programming involved, Perl or otherwise."

      Utter baloney! Which if you'd ever tried it, or anything like it, you'd know.

      "Is that a useful possibility here?"

      Will this make his script go faster per his requirement? Of course not!


