Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Optimize a perl block

by IruP (Initiate)
on Sep 06, 2017 at 09:55 UTC ( [id://1198762]=perlquestion: print w/replies, xml ) Need Help??

IruP has asked for the wisdom of the Perl Monks concerning the following question:

2017-10-07 Note added by Athanasius: This node was copied from StackExchange. See also Re: Extract specify values in file log using perl.


hello perlmonks. I would like to optimize the following code:

foreach my $a (@{ $array_ref }) { my $i = 0; foreach my $b (split('-', $a)) { foreach my $c (keys %{ $hash_ref }) { $i += $foo->{$b}->{$c}->{'value'}; } } push @{ $bar->{$i} }, $a; }

This loop take lots of time (about 30s on my server) because array_ref contains few millions of entries, it is not a memory problem but only CPU.

The split function returns only five elements and hash_ref is small (10 keys).

What changes I can do to run it faster !

Replies are listed 'Best First'.
Re: Optimize a perl block
by Eily (Monsignor) on Sep 06, 2017 at 10:25 UTC

    One thing I see is that $hash_ref seems to be constant, so you can compute the list of keys once instead of each time: @keys = keys %$hash_ref; and for my $key (@keys)

    And if several $a values can share $b sub values, you can keep the subsums in cache to avoid recomputing them (untested because you didn't provide sample data):

    use List::Util qw( sum ); my %cache; my @keys = keys %$hash_ref; foreach my $a_ (@{ $array_ref }) { my $i = 0; foreach my $b_ (split('-', $a_)) { $i += $cache{$b_} //= sum map { $_->{value} } @{ $foo->{$b_} } +{@keys}; # Use the cached value or compute the sum } push @{ $bar->{$i} }, $a_; }
    Using the orcish manoeuver, List::Util, and a slice

    I suppose the name are only because this is an example? Even in a example I would avoid $a and $b, (even more so when using List::Util) because of their special meaning. (Edit) And I avoid single char names as a general rule

    Edit $hash_ref->{$b_ } should have been $foo->{$b_}. Thanks kcott

      G'day Eily,

      ++ I tried a few things independently; however, it appears that much of that is very similar to what you've done, so I'll post it here for comparison.

      "my @keys = keys %$hash_ref;" was one of my first thoughts and this appeared to be a definite winner. I also tried inlining the results (now discarded, but it was something like: "sub hrkeys () { keys %$hash_ref }"): that proved to be slower than using "@keys".

      I had code very similar to your "sum map ..."; although, I used sum0. That appeared to be slower (even with the "map EXPR" form I used); I suspect any gains from sum were overshadowed by losses from map; I didn't investigate that any further.

      I didn't think of caching. That's a good idea, and might put "sum(0) map" back in the picture; however, as you stated, that will depend on the OP's data (which hasn't been shown).

      I dummied up some test data (based on the OP's description but, I'm sure, far from representative); ran some basic timings; and included some sanity checking. Here's the code I found to be fastest.

      #!/usr/bin/env perl -l use strict; use warnings; use Time::HiRes 'time'; my $hash_ref; @$hash_ref{'a' .. 'j'} = 1 .. 10; my $array_ref = [ ('v-w-x-y-z') x 2e6 ]; my $foo; my $value = 0; for my $outer ('v' .. 'z') { $foo->{$outer}{$_}{value} = ++$value for 'a' .. 'j'; } my $t0 = time; op_code(); my $t1 = time; printf "op_code: %.6f\n", $t1 - $t0; kens_code(); my $t2 = time; printf "kens_code: %.6f\n", $t2 - $t1; print '*** Compare ***'; printf "kens/op: %.6f%%\n", (($t2 - $t1) / ($t1 - $t0)) * 100; sub op_code { my $bar; foreach my $a (@{ $array_ref }) { my $i = 0; foreach my $b (split('-', $a)) { foreach my $c (keys %{ $hash_ref }) { $i += $foo->{$b}->{$c}->{'value'}; } } push @{ $bar->{$i} }, $a; } print '*** op_code ***'; print "@{[ $_, $#{$bar->{$_}}, $bar->{$_}[0] ]}" for keys %$bar; } sub kens_code { my $bar; my @keys = keys %$hash_ref; for my $outer (@$array_ref) { my $sum; for my $inner (split /-/, $outer) { for (@keys) { $sum += $foo->{$inner}{$_}{value}; } } push @{$bar->{$sum}}, $outer; } print '*** kens_code ***'; print "@{[ $_, $#{$bar->{$_}}, $bar->{$_}[0] ]}" for keys %$bar; }

      The array, with two million elements, took about 30s (so it's roughly comparable to what the OP describes). My code was typically shaving around 25-30% off of this. I ran it quite a few times — here's a fairly representative run.

      *** op_code *** 1275 1999999 v-w-x-y-z op_code: 31.985272 *** kens_code *** 1275 1999999 v-w-x-y-z kens_code: 23.075756 *** Compare *** kens/op: 72.144941%

      By the way, I totally agree with your comments re $a and $b: I only use those as special variables. I'm not completely averse to single-letter variable names, such as $i for a loop index; although, I do cringe when I find them liberally scattered through production code — meaningful names are a much better choice.

      — Ken

        Hi kcott, thanks for your answer :).

        I also tried inlining the results (now discarded, but it was something like: "sub hrkeys () { keys %$hash_ref }"): that proved to be slower than using "@keys".
        The () prototype makes a function a candidate for inlining, but only if the function doesn't have side effects and returns a constant value. keys can both have side effects (when working on a tied hash), and isn't guaranteed to return a constant value so your function couldn't be inlined.

        I didn't think of caching. That's a good idea, and might put "sum(0) map" back in the picture; however, as you stated, that will depend on the OP's data (which hasn't been shown).
        Well I used your code to try, and the runtime is divided by ten. But that's because all the keys in the outer hash are identical, so the the sum is only computed once, and the cache is used everytime after that. I doubt the OP's data is that redundant :). Here's my code BTW:
        sub eilys_code { my $bar; my %cache; my @keys = keys %$hash_ref; for my $outer (@$array_ref) { my $sum; for my $inner (split /-/, $outer) { $sum = $cache{$inner} //= sum map { $_->{value} } @{ $foo->{ +$inner} }{@keys}; } push @{$bar->{$sum}}, $outer; } print '*** kens_code ***'; print "@{[ $_, $#{$bar->{$_}}, $bar->{$_}[0] ]}" for keys %$bar; }

        The fact that with this sample data all the sums are identical makes this test less representative though, because push @{$bar->{$sum}}, $outer; autovivifies an array only once.

        I'm not completely averse to single-letter variable names, such as $i for a loop index
        I think $x, $y and $z as coordinates are fine. I still try to avoid $i though, because most of the time you don't need a loop index in perl, so if one is needed, there's probably a meaningful name to explain why.

Re: Optimize a perl block
by roboticus (Chancellor) on Sep 06, 2017 at 14:55 UTC

    TruP:

    Is this loop executed once, or are you doing it repeatedly during your program? If the latter, it might be that you can factor the code to move some of the processing out of a tight inner loop. If you can show a bit more of your program (with a *little* sample data) people might be able to dig in a little better to find some optimization opportunities.

    Suppose, for example, that your loop was the scoring section of a genetic algorithm and you were recomputing the score once per each generation. If that were the case, and you were changing the genes of only half your population at a time, you could perhaps retain the scores after you compute them, and then skip the calculation for any item that you've already computed a score for:

    my %scored; . . . foreach my $a (@{ $array_ref }) { if (exists $scored{$a}) { # We've already scored this one push @{ $bar->{$scored{$a}}, $a; next; } my $i = 0; foreach my $b (split('-', $a)) { foreach my $c (keys %{ $hash_ref }) { $i += $foo->{$b}->{$c}->{'value'}; } } $scored{$a} = $i; push @{ $bar->{$i} }, $a; }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Optimize a perl block
by llavaud (Novice) on Sep 08, 2017 at 08:15 UTC
    Hello all,

    I originally post this question on stackexchange the 1st of September (link here), it seems IruP "stole" my question and post-it here and so it was closed on stackexchange. Well it is not very serious.

    To return to the subject; after initial advise on stackexchange i have already replaced keys %{ $hash_ref } by my @keys = keys %$hash_ref;, after this modification it runs about 20% more quickly.

    Now i just tested the Eily tricks with $i += $cache{$b_} //= sum map { $_->{value} } @{ $foo->{$b_} }{@keys}; and it is just really impressive, the gain is again almost 60% ! :)

    To be honest i did not understand anything on what is going on and the condensed form of the code do not help me but i will look at your links and try to understand. Meanwhile if you have time to give me a "simple" explanation of how your code works i will take it ! ;)

    More informations about data:
    * array_ref is about 2 millions entries, keys are unique
    * split of $a give 5 elements
    * hash_ref is small, about 10 entries and keys are unique too
      ok i think i got it, finally it is not so hard to understand, i am just not familiar with the "slice" form @{ $foo->{$b_} }{@keys}.

      thanks for your help !

        Hi llavaud, welcome to the monastery. I'm glad that you got it, but feel free to ask if you need more details.

Re: Optimize a perl block
by Anonymous Monk on Sep 06, 2017 at 14:20 UTC
    How big is %$foo? Does $foo->{$b}{$c}{value} always exist, or are you counting on the undef getting converted to a zero? Is $hash_ref really $foo->{$b}?
      %$foo is relatively small, about 100 keys and $foo->{$b}{$c}{value} always exist yes.

      I don't understand what you mean by: Is $hash_ref really $foo->{$b} ?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1198762]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (3)
As of 2024-04-19 22:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found