
for v. map v. pdl

by punkish (Priest)
on May 27, 2011 at 21:56 UTC ( #907074=perlquestion )
punkish has asked for the wisdom of the Perl Monks concerning the following question:

In my quest to understand the speed differences between different techniques for a simple task, I wrote the following, very contrived, script.

    use Benchmark qw(:all);
    use PDL;

    my @runs = (
        [1, 10_000_000],
        [10_000, 1000],
        [1000, 10_000],
        [10_000_000, 1],
    );

    for my $run (@runs) {
        my $count = $run->[0];
        my $loops = $run->[1];
        print "Test with count: $count, loops: $loops\n" . "-" x 50 . "\n";
        timethese($count, {
            'for' => sub { my @out = (); for (1 .. $loops) { push @out, $_ * $_ } },
            'map' => sub { my @out = map { $_ * $_ } (1 .. $loops) },
            'pdl' => sub { my $in = pdl(1 .. $loops); my $out = $in * $in },
        });
    }

The script tries to compare the three techniques on a spectrum from memory-intensive to CPU-intensive tasks. The results are below.

    Test with count: 1, loops: 10000000
    --------------------------------------------------
    Benchmark: timing 1 iterations of for, map, pdl...
           for:  2 wallclock secs ( 2.12 usr + 0.29 sys = 2.41 CPU) @ 0.41/s (n=1)
                (warning: too few iterations for a reliable count)
           map:  5 wallclock secs ( 3.86 usr + 1.15 sys = 5.01 CPU) @ 0.20/s (n=1)
                (warning: too few iterations for a reliable count)
           pdl: 17 wallclock secs ( 2.83 usr + 0.70 sys = 3.53 CPU) @ 0.28/s (n=1)
                (warning: too few iterations for a reliable count)
    Test with count: 10000, loops: 1000
    --------------------------------------------------
    Benchmark: timing 10000 iterations of for, map, pdl...
           for: 2 wallclock secs ( 1.86 usr + 0.00 sys = 1.86 CPU) @ 5376.34/s (n=10000)
           map: 3 wallclock secs ( 3.37 usr + 0.00 sys = 3.37 CPU) @ 2967.36/s (n=10000)
           pdl: 3 wallclock secs ( 2.15 usr + 0.01 sys = 2.16 CPU) @ 4629.63/s (n=10000)
    Test with count: 1000, loops: 10000
    --------------------------------------------------
    Benchmark: timing 1000 iterations of for, map, pdl...
           for: 1 wallclock secs ( 1.84 usr + 0.00 sys = 1.84 CPU) @ 543.48/s (n=1000)
           map: 4 wallclock secs ( 3.39 usr + 0.00 sys = 3.39 CPU) @ 294.99/s (n=1000)
           pdl: 2 wallclock secs ( 1.98 usr + 0.01 sys = 1.99 CPU) @ 502.51/s (n=1000)
    Test with count: 10000000, loops: 1
    --------------------------------------------------
    Benchmark: timing 10000000 iterations of for, map, pdl...
           for: 10 wallclock secs ( 8.82 usr + 0.01 sys = 8.83 CPU) @ 1132502.83/s (n=10000000)
           map: 7 wallclock secs ( 6.40 usr + -0.01 sys = 6.39 CPU) @ 1564945.23/s (n=10000000)
           pdl: 201 wallclock secs (198.35 usr + 0.28 sys = 198.63 CPU) @ 50344.86/s (n=10000000)

Some aspects of the above are not surprising. There is considerable overhead, I am guessing, in setting up a piddle, so creating lots of small piddles many times is expensive. But I was expecting, perhaps wrongly, that map would always be cheaper than for. I don't know why I believed that looping is always expensive; now I wonder whether, under the hood, map is really looping, not just effectively doing so.
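For what it's worth, map really is a loop under the hood; the main difference is that its block's results are collected into a list that map returns. A minimal core-Perl sketch (no PDL involved) showing the two constructs compute identical results:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # map is, in effect, a loop that also collects the results of its
    # block into a list; these two produce the same squares.
    my $n = 5;

    my @with_for;
    for my $i (1 .. $n) {
        push @with_for, $i * $i;
    }

    my @with_map = map { $_ * $_ } 1 .. $n;

    print "@with_for\n";   # 1 4 9 16 25
    print "@with_map\n";   # 1 4 9 16 25

Which one is faster for a given workload depends on the perl version and the cost of growing the output list, which is exactly what the benchmark above is probing.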

Please explain, gently, if possible. Thanks in advance.

Bonus question -- what exactly is the difference between the usr and the sys seconds in the Benchmark reported timings?

Note: In the PDL test I am not really ending up with an @out array; instead, I am getting a piddle. If I convert $out to @out using PDL's list, it just kills PDL's performance.
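For anyone trying to reproduce that note, a minimal sketch (assuming PDL is installed) of the piddle-to-list conversion; the list call copies every element back into ordinary Perl scalars, an O(n) step that forfeits much of PDL's advantage:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use PDL;

    my $in  = pdl(1 .. 5);
    my $out = $in * $in;    # stays a piddle: compact, C-level storage

    # list() flattens the piddle into a Perl list -- the expensive part
    my @out = $out->list;
    print "@out\n";         # 1 4 9 16 25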

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: for v. map v. pdl
by davido (Archbishop) on May 28, 2011 at 07:17 UTC

    The tests do a poor job of isolating any pure differences in the execution speed of map, for, and pdl. Too much of the time difference can be attributed to other factors, such as memory allocation.

    But I see another issue here. If profiling determines that loops are your bottleneck, there are a number of approaches which may or may not be applicable to your situation:

    • Find a better algorithm. If the computational cost, as described in Big-O notation, can be reduced from one order of growth to a lesser one, you win the battle. But better algorithms may not exist, in which case....
    • Approximation: Is it acceptable to generate an approximation that is an order of magnitude less computationally intensive?
    • Limit: Is it acceptable to generate only a portion of the solution? (Databases, for example, may use a "LIMIT 100" clause to avoid grabbing thousands of rows.)
    • Parallel: Can you fork four processes, each running on a separate core in a 4-core processor, each working on a portion of the solution?
    • Caching: Can you generate the solution once and store it for future use?
    • Distributed: Can you distribute the task among a larger group of machines, each working on a portion of the solution? (This is similar to parallel, but on a macro scale rather than a micro scale, to borrow terms from Economics).
    • Get closer to the machine: Use Inline::C for optimized execution of a few lines. However, Perl is already well optimized, albeit usually in generically applicable ways. If you can write C code that optimizes better for your specific case, you may gain efficiency. Note, however, that something that doesn't scale well in Perl probably doesn't scale well in C either. Think of it this way: if one car has a top speed of 70 mph and another 80 mph, your 80-mile drive will take 1 hr 8 min in the slow car and 1 hr in the fast car. Meanwhile, Captain Kirk beamed himself there in 5 seconds because Spock developed a better algorithm.
    • Computer speed: Along the same lines of "Coding in C" is "Get a faster computer." This may gain you a little, but there's a limit to how much you can get out of it.
    • And way down at the bottom of the list comes deciding between 'for' and 'map'.

    This list is in no particular order, except that I listed 'finding a better algorithm' first because it's the purest option from a computer science standpoint. And I listed "choose between 'for' and 'map'" last, as I see it as the least rewarding (and least pure from a CS view) option.

    I guess what I'm saying is that selecting 'for' versus 'map' is such a limited gain that it's really a last resort, particularly since there's no guarantee that a future version of Perl won't improve optimizations on one more than the other in a way that breaks your original assumptions. If the efficiency difference between 'map' and 'for' is not defined, it's not guaranteed to stay the same.
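    As an illustration of the caching bullet above, a minimal memoization sketch in core Perl (the function and hash names here are made up for the example):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Generate once, store for reuse: results of an expensive
        # function are memoized in a hash keyed by argument.
        my %cache;

        sub expensive_square {
            my ($n) = @_;
            return $cache{$n} //= do {
                # imagine something far costlier than a multiply here
                $n * $n;
            };
        }

        print expensive_square(12), "\n";   # computed: 144
        print expensive_square(12), "\n";   # served from %cache: 144

    The CPAN module Memoize packages this pattern up generically, at the cost of a wrapper call per invocation.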


Re: for v. map v. pdl
by BrowserUk (Pope) on May 27, 2011 at 23:02 UTC
    what exactly is the difference between the usr and the sys seconds in the Benchmark reported timings?

    In this particular example, the thing that will account for most of the system time is the program requesting more virtual memory (mostly heap) from the OS.

    This is why the single loop over a large array spends a substantially larger proportion of the total time in system time, relative to the large number of iterations over a small array.

    In the former, the large amount of memory is allocated once and used once.

    In the latter, the smaller amount of memory is allocated once for the first run and then reused many times.
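    A small core-Perl sketch of that allocate-once-then-reuse pattern; clearing an array with an empty list empties it but keeps the storage perl has already grown, so later passes avoid fresh heap requests:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @buf;

        for my $pass (1 .. 3) {
            @buf = ();                          # empties @buf, keeps its capacity
            push @buf, $_ * $_ for 1 .. 1000;   # later passes reuse that memory
        }

        print scalar(@buf), "\n";   # 1000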

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: for v. map v. pdl
by Anonymous Monk on May 27, 2011 at 22:22 UTC
    Bonus question -- what exactly is the difference between the usr and the sys seconds in the Benchmark reported timings?


    CPU seconds is, in UNIX terms, the user time plus the system time of the process itself, as opposed to the real (wallclock) time and the time spent by the child processes.
    The term 'user CPU time' can be a bit misleading at first. To be clear, the total time (real CPU time) is the combination of the amount of time the CPU spends performing some action for a program and the amount of time the CPU spends performing system calls for the kernel on the program's behalf. When a program loops through an array, it is accumulating user CPU time. Conversely, when a program executes a system call such as exec or fork, it is accumulating system CPU time.
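    The same user/system split Benchmark reports is available directly from Perl's times builtin; a small sketch:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # times() returns (user, system, child user, child system) CPU
        # seconds for this process -- the usr/sys split Benchmark prints.
        my ($u0, $s0) = times();

        my $sum = 0;
        $sum += $_ for 1 .. 1_000_000;   # pure computation: accrues user time

        my ($u1, $s1) = times();
        printf "usr: %.2fs sys: %.2fs\n", $u1 - $u0, $s1 - $s0;

    A loop like the one above spends almost all of its time in user mode; a program making many allocations or syscalls would shift the balance toward sys.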

Node Type: perlquestion [id://907074]
Approved by ww
Front-paged by ww