http://www.perlmonks.org?node_id=972137

JohnRS has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I seek your wisdom.

I have observed something odd regarding multiprocessing performance on Windows. When I run the test below, it seems that there is a huge amount of process switching overhead. When I run the same test on a Linux server, it runs as expected (almost no overhead). Here are the results.

############################################################
#
# ithreads on Centos Linux, 64 bit, 8 CPU's
# Perl v5.10.1 built for x86_64-linux-thread-multi
#   Threads  Clock   CPU  ==>  Speed  Overhead
#   -------  -----  ----       -----  --------
#         1   18.2  18.2  ==>   1.0x        0%
#         2    9.1  18.2  ==>   2.0x        0%
#         3    6.2  18.3  ==>   2.9x        1%
#         5    3.7  18.2  ==>   4.9x        0%
#         8    2.4  18.4  ==>   7.6x        1%
#
# ithreads on my Windows 7, 64 bit, 8 CPU's
# Perl v5.12.3 built for MSWin32-x86-multi-thread
#   Threads  Clock   CPU  ==>  Speed  Overhead
#   -------  -----  ----       -----  --------
#         1   25.0  25.0  ==>   1.0x        0%
#         2   14.6  28.1  ==>   1.7x       12%
#         3   12.9  37.0  ==>   1.9x       48%
#         5    9.9  47.8  ==>   2.5x       91%
#         8    8.2  62.1  ==>   3.0x      148%
#
############################################################

Running a single child process establishes a baseline, 1.0x speed at 0% overhead. With Linux, running 5 processes, I see a 4.9x speed improvement with less than 1% overhead. Very good. But with Windows, running 5 processes, I see only a 2.5x speed improvement with about 91% overhead! In other words, the speed improvement was only about half of what it should have been and the CPU time almost doubled. What was the CPU doing this extra 91% of the time?
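The post does not spell out how the Speed and Overhead columns were calculated. The sketch below assumes Speed is the baseline clock time divided by the clock time with N threads, and Overhead is the growth in CPU time relative to the baseline. The function name `speedup_and_overhead` is made up for illustration. These formulas do reproduce the Windows figures in the table.

```perl
use strict;
use warnings;

# Assumed formulas (not stated in the post, but consistent with the table):
#   Speed    = baseline clock time / clock time with N threads
#   Overhead = CPU time with N threads / baseline CPU time - 1
sub speedup_and_overhead {
    my ($base_clock, $base_cpu, $clock, $cpu) = @_;
    my $speed    = $base_clock / $clock;
    my $overhead = ($cpu / $base_cpu - 1) * 100;    # in percent
    return ($speed, $overhead);
}

# Windows measurements from the table: [threads, clock seconds, CPU seconds]
my @windows = (
    [1, 25.0, 25.0],
    [2, 14.6, 28.1],
    [3, 12.9, 37.0],
    [5,  9.9, 47.8],
    [8,  8.2, 62.1],
);

# The single-thread row is the baseline.
my (undef, $base_clock, $base_cpu) = @{ $windows[0] };

for my $row (@windows) {
    my ($threads, $clock, $cpu) = @$row;
    my ($speed, $overhead) =
        speedup_and_overhead($base_clock, $base_cpu, $clock, $cpu);
    printf "%d threads: %.1fx speed, %.0f%% overhead\n",
        $threads, $speed, $overhead;
}
```

For the 5-thread row this gives 25.0 / 9.9, about 2.5x, and 47.8 / 25.0 - 1, about 91%, matching the posted values.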

I realize that the test results aren't very accurate (about 10%). I ran them on live, but mostly idle, machines. The deviations in the Windows results are much more than 10%, however, so I think that they are relevant. Here is the test code.

use strict;
use warnings;
use threads;
use Time::HiRes 'time';

my $nr_children = 1;
my @threads;

my $start = time;
foreach my $i (1 .. $nr_children) {
    $threads[$i] = threads->create(\&Work, $i);
}
foreach my $i (1 .. $nr_children) {
    $threads[$i]->join();
}
my $stop = time - $start;

printf "\nclock: %.1f sec\n", $stop;
my @run = times;
printf "user: %.1f sec\n", $run[0];
exit;

#####
sub Work {
    my ($i) = @_;
    foreach ( 1 .. (20e5/$nr_children) ) {
        my $acct_nrs = "abc\txyz\tdef\tabc\tghi\tghi";
        my @temp = split(m/\t/, $acct_nrs, -1);
        @temp = ( sort keys %{{ map { $_ => 1 } @temp }} );
        my $ans = join(', ', @temp);
    }
    print " $i";
    return;
}

The processes run compute bound and keep all 8 CPU's (when using 8 child processes) at 100% simultaneously, both on Windows and Linux. There is no I/O (except one print at the end), no blocking, no locking, and no shared memory. The processes last long enough that the setup time shouldn't be very important. Thus I'm left thinking that the overhead would be due to process switching by the operating system.

This test uses ithreads. I also ran a similar test using forks, and the results on both Linux and Windows were almost identical to the ithreads results.
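The fork-based test was not posted, so the following is only a guess at what such a variant might look like. The names `work` and `run_children` and the command-line arguments are invented for illustration. Note that on Windows, Perl emulates `fork` with interpreter threads inside a single process (see perlfork), which would be consistent with the fork and ithreads results matching there.

```perl
use strict;
use warnings;
use Time::HiRes 'time';

# Hypothetical fork-based variant of the benchmark (the original was not posted).
my $nr_children = $ARGV[0] // 2;
# The original test used 20e5 iterations in total; a smaller default
# keeps this sketch quick. Pass 2000000 to match the original workload.
my $iterations  = $ARGV[1] // 2e5;

# Same compute-bound loop as the threaded test.
sub work {
    my ($count) = @_;
    foreach ( 1 .. $count ) {
        my $acct_nrs = "abc\txyz\tdef\tabc\tghi\tghi";
        my @temp = split(m/\t/, $acct_nrs, -1);
        @temp = ( sort keys %{{ map { $_ => 1 } @temp }} );
        my $ans = join(', ', @temp);
    }
    return;
}

# Fork the children, give each an equal share of the work,
# and return how many of them were successfully reaped.
sub run_children {
    my ($children, $total) = @_;
    my @pids;
    foreach my $i ( 1 .. $children ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {              # child: do its share, then exit
            work($total / $children);
            exit 0;
        }
        push @pids, $pid;             # parent: remember the child
    }
    my $reaped = 0;
    foreach my $pid (@pids) {
        $reaped++ if waitpid($pid, 0) == $pid;
    }
    return $reaped;
}

my $start  = time;
my $reaped = run_children($nr_children, $iterations);
printf "children: %d  clock: %.1f sec\n", $reaped, time - $start;
```

Run it with the number of children as the first argument, for example `perl forktest.pl 4 2000000`, where `forktest.pl` is a placeholder file name.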

I realize that if the processes were normally blocked this wouldn't be as big an issue. But my job is compute bound. So event loops (POE, Coro, etc) wouldn't help. Not even POE's "Wheel", which uses fork, from what I read.

In summary, my questions are: 1) Is my test valid? 2) Is my conclusion valid? 3) Is there a way to get better multiprocessing performance on Windows?

Thanks, John.

Replies are listed 'Best First'.
Re: Multiprocessing on Windows (Cannot reproduce!)
by BrowserUk (Patriarch) on May 24, 2012 at 01:12 UTC
    In summary, my questions are:

    Are you running Windows in a VM? That's the only thing that comes to mind that might explain your results. I cannot reproduce them at all.

    I only have 4 cores, and the results from your script (slightly tweaked) show an almost perfect split of processing:

    #! perl -slw
    use strict;
    use threads;
    use Time::HiRes 'time';

    our $T //= 4;

    my @threads;
    my $start = time;
    foreach my $i (1 .. $T ) {
        $threads[$i] = threads->create(\&Work, $i);
    }
    foreach my $i ( 1 .. $T ) {
        $threads[$i]->join();
    }
    my $stop = time - $start;

    printf "\nclock: %f sec user: %f\n", $stop, (times())[0];
    exit;

    #####
    sub Work {
        my ($i) = @_;
        foreach ( 1 .. ( 20e5 / $T ) ) {
            my $acct_nrs = "abc\txyz\tdef\tabc\tghi\tghi";
            my @temp = split(m/\t/, $acct_nrs, -1);
            @temp = ( sort keys %{{ map { $_ => 1 } @temp }} );
            my $ans = join(', ', @temp);
        }
        printf " $i";
        return;
    }
    __END__
    C:\test>for /l %i in (1,1,4) do @972137 -T=%i
     1
    clock: 29.351637 sec user: 29.093000
     1 2
    clock: 14.986346 sec user: 29.765000
     1 2 3
    clock: 10.131188 sec user: 29.968000
     2 3 4 1
    clock: 7.781729 sec user: 29.796000

    Update: Ditto for 5.14:

    C:\test>for /l %i in (1,1,4) do @\perl64-14\bin\perl.exe -slw 972137.pl -T=%i
     1
    clock: 27.309776 sec user: 27.343000
     2 1
    clock: 13.878018 sec user: 27.625000
     2 1 3
    clock: 9.370494 sec user: 27.875000
     3 2 1 4
    clock: 7.205594 sec user: 27.765000


      No, I'm just running straight under Windows 7.

      When I run your version I get the same results as I did originally. So perhaps there is something strange going on with my machine. I will tear into it and see what I can find.

      I appreciate your help!

        Followup: I figured it out. It wasn't anything wrong with my machine. It's just how it works.

        I'm running an i7 840QM processor in a laptop. When only one or two CPUs are busy, Turbo mode kicks in and boosts the speed by about 71%. So my "baseline" measurement of 25 seconds should really have been corrected to 43 seconds. Indeed, this matches almost perfectly with my 4-thread test results of 10.8 seconds clock time and 42.1 seconds user time.

        Then the amount of CPU power changes. Hyper-Threading doesn't really give you twice as much crunch power. It depends on what you are doing, but in this case it gave me about 50% more. Again, this explains why the CPU time went from 42 to 62 seconds (about 50% more) between the 4-thread and 8-thread tests, rather than remaining constant.
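        The arithmetic behind these two corrections can be checked in a few lines. This is only a sketch: the 71% Turbo boost is the poster's own estimate, the timings are the ones quoted in this thread, and the helper name `corrected_baseline` is invented for illustration.

        ```perl
        use strict;
        use warnings;

        # Figures quoted in the thread; the Turbo boost is the poster's estimate.
        my $turbo_boost   = 0.71;   # extra speed when only 1-2 CPUs are busy
        my $measured_base = 25.0;   # 1-thread clock seconds, with Turbo active
        my $cpu_4_threads = 42.1;   # user seconds, 4-thread test
        my $cpu_8_threads = 62.1;   # CPU seconds, 8-thread test

        # What the single-thread run would have taken without the Turbo boost.
        sub corrected_baseline {
            my ($measured, $boost) = @_;
            return $measured * (1 + $boost);
        }

        my $baseline = corrected_baseline($measured_base, $turbo_boost);
        printf "corrected baseline: %.1f sec\n", $baseline;

        # Growth in CPU time when going from 4 to 8 threads (Hyper-Threading).
        my $growth = ($cpu_8_threads / $cpu_4_threads - 1) * 100;
        printf "CPU time growth from 4 to 8 threads: %.0f%%\n", $growth;
        ```

        The corrected baseline comes out near 43 seconds, close to the 42.1 seconds of user time in the 4-thread run, and the CPU time growth from 4 to 8 threads is a little under 50%.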

        Heat is definitely a bummer and it's worse in my laptop than it would be in a desktop or server. So overall I get about a 5.5x speed increase when running 8 threads instead of the full 8x increase.

        Fortunately, I'll be running the actual job on a real server. I tested it and I see about a 7.8x speed increase when running the 8 thread test on it. So all is well!

        Thank you again for your help. It put me on the right path to understand what was going on.