Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re^5: Recursive locks:killer application. Do they have one? (counting overhead)

by tye (Cardinal)
on Feb 03, 2012 at 04:35 UTC ( #951599=note: print w/ replies, xml ) Need Help??


in reply to Re^4: Recursive locks:killer application. Do they have one? (mu)
in thread Recursive locks:killer application. Do they have one?

Yes, counting requires a tiny bit of space for the counter. Obviously. Even if that doubled the space overhead of the mutex, it would be quite the rare situation when that would matter to me.

The overhead of the counting indeed looks pretty trivial to me. We are only talking about the following:

... I32 locks; ... lock->locks++; ... lock->locks = 1; ...

The other "overhead" you show is so locking can block, so Perl can have a single interface over multiple implementations of blocking locking on multiple operating systems, so Linux can implement a portable blocking locking interface over the current choice of kernel implementation, so Linux can detect deadlock loops, so Linux users can select different types of "wake order" behavior, so somebody can adjust "spin" to reduce context switches in certain scenarios, etc.

Which, when you realise that a non-recursive lock can be built atop a single bit, starts to look just a little indulgent.

I don't see how one can implement blocking locking using a single bit. I certainly see some overhead that could be eliminated for the particular environment you are looking at. I suspect such would require more 'ifdef' work and thus create more complexity at other layers (assuming that dropping support for other platforms is not allowed) for the sake of conditionally reducing some run-time complexity in some environments. It is possible that such might even have a noticeable benefit in such environments.

Getting rid of the counting there isn't going to make much difference. But surely you are instead implementing a new alternative, so not bothering to implement counting seems worth considering. I don't even see any advantage to exposing an API that would interfere with deciding to add support for counting at some later date (which means not bothering to implement it up front is probably wise).

- tye        


Comment on Re^5: Recursive locks:killer application. Do they have one? (counting overhead)
Download Code
Re^6: Recursive locks:killer application. Do they have one? (bitlock)
by BrowserUk (Pope) on Feb 03, 2012 at 16:27 UTC
    I don't see how one can implement blocking locking using a single bit.

    Here you go:

    #include <windows.h> #include <stdio.h> #include <time.h> #include <process.h> typedef struct { void *protected; int loops; } args; void lock( void *protected ) { while( _interlockedbittestandset64( (__int64*)protected, 0 ) ) { Sleep( 1 ); } } void unlock( void *protected ) { _interlockedbittestandreset64( (__int64*)protected, 0 ); } void worker( void *arg ) { args *a = (args*)arg; int i = 0; for( i=0; i < a->loops; ++i ) { lock( a->protected ); *( (int*)a->protected ) += 2; unlock( a->protected ); } return; } void main( int argc, char **argv ) { int i = 0, nThreads = 4; clock_t start, finish; double elapsed; uintptr_t threads[32]; int shared = 0; args a = { (void *)&shared, 1000000 };; if( argc > 1 ) nThreads = atol( argv[1] ); if( argc > 2 ) a.loops = atol( argv[2] ); printf( "threads:%d loops:%d\n", nThreads, a.loops ); start = clock(); for( i=0; i < nThreads; ++i ) threads[ i ] = _beginthread( &worker, 0, &a ); WaitForMultipleObjects( nThreads, (HANDLE*)&threads, 1, INFINITE ) +; finish = clock(); elapsed = (double)(finish - start) / CLOCKS_PER_SEC; printf( "count: %lu time:%.6f\n", shared, elapsed ); }

    And a run with 32 threads all contending to add 2 to a shared integer 1 million times each:

    C:\test\lockfree>bitlock 32 1000000 threads:32 loops:1000000 count: 64000000 time:1.332000

    And implemented using the simplest primitive possible -- one that will be available in some form on any modern processor.

    The other "overhead" you show is so locking can block, so Perl can have a single interface over multiple implementations of blocking locking on multiple operating systems, so Linux can implement a portable blocking locking interface over the current choice of kernel implementation, so Linux can detect deadlock loops, so Linux users can select different types of "wake order" behavior, so somebody can adjust "spin" to reduce context switches in certain scenarios, etc

    And therein lies the rub. Perl implements it own recursive locking in terms of pthreads 0.1 primitives. But those "speced" pthreads primitives have long since been superseded on every modern *nix system by vastly more efficient effective and flexible primitives -- eg. futexes -- which already have recursive capabilities.

    And then on other platforms -- ie. windows -- the pthreads 0.1 primitives are clumsily emulated using oldest, least effective OS primitives.

    Everyone, everywhere is getting big, slow, clumsy emulations of a defunct standard instead of being able to use the modern, efficient, effective mechanisms that have evolved since the pthreads api was frozen in stone.

    And all those "so Linux users can" and "so somebody can" are pie-in-the sky, what-ifs and maybes that can never happen for perl users anywhere. Typical, lowest common denominator stuff.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I guess you can call that blocking locking. I call it a spin lock. And it likely requires a bus lock which can be a bad choice if you want high concurrency.

      Nice to see you actually get to the point on efficiency. Sounds like stuff that is worth patching. (Of course, most of that has pretty much nothing to do with counting.)

      Peace.

      - tye        

        I guess you can call that blocking locking. I call it a spin lock.

        I take it that when you said "blocking lock", you really meant a 'waitable lock'.

        Which are required for some uses, but not by any means all. And they suffer from the need for kernel involvement.

        And it likely requires a bus lock which can be a bad choice if you want high concurrency.

        How high is "high concurrency"?

        Here are the results of runs ranging from 1 to 4096 contending threads performing 100 million additions between them:

        C:\test\lockfree>bitlock 1 102400000 threads: 1 loops: 102400000 count: 204800000 time: 4.071000 proce +ssor cycles: 9748471986 C:\test\lockfree>bitlock 4 102400000 threads: 4 loops: 25600000 count: 204800000 time: 4.093000 proce +ssor cycles: 9788765607 C:\test\lockfree>bitlock 16 102400000 threads: 16 loops: 6400000 count: 204800000 time: 4.123000 proce +ssor cycles: 9859762107 C:\test\lockfree>bitlock 64 102400000 threads: 64 loops: 1600000 count: 204800000 time: 4.240000 proce +ssor cycles: 10210001571 C:\test\lockfree>bitlock 256 102400000 threads: 256 loops: 400000 count: 204800000 time: 5.077000 proce +ssor cycles: 11839214313 C:\test\lockfree>bitlock 1024 102400000 threads: 1024 loops: 100000 count: 204800000 time: 5.564000 proce +ssor cycles: 23777289990 C:\test\lockfree>bitlock 4096 102400000 threads: 4096 loops: 25000 count: 204800000 time: 8.286000 proce +ssor cycles: 55413048789

        I think that whether you consider clock time or cpu cycles, that scales pretty damn well, regardless of any bus locks involved.

        For contrast, here are a couple of runs of the same code, same count, but using a waitable (OS) mutex:

        C:\test\lockfree>mutex2 1 102400000 threads: 1 loops: 102400000 count: 204800000 time:73.739000 proce +ssor cycles: 176641186086 C:\test\lockfree>mutex2 4 102400000 threads: 4 loops: 25600000 count: 204800000 time:631.779000 proc +essor cycles: 1247492094759

        Now given the shear scale of the lack of scalability here I hope you'll forgive me for not finishing that series. So to get an indication at higher contention levels, I'll drop the count by an order of magnitude:

        C:\test\lockfree>mutex2 1 10240000 threads: 1 loops: 10240000 count: 20480000 time: 7.412000 proce +ssor cycles: 17760824082 C:\test\lockfree>mutex2 4 10240000 threads: 4 loops: 2560000 count: 20480000 time:63.428000 proce +ssor cycles: 126683354358 C:\test\lockfree>mutex2 16 10240000 threads: 16 loops: 640000 count: 20480000 time:61.973000 proce +ssor cycles: 122496755463 C:\test\lockfree>mutex2 64 10240000 threads: 64 loops: 160000 count: 20480000 time:61.520000 proce +ssor cycles: 121312774089 C:\test\lockfree>mutex2 256 10240000 threads: 256 loops: 40000 count: 20480000 time:63.874000 proce +ssor cycles: 127362344526 C:\test\lockfree>mutex2 1024 10240000 threads: 1024 loops: 10000 count: 20480000 time:73.238000 proce +ssor cycles: 148526579529 C:\test\lockfree>mutex2 4096 10240000 threads: 4096 loops: 2500 count: 20480000 time:86.104000 proce +ssor cycles: 178245340731

        Whilst the scaling doesn't look too bad -- at least once you get over the hump of having some contention -- it is certainly no better than the spinloop. And not only is it doing 1/10 the work, it is taking 10 times longer to do it.

        But the real kicker is the cpu usage. Doing 1/10th the work using the waitable lock requires 3 times the clock cycles of the spinlock. So much for the efficiency of not spinning.

        Ultimately, whether the spinlock provokes bus locks or not is irrelevant, because they are still vastly cheaper than ring 3-0-3 transitions.

        I'll post the code for the above if asked.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://951599]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (2)
As of 2014-09-20 22:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (163 votes), past polls