http://www.perlmonks.org?node_id=951676


in reply to Re^5: Recursive locks:killer application. Do they have one? (counting overhead)
in thread Recursive locks:killer application. Do they have one?

I don't see how one can implement blocking locking using a single bit.

Here you go:

#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    void *protected;
    int loops;
} args;

void lock( void *protected ) {
    /* Bit 0 of the protected word is the lock; spin until we set it. */
    while( _interlockedbittestandset64( (__int64*)protected, 0 ) )
        Sleep( 1 );
}

void unlock( void *protected ) {
    _interlockedbittestandreset64( (__int64*)protected, 0 );
}

void worker( void *arg ) {
    args *a = (args*)arg;
    int i;
    for( i = 0; i < a->loops; ++i ) {
        lock( a->protected );
        *( (__int64*)a->protected ) += 2;   /* +2 leaves the lock bit alone */
        unlock( a->protected );
    }
}

int main( int argc, char **argv ) {
    int i, nThreads = 4;
    clock_t start, finish;
    double elapsed;
    uintptr_t threads[ 32 ];
    __int64 shared = 0;     /* 64 bits wide to match the 64-bit interlocked ops */
    args a = { (void*)&shared, 1000000 };

    if( argc > 1 ) nThreads = atol( argv[1] );
    if( argc > 2 ) a.loops  = atol( argv[2] );
    printf( "threads:%d loops:%d\n", nThreads, a.loops );

    start = clock();
    for( i = 0; i < nThreads; ++i )
        threads[ i ] = _beginthread( &worker, 0, &a );
    WaitForMultipleObjects( nThreads, (HANDLE*)threads, 1, INFINITE );
    finish = clock();

    elapsed = (double)(finish - start) / CLOCKS_PER_SEC;
    printf( "count: %I64d time:%.6f\n", shared, elapsed );
    return 0;
}

And a run with 32 threads all contending to add 2 to a shared integer 1 million times each:

C:\test\lockfree>bitlock 32 1000000
threads:32 loops:1000000
count: 64000000 time:1.332000

And implemented using the simplest primitive possible -- one that will be available in some form on any modern processor.

The other "overhead" you show is so locking can block, so Perl can have a single interface over multiple implementations of blocking locking on multiple operating systems, so Linux can implement a portable blocking locking interface over the current choice of kernel implementation, so Linux can detect deadlock loops, so Linux users can select different types of "wake order" behavior, so somebody can adjust "spin" to reduce context switches in certain scenarios, etc

And therein lies the rub. Perl implements its own recursive locking in terms of pthreads 0.1 primitives. But those "spec'd" pthreads primitives have long since been superseded on every modern *nix system by vastly more efficient, effective and flexible primitives -- e.g. futexes -- which already have recursive capabilities.

And then on other platforms -- i.e. Windows -- the pthreads 0.1 primitives are clumsily emulated using the oldest, least effective OS primitives.

Everyone, everywhere is getting big, slow, clumsy emulations of a defunct standard instead of being able to use the modern, efficient, effective mechanisms that have evolved since the pthreads api was frozen in stone.

And all those "so Linux users can" and "so somebody can" are pie-in-the-sky what-ifs and maybes that can never happen for Perl users anywhere. Typical, lowest-common-denominator stuff.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^7: Recursive locks:killer application. Do they have one? (bitlock)
by tye (Sage) on Feb 03, 2012 at 17:12 UTC

    I guess you can call that blocking locking. I call it a spin lock. And it likely requires a bus lock which can be a bad choice if you want high concurrency.

    Nice to see you actually get to the point on efficiency. Sounds like stuff that is worth patching. (Of course, most of that has pretty much nothing to do with counting.)

    Peace.

    - tye        

      I guess you can call that blocking locking. I call it a spin lock.

      I take it that when you said "blocking lock", you really meant a 'waitable lock'.

      Which are required for some uses, but not by any means all. And they suffer from the need for kernel involvement.

      And it likely requires a bus lock which can be a bad choice if you want high concurrency.

      How high is "high concurrency"?

      Here are the results of runs ranging from 1 to 4096 contending threads performing 100 million additions between them:

      C:\test\lockfree>bitlock 1 102400000
      threads:   1 loops: 102400000 count: 204800000 time: 4.071000 processor cycles:  9748471986

      C:\test\lockfree>bitlock 4 102400000
      threads:   4 loops:  25600000 count: 204800000 time: 4.093000 processor cycles:  9788765607

      C:\test\lockfree>bitlock 16 102400000
      threads:  16 loops:   6400000 count: 204800000 time: 4.123000 processor cycles:  9859762107

      C:\test\lockfree>bitlock 64 102400000
      threads:  64 loops:   1600000 count: 204800000 time: 4.240000 processor cycles: 10210001571

      C:\test\lockfree>bitlock 256 102400000
      threads: 256 loops:    400000 count: 204800000 time: 5.077000 processor cycles: 11839214313

      C:\test\lockfree>bitlock 1024 102400000
      threads:1024 loops:    100000 count: 204800000 time: 5.564000 processor cycles: 23777289990

      C:\test\lockfree>bitlock 4096 102400000
      threads:4096 loops:     25000 count: 204800000 time: 8.286000 processor cycles: 55413048789

      I think that whether you consider clock time or cpu cycles, that scales pretty damn well, regardless of any bus locks involved.

      For contrast, here are a couple of runs of the same code, same count, but using a waitable (OS) mutex:

      C:\test\lockfree>mutex2 1 102400000
      threads:   1 loops: 102400000 count: 204800000 time: 73.739000 processor cycles:  176641186086

      C:\test\lockfree>mutex2 4 102400000
      threads:   4 loops:  25600000 count: 204800000 time:631.779000 processor cycles: 1247492094759

      Now, given the sheer scale of the lack of scalability here, I hope you'll forgive me for not finishing that series. So, to get an indication at higher contention levels, I'll drop the count by an order of magnitude:

      C:\test\lockfree>mutex2 1 10240000
      threads:   1 loops: 10240000 count: 20480000 time:  7.412000 processor cycles:  17760824082

      C:\test\lockfree>mutex2 4 10240000
      threads:   4 loops:  2560000 count: 20480000 time: 63.428000 processor cycles: 126683354358

      C:\test\lockfree>mutex2 16 10240000
      threads:  16 loops:   640000 count: 20480000 time: 61.973000 processor cycles: 122496755463

      C:\test\lockfree>mutex2 64 10240000
      threads:  64 loops:   160000 count: 20480000 time: 61.520000 processor cycles: 121312774089

      C:\test\lockfree>mutex2 256 10240000
      threads: 256 loops:    40000 count: 20480000 time: 63.874000 processor cycles: 127362344526

      C:\test\lockfree>mutex2 1024 10240000
      threads:1024 loops:    10000 count: 20480000 time: 73.238000 processor cycles: 148526579529

      C:\test\lockfree>mutex2 4096 10240000
      threads:4096 loops:     2500 count: 20480000 time: 86.104000 processor cycles: 178245340731

      Whilst the scaling doesn't look too bad -- at least once you get over the hump of having some contention -- it is certainly no better than the spinloop. And not only is it doing 1/10 the work, it is taking 10 times longer to do it.

      But the real kicker is the cpu usage. Doing 1/10th the work using the waitable lock requires 3 times the clock cycles of the spinlock. So much for the efficiency of not spinning.

      Ultimately, whether the spinlock provokes bus locks or not is irrelevant, because they are still vastly cheaper than ring 3-0-3 transitions.

      I'll post the code for the above if asked.

