in reply to Re^10: Use perl type without perl in thread Use perl type without perl
First, thank you for arguing with numbers. It is a rare event and most welcome.
But -- you knew that was coming, right? -- your benchmark:
- Has to use a huge multiplier -- 1 billion iterations -- to make a mountain out of a molehill.
Let's say the total runtime was the upper of your vague estimate. 8 seconds.
Which means:
- 'cxt' took 1.5 seconds for 1e9 operations = 0.0000000015 s/iteration.
- 'noctx' took 6.5 seconds for 1e9 operations = 0.0000000065 s/iteration.
By anybody's standards, a whole 5 billionths of a second difference is hardly "huge". (Which was your assertion.)
And if the body of the loop did anything useful -- like call one or two of the huge macros or long twisty functions that are the reason for having the context within the sub in the first place -- then those 5 nanoseconds would just disappear into the noise.
- In the noctx, you are using the equally flawed Perl_get_context()
Which, as you point out, entirely swamps the call to TlsGetValue(), by bracketing it with (useless*) calls to GetLastError() and SetLastError().
As we discussed before, what last error are they preserving that is important enough to be preserved, but not important enough to be reported straight away?
And, if there is justification for preserving some system errors whilst ignoring others, why preserve them in OS memory, thus requiring every unimportant system call to be bracketed with GLE/SLE? Why not get the error just after the important system call that caused it and put it somewhere local?
That way, you do one GetLastError() call after each (significant) system call that you want to preserve, rather than bracketing every insignificant system call with two other system calls.
My prime suspect for why TlsGetValue() doesn't get inlined is the fact that it is bracketed by those other two calls. I'd love to see you add a 3rd test to your benchmark that calls TlsGetValue() directly. I'm not saying it will be inlined, but even if it isn't, it would reduce the (already nanoscopic) difference quite considerably.
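To make the two strategies concrete, here is a minimal single-threaded simulation in C. Nothing in it is the Win32 API or Perl source: get_last_error(), set_last_error(), tls_get_value() and significant_call() are hypothetical stubs, and each stub simply counts as one "system call". The one behaviour borrowed from the real world is that the documented TlsGetValue() resets the thread's last error on success, which is presumably what the bracketing guards against.

```c
#include <assert.h>

/* Hypothetical stand-ins, NOT the Win32 API: each stub counts as one call. */
static unsigned long last_error;      /* models the thread's last-error value */
static void *tls_slot;                /* models the TLS slot holding my_perl  */
static unsigned long calls_made;      /* how many "system calls" were issued  */

static unsigned long get_last_error(void)   { calls_made++; return last_error; }
static void set_last_error(unsigned long e) { calls_made++; last_error = e; }

/* Like the documented TlsGetValue(), success resets the last error to 0. */
static void *tls_get_value(void) { calls_made++; last_error = 0; return tls_slot; }

/* A "significant" call that fails with error code 5. */
static void significant_call(void) { calls_made++; last_error = 5; }

/* Strategy 1: bracket every context fetch so the OS-held error survives. */
static void *get_context_bracketed(void)
{
    unsigned long saved = get_last_error();
    void *ctx = tls_get_value();
    set_last_error(saved);
    return ctx;
}

static unsigned long report_error_bracketed(int fetches)
{
    significant_call();
    for (int i = 0; i < fetches; i++)
        (void)get_context_bracketed();
    return get_last_error();          /* the error is still held by the "OS" */
}

/* Strategy 2: capture the error once, locally, right after the call that set it. */
static unsigned long report_error_captured(int fetches)
{
    significant_call();
    unsigned long err = get_last_error();
    for (int i = 0; i < fetches; i++)
        (void)tls_get_value();        /* free to clobber the OS-held value */
    assert(fetches == 0 || last_error == 0);   /* and it does clobber it */
    return err;
}
```

Both functions report the same error code, but with n context fetches the bracketed version issues 3n + 2 stub calls against n + 2 for the capture-once version.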
Most significantly -- you've tested something quite different from what I was trying to describe.
The reason functions need to have visibility of the context is that some of the functions they call require it to be passed to them.
This requirement is often hidden by wrapping the functions that need it in macros. You know better than I do how grossly unwieldy many of the wrapper macros get.
There is a common pattern to many of the worst ones, that goes something like this:
    #define SOMETHING1 STMT_START { \
        assert( something ); \
        if(some_complex_condition) wrapped_function1( aTHX_, ... ); \
        assert(something_else ) \
    } STMT_END
    #define SOMETHING2 STMT_START { \
        assert( something ); \
        if(some_complex_condition) wrapped_function2( aTHX_, ... ); \
        assert(something_else ) \
    } STMT_END
    #define SOMETHING3 STMT_START { \
        assert( something ); \
        if(some_complex_condition) wrapped_function3( aTHX_, ... ); \
        assert(something_else ) \
    } STMT_END
    int someFunction( aTHX_ ... ) {
        dATHX;
        ...;
        SOMETHING1( ... );
        ...;
        SOMETHING2( ... );
        ...;
        SOMETHING3( ... );
        RETURN;
    }
The logic being (I assume) that by testing the conditions inline, you prevent the call overhead for the cases where the condition(s) fail.
But a simple test shows that it isn't the case:
With x1(), 50% of calls are avoided by an inline conditional test.
With x2(), that test is moved into the body of the function, which returns immediately if the test fails.
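The test source itself does not appear in the post, so the following is only a guess at its shape, not the code that produced the numbers quoted here. The names x1() and x2() come from the description above; the accumulator, the entry counters and the even/odd condition are assumptions made so the sketch is self-checking. A real measurement would wrap each run_* loop in a timer or cycle counter and would need to stop the compiler inlining x1() and x2().

```c
#include <assert.h>

static unsigned long x1_entries, x2_entries;   /* how often each body is entered */

/* x1: the caller has already tested the condition. */
static void x1(unsigned long i, unsigned long *acc)
{
    x1_entries++;
    *acc += i;
}

/* x2: called unconditionally; returns at once when the test fails. */
static void x2(unsigned long i, unsigned long *acc)
{
    x2_entries++;
    if (i & 1)
        return;
    *acc += i;
}

/* Inline condition: half of the calls are avoided at the call site. */
static unsigned long run_inline_condition(unsigned long n)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        if (!(i & 1))
            x1(i, &acc);
    return acc;
}

/* In-body condition: every iteration pays for the call. */
static unsigned long run_inbody_condition(unsigned long n)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        x2(i, &acc);
    return acc;
}

/* Both variants must do the same useful work, or timing them is meaningless. */
static void check_equivalent(unsigned long n)
{
    assert(run_inline_condition(n) == run_inbody_condition(n));
}
```

For n iterations x1() is entered n/2 times and x2() n times, while both loops accumulate the same sum, which is the property the timing comparison relies on.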
Compile & run:
C:\test>cl /Ox calloverhead.c
Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
calloverhead.c
Microsoft (R) Incremental Linker Version 9.00.21022.08
Copyright (C) Microsoft Corporation. All rights reserved.
/out:calloverhead.exe
calloverhead.obj
C:\test>calloverhead 10000000
Inline condition: 60,068,106
Inbody condition: 45,064,458
C:\test>calloverhead 10000000
Inline condition: 60,037,515
Inbody condition: 45,084,879
C:\test>calloverhead 10000000
Inline condition: 60,048,828
Inbody condition: 45,057,681
C:\test>calloverhead 10000000
Inline condition: 60,032,691
Inbody condition: 45,032,724
The inline condition takes 1/3rd more cycles than putting the test inside the body of the function call!
And if the conditional tests are inside the body of the functions, you no longer need the macro wrappers -- which makes things a lot clearer for the programmer.
And you also don't need access to the context in all the callers of the wrapped functions, so then the called function can obtain the context internally, thus removing it from visibility at the caller's level.
And the code size shrinks because the conditional test appears once inside the function rather than at every call site.
That's a 3 way win, with no downsides.
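A minimal before/after sketch of that refactoring, in plain C. Everything here is hypothetical: interp stands in for the interpreter struct, get_context() for Perl_get_context(), and SOMETHING/something for one of the wrapper macros and its function replacement. The lookup counter is there so the cost side of the trade can be seen as well as the benefit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the interpreter and Perl_get_context(). */
typedef struct interp { int total; } interp;
static interp the_interp;
static unsigned long context_lookups;

static interp *get_context(void) { context_lookups++; return &the_interp; }

/* Before: the macro tests inline, so every caller must hold the context. */
static void wrapped_function(interp *ctx, int n) { ctx->total += n; }
#define SOMETHING(ctx, n) \
    do { if ((n) > 0) wrapped_function((ctx), (n)); } while (0)

static void caller_with_macro(const int *vals, size_t len)
{
    interp *ctx = get_context();      /* fetched only to be forwarded */
    for (size_t i = 0; i < len; i++)
        SOMETHING(ctx, vals[i]);
}

/* After: a plain function; the test and the context lookup live in the body. */
static void something(int n)
{
    if (n <= 0)
        return;                       /* cheap exit, no context needed */
    interp *ctx = get_context();
    assert(ctx != NULL);
    ctx->total += n;
}

static void caller_without_context(const int *vals, size_t len)
{
    for (size_t i = 0; i < len; i++)
        something(vals[i]);
}
```

In this sketch the refactored caller never sees the context, and the lookup is only paid when the condition holds; the price is one lookup per successful call instead of one per caller, which is the few-nanosecond cost being argued about in this thread.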
The point is that you cannot take one single aspect of the overall vision, mock it up into a highly artificial benchmark and draw conclusions. You have to consider the entire picture.
Of course, it is never going to happen, so there is little point in arguing about it; but if you did effect this kind of change throughout the code base; along with all the other stuff we discussed elsewhere; the effects can be significant.
The hope for using LLVM to compile the Perl runtime is that by re-writing the macro-infested C sources to IR, and combining them with the current compilation unit of Perl code that uses them -- also suitably compiled to IR -- it can see through both the macros and the disjoint runloop, and find optimisations on a case-by-case basis that cannot be made universally.
That is to say (by way of example), a piece of code that uses no magic, and only IVs or UVs, may qualify for optimisations that could not be made statically by a C compiler, because -- given the current structure of the pp_* opcode functions -- the compiler could never possibly see them; it always has to allow for the possibility of magic, NVs, PVs, et al.
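As a toy illustration of that kind of specialisation (and nothing more): the value struct below is a hypothetical, drastically simplified stand-in for a Perl scalar, and add_generic() is not pp_add. It only shows the shape of the argument: the generic routine has to test for magic and for non-integer operands on every call, whereas a routine specialised for proven integer-only, magic-free code does not.

```c
#include <assert.h>

/* Hypothetical, drastically simplified stand-in for a Perl scalar. */
typedef enum { T_IV, T_NV } vtype;
typedef struct { vtype type; int has_magic; long iv; double nv; } value;

static unsigned long magic_calls;     /* counts simulated get-magic invocations */

static double as_nv(const value *v)
{
    return v->type == T_IV ? (double)v->iv : v->nv;
}

/* Generic add: must allow for magic and non-integer operands on every call. */
static value add_generic(const value *a, const value *b)
{
    value r = { T_IV, 0, 0, 0.0 };
    assert(a && b);
    if (a->has_magic) magic_calls++;  /* a real runtime would run get-magic here */
    if (b->has_magic) magic_calls++;
    if (a->type == T_IV && b->type == T_IV) {
        r.iv = a->iv + b->iv;         /* overflow handling omitted */
    } else {
        r.type = T_NV;
        r.nv = as_nv(a) + as_nv(b);
    }
    return r;
}

/* Specialised add: only valid where both operands are known to be plain
   integers with no magic, so none of the checks above are needed. */
static long add_iv_only(long a, long b) { return a + b; }
```

Whether an LLVM-based pipeline could actually prove those preconditions for real Perl code is exactly the open question; the sketch only shows what would be gained if it could.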
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP Neil Armstrong
Re^12: Use perl type without perl
by bulk88 (Priest) on Sep 27, 2012 at 08:09 UTC
By anybody's standards, a whole 5 billionths of a second difference is hardly "huge". (Which was your assertion.)
And if the body of the loop did anything useful -- like call one or two of the huge macros or long twisty functions that are the reason for having the context within the sub in the first place -- then those 5 nanoseconds would just disappear into the noise.
The 5 nanoseconds cannot disappear into the noise. They are not free. If they ran, they cost time. Whether it is 1 ns vs 5 ns, or 1 ms vs 5 ms, or 1 minute vs 5 minutes, the smaller choice is better. Let's see how many times Perl_get_context is called for the simplest of Perl programs (PGC = Perl_get_context).
C:\p517\perl\win32>perl -e "for(0..5) {print 'hello world'.\"\n\";};"
hello world
hello world
hello world
hello world
hello world
hello world
PGC count 1096
C:\p517\perl\win32>perl -e "print 'hello world'";
hello worldPGC count 1069
C:\p517\perl\win32>perl -e "system('pause');"
Press any key to continue . . .
PGC count 1133
C:\p517\perl\win32>
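How the "PGC count" lines were produced is not shown. One plausible way to get such a tally, sketched here purely as an assumption, is to bump a counter inside the context getter and print it from an atexit() handler. counted_get_context() and current_context are hypothetical names, and the sketch ignores thread safety.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned long pgc_count;       /* calls seen so far in this process  */
static void *current_context;         /* stand-in for the real TLS lookup   */

static void report_pgc_count(void)
{
    printf("PGC count %lu\n", pgc_count);
}

/* Hypothetical instrumented getter: counts every call and registers the
   exit-time report the first time it is used. Not thread-safe. */
static void *counted_get_context(void)
{
    if (pgc_count == 0) {
        int rc = atexit(report_pgc_count);   /* print the tally at process exit */
        assert(rc == 0);
        (void)rc;
    }
    pgc_count++;
    return current_context;
}
```

With something like this in place, every process that touches the getter prints one "PGC count" line when it exits, which would match the one-line-per-perl-invocation pattern in the transcripts.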
Now let's try to compile an XS module.
C:\Documents and Settings\Owner\Desktop\cpan libs\Win32API\g>perl makefile.pl
Checking if your kit is complete...
Warning: the following files are missing in your kit:
api-test/Release/API_test.dll
api-test/Release/API_test.lib
Please inform the author.
Writing Makefile for Win32::API::Callback
Writing MYMETA.yml and MYMETA.json
Writing Makefile for Win32::API
Writing MYMETA.yml and MYMETA.json
PGC count 337157
C:\Documents and Settings\Owner\Desktop\cpan libs\Win32API\g>nmake install
Microsoft (R) Program Maintenance Utility Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
PGC count 29225
PGC count 29077
PGC count 29098
PGC count 29131
PGC count 29077
PGC count 29098
PGC count 29228
PGC count 29077
PGC count 29098
PGC count 29229
PGC count 29077
PGC count 29098
PGC count 29131
PGC count 29077
PGC count 29098
PGC count 29131
PGC count 29077
PGC count 29098
PGC count 29131
PGC count 29077
PGC count 29098
PGC count 29131
PGC count 29077
PGC count 29098
cp Type.pm blib\lib\Win32/API/Type.pm
cp Callback.pm blib\lib\Win32/API/Callback.pm
cp Test.pm blib\lib\Win32/API/Test.pm
cp Struct.pm blib\lib\Win32/API/Struct.pm
cp API.pm blib\lib\Win32/API.pm
cp IATPatch.pod blib\lib\Win32/API/Callback/IATPatch.pod
PGC count 135573
PGC count 29098
nmake -f Makefile all -nologo
PGC count 29084
PGC count 29077
PGC count 29098
PGC count 29132
PGC count 29077
PGC count 29098
PGC count 29132
PGC count 29077
PGC count 29098
PGC count 29098
C:\perl517\bin\perl.exe C:\perl517\lib\ExtUtils\xsubpp -typemap C:\perl517\lib\ExtUtils\typemap Callback.xs > Callback.xsc && C:\perl517\bin\perl.exe -MExtUtils::Command -e mv -- Callback.xsc Callback.c
PGC count 153957
PGC count 29098
cl -c -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -G7 -DWIN32 -D_CONSOLE -DNO_STRICT -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D_USE_32BIT_TIME_T -MD -Zi -DNDEBUG -O1 -G7 -DVERSION=\"0.71\" -DXS_VERSION=\"0.71\" "-IC:\perl517\lib\CORE" Callback.c
Callback.c
Running Mkbootstrap for Win32::API::Callback ()
PGC count 1097
PGC count 11211
PGC count 29098
C:\perl517\bin\perl.exe -MExtUtils::Command -e chmod -- 644 Callback.bs
PGC count 29077
C:\perl517\bin\perl.exe -MExtUtils::Mksymlists -e "Mksymlists('NAME'=>\"Win32::API::Callback\", 'DLBASE' => 'Callback', 'DL_FUNCS' => { }, 'FUNCLIST' => [], 'IMPORTS' => { }, 'DL_VARS' => []);"
PGC count 13035
link -out:..\blib\arch\auto\Win32\API\Callback\Callback.dll -dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"c:\perl517\lib\CORE" -machine:x86 Callback.obj C:\perl517\lib\CORE\perl517.lib oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib -def:Callback.def
Creating library ..\blib\arch\auto\Win32\API\Callback\Callback.lib and object ..\blib\arch\auto\Win32\API\Callback\Callback.exp
if exist ..\blib\arch\auto\Win32\API\Callback\Callback.dll.manifest mt -nologo -manifest ..\blib\arch\auto\Win32\API\Callback\Callback.dll.manifest -outputresource:..\blib\arch\auto\Win32\API\Callback\Callback.dll;2
if exist ..\blib\arch\auto\Win32\API\Callback\Callback.dll.manifest del ..\blib\arch\auto\Win32\API\Callback\Callback.dll.manifest
C:\perl517\bin\perl.exe -MExtUtils::Command -e chmod -- 755 ..\blib\arch\auto\Win32\API\Callback\Callback.dll
PGC count 29077
PGC count 35327
C:\perl517\bin\perl.exe -MExtUtils::Command -e cp -- Callback.bs ..\blib\arch\auto\Win32\API\Callback\Callback.bs
PGC count 37097
C:\perl517\bin\perl.exe -MExtUtils::Command -e chmod -- 644 ..\blib\arch\auto\Win32\API\Callback\Callback.bs
PGC count 29077
cd ..
C:\perl517\bin\perl.exe C:\perl517\lib\ExtUtils\xsubpp -nolinenumbers -typemap C:\perl517\lib\ExtUtils\typemap -typemap typemap API.xs > API.xsc && C:\perl517\bin\perl.exe -MExtUtils::Command -e mv -- API.xsc API.c
PGC count 156717
PGC count 29098
cl -c -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -G7 -DWIN32 -D_CONSOLE -DNO_STRICT -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D_USE_32BIT_TIME_T -MD -Zi -DNDEBUG -O1 -G7 -DVERSION=\"0.71\" -DXS_VERSION=\"0.71\" "-IC:\perl517\lib\CORE" API.c
API.c
c:\Documents and Settings\Owner\Desktop\cpan libs\Win32API\g\call_i686.h(44) : warning C4101: 'pReturn' : unreferenced local variable
API.c(341) : warning C4047: '=' : 'SV *' differs in levels of indirection from 'SV ** '
Running Mkbootstrap for Win32::API ()
PGC count 1097
PGC count 11209
PGC count 29097
C:\perl517\bin\perl.exe -MExtUtils::Command -e chmod -- 644 API.bs
PGC count 29077
C:\perl517\bin\perl.exe -MExtUtils::Mksymlists -e "Mksymlists('NAME'=>\"Win32::API\", 'DLBASE' => 'API', 'DL_FUNCS' => { }, 'FUNCLIST' => [], 'IMPORTS' => { }, 'DL_VARS' => []);"
PGC count 13027
link -out:blib\arch\auto\Win32\API\API.dll -dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"c:\perl517\lib\CORE" -machine:x86 API.obj C:\perl517\lib\CORE\perl517.lib oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib -def:API.def
Creating library blib\arch\auto\Win32\API\API.lib and object blib\arch\auto\Win32\API\API.exp
if exist blib\arch\auto\Win32\API\API.dll.manifest mt -nologo -manifest blib\arch\auto\Win32\API\API.dll.manifest -outputresource:blib\arch\auto\Win32\API\API.dll;2
if exist blib\arch\auto\Win32\API\API.dll.manifest del blib\arch\auto\Win32\API\API.dll.manifest
C:\perl517\bin\perl.exe -MExtUtils::Command -e chmod -- 755 blib\arch\auto\Win32\API\API.dll
PGC count 29077
PGC count 35325
C:\perl517\bin\perl.exe -MExtUtils::Command -e cp -- API.bs blib\arch\auto\Win32\API\API.bs
PGC count 37097
C:\perl517\bin\perl.exe -MExtUtils::Command -e chmod -- 644 blib\arch\auto\Win32\API\API.bs
PGC count 29077
Files found in blib\arch: installing files in blib\lib into architecture dependent library tree
Installing C:\perl517\site\lib\auto\Win32\API\API.dll
Installing C:\perl517\site\lib\auto\Win32\API\API.exp
Installing C:\perl517\site\lib\auto\Win32\API\API.lib
Installing C:\perl517\site\lib\auto\Win32\API\API.pdb
Installing C:\perl517\site\lib\auto\Win32\API\Callback\Callback.dll
Installing C:\perl517\site\lib\auto\Win32\API\Callback\Callback.exp
Installing C:\perl517\site\lib\auto\Win32\API\Callback\Callback.lib
Installing C:\perl517\site\lib\auto\Win32\API\Callback\Callback.pdb
PGC count 140759
PGC count 6563
Appending installation info to c:\perl517\lib/perllocal.pod
PGC count 1103
PGC count 29084
PGC count 6653
C:\Documents and Settings\Owner\Desktop\cpan libs\Win32API\g>
The highest per-process Perl_get_context count was 337157. Assuming 4 ns wasted in PGC vs. a C-stack my_perl, 1.34 ms of CPU was wasted in that process. nmake install took 11 seconds by the clock on the wall. Adding up all the PGC counts comes to 2472747; times 4 ns, that is 9.8 ms. I am aware that 9.8 ms is only 0.08% of the 11 seconds it took to run nmake install. I still say death by a 1000 cuts!
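For anyone who wants to redo that arithmetic, here is a small C sketch. The helper names wasted_ms() and percent_of_run() are made up for the illustration; the inputs (337157 and 2472747 calls, 4 ns per call, 11 s wall clock) are the figures quoted above, and the 4 ns itself is an assumption, not a measurement.

```c
#include <assert.h>

/* Extra time, in milliseconds, for `calls` lookups costing `extra_ns` more each. */
static double wasted_ms(unsigned long calls, double extra_ns)
{
    assert(extra_ns >= 0.0);
    return (double)calls * extra_ns / 1e6;
}

/* That time as a percentage of a wall-clock run lasting `wall_s` seconds. */
static double percent_of_run(double ms, double wall_s)
{
    assert(wall_s > 0.0);
    return ms / (wall_s * 1000.0) * 100.0;
}
```

wasted_ms(337157, 4.0) comes to about 1.35 ms and wasted_ms(2472747, 4.0) to about 9.89 ms; percent_of_run() puts the latter at just under 0.09% of an 11 second run, the same order as the 0.08% quoted.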
Now about TlsGetValue directly.
C:\Documents and Settings\Owner\Desktop\cpan libs\lxs>perl -MLocal::XS -e "Local::XS::CxtSpeed();"
no cxt 47237378
cxt 13202140
no cxt no last error 22624295
C:\Documents and Settings\Owner\Desktop\cpan libs\lxs>
Remember that getting PL_thr_key takes 2 asm dereferences in an XS DLL vs. 1 inside the interpreter itself.
My next reply will address your letter C.
I am aware that 9.8 ms is only 0.08% of the 11 seconds
Sorry, but you've answered yourself there. 0.08% is
- 1 extra second on a webpage that takes 20 minutes to load.
- 1 extra minute on a batch job that takes a whole day.
- 1 extra hour on a process that takes 2 months to run.
And that's ignoring all the other benefits -- already listed above -- of not having to pass this God-reference around everywhere.
I still say death by a 1000 cuts!
More like 200,000,000 (1/0.000000005) itsy-bitsy, teeny-weeny paper cuts.
You (and the implementers of dTHX/aTHX_ et al.) are fiddling while Rome burns. The very definition of premature optimisation.
Re^12: Use perl type without perl
by bulk88 (Priest) on Sep 26, 2012 at 23:04 UTC
Has to use a huge multiplier -- 1 billion iterations -- to make a mountain out of a molehill.
Let's say the total runtime was the upper of your vague estimate. 8 seconds.
Which means:
- 'cxt' took 1.5 seconds for 1e9 operations = 0.0000000015 s/iteration.
- 'noctx' took 6.5 seconds for 1e9 operations = 0.0000000065 s/iteration.
By anybody's standards, a whole 5 billionths of a second difference is hardly "huge". (Which was your assertion.)
Irrelevant. Would you recommend eliminating SSE1-4 since the difference is only a billionth of a second between an x87 and an SSE* operation? As an infamous monk on PM likes to say, CPU usage is irrelevant because you did I/O. Not true; a modern PC is not running MS-DOS. If my CPU is loaded to 100% (getting your money's worth from the hardware), I absolutely would like the process, any process, to complete its work in fewer cycles. Every cycle saved means a free cycle for the next process to run in, or less energy usage, either for battery or power bill, since the CPU is sent into a low-power state by the kernel until the next interrupt. The few times I've played with the kernel debugger, 100% of breaks landed in the MS NT kernel's http://doxygen.reactos.org/d7/d08/arm_2thrdini_8c_source.html#l00153, which is a good thing.
In the noctx, you are using the equally flawed Perl_get_context()
Which, as you point out, entirely swamps the call to TlsGetValue(), by bracketing it with (useless*) calls to GetLastError() and SetLastError().
As we discussed before, what last error are they preserving that is important enough to be preserved, but not important enough to be reported straight away?
And, if there is justification for preserving some system errors whilst ignoring others, why preserve them in OS memory, thus requiring every unimportant system call to be bracketed with GLE/SLE? Why not get the error just after the important system call that caused it and put it somewhere local?
That way, you do one GetLastError() call after each (significant) system call that you want to preserve, rather than bracketing every insignificant system call with two other system calls.
Win32 Perl's architecture emulates various parts of POSIX in win32.c or in MS's CRT. This layer is less than ideally designed; it's actually crap IMO. A design choice was made, probably for budget reasons, not to change the function signatures and not to pass my_perl, but to use dTHX instead (dPERLOBJ = dTHX today). (One of many commits that adds dTHX everywhere: http://perl5.git.perl.org/perl.git/blobdiff/4f4e629e089f1120f8e94984281df06ac4f885c5..0cb9638729211ea71a75ae8756c03ba21553bd53:/win32/win32.c ) Originally dTHX was what you wanted, a plain macro to TlsGetValue (see http://perl5.git.perl.org/perl.git/blob/ea0efc06fdad2019ffceb86d079dd853e9d79cea:/win32/win32thread.h#l81 ), but soon after, in http://perl5.git.perl.org/perl.git/commit/ba869debd80c55cfae8e9d4de0991d62f9efcb9b?f=win32/win32thread.c , LastError was added with no explanation or code comments. You can try asking Jan Dubois about the LastError saving, or should I start a Kickstarter project to hire a medium for Perl? (Not fair; Gurusamy isn't dead, just retired.) IDK if Jan would be able to answer the question using internal records at AS; I couldn't find anything on http://bugs.activestate.com/.
My prime suspect for why TlsGetValue() doesn't get inlined is the fact that it is bracketed by those other two calls. I'd love to see you add a 3rd test to your benchmark that calls TlsGetValue() directly. I'm not saying it will be inlined, but even if it isn't, it would reduce the (already nanoscopic) difference quite considerably.
The most likely reason that TlsGetValue is not inlined is that it would break the ABI between releases of Windows. TLS lives in the TEB struct, an undocumented struct. It is different between DOS Windows and the NT kernel, and probably different between different versions of the NT kernel. On the topic of Win32 API calls that should be inlined but are not, InterlockedCompareExchange started being inlined in VS 2005 or VS 2008 (I saw it personally in VS 2008). My VS 2003 does not inline InterlockedCompareExchange and always calls into Kernel32. Per Google, x86 cmpxchg was added in the 486. Windows 95/98 were designed with 386 compatibility; in WinME, disassembly shows InterlockedCompareExchange is a kernel call with a system service table number. I'll guess and say it is NOT implemented with lock cmpxchg. Some trivia: SHInterlockedCompareExchange, new in shlwapi v5 (from IE 5), is implemented as lock cmpxchg; I think this shlwapi v5 was intended to run on NT and DOS Win, but on 486 and newer. There is also a Win16 IE 5, but I don't have time to RE it. Perhaps in Win16 you don't even need InterlockedCompareExchange at all, since all context switches are 100% voluntary and there are no threads.
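For readers who have not met the primitive, here is a portable sketch of the compare-exchange contract using C11 atomics rather than the Win32 call. compare_exchange() is a hypothetical wrapper written for this note; it follows the InterlockedCompareExchange argument order (destination, exchange, comparand) and, like it, returns the value that was in the destination beforehand. On x86 targets compilers typically lower this to a lock cmpxchg instruction, though that is up to the compiler.

```c
#include <assert.h>
#include <stdatomic.h>

/* Store `exchange` into *dest only if *dest currently equals `comparand`.
   Returns the value *dest held before the operation, whether or not the
   store happened. */
static long compare_exchange(atomic_long *dest, long exchange, long comparand)
{
    long expected = comparand;
    assert(dest != 0);
    /* On failure, `expected` is overwritten with the value actually found;
       on success it still equals the old value, i.e. `comparand`. */
    atomic_compare_exchange_strong(dest, &expected, exchange);
    return expected;
}
```

A caller detects success by comparing the return value with the comparand it passed in, which is the usual idiom with the Win32 function as well.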
I've only written about the 1st 1/2 of your post. I'll analyze the last half of your post and update this post soon with your C code and a no last error TlsGetValue test.
Irrelevant. Would you recommend eliminating SSE1-4 since the difference is only a billionth of a second between an x87 and an SSE* operation?
That is a complete red herring.
The point is that if paying the cost of 5 nanoseconds in one place enables the saving of 50 or 500 nanoseconds somewhere else, the trade off is eminently worth while.
I'm always in favour of optimising code that gets reused by many projects with as many different performance criteria as the Perl runtime; but you have to target your optimisations. And obsessing about 5 nanoseconds in one place without considering the wider implications -- the greater net gain optimisation possibilities that will be disabled -- by opting for a given micro-optimisation, is naive and shortsighted.
Win32 Perl's architecture emulates various parts of POSIX in win32.c or in MS's CRT. This layer is less than ideally designed; it's actually crap IMO.
Ah! Something we can agree on. :)
As for the Perl development archaeology, it is of little interest. We are where we are now. How we got here doesn't matter.
The only interesting questions are:
- Going forward, is there any way to improve what we have now?
I've expounded at length that I believe that there are ways to improve the current status quo.
But, I feel that to do so would require considerably more radical changes than are currently ever considered viable.
The problem -- of perl's lack-lustre performance -- needs to first be tackled top down, root and branch, looking at what Perl expends most of its cycles doing; and how that might be improved.
Only once the top-down flow of code has been improved would it be worth doing bottom up micro-optimisations.
- Is there the collective will to tackle the task?
I think recent related discussions here answer that question.
I fail to see any relevance -- to anything -- in all your discussion of long-dead versions of Windows.
As I said above, in the wider scheme of things, the 5 nanoseconds we are discussing here are irrelevant.