Has to use a huge multiplier -- 1 billion iterations -- to make a mountain out of a molehill.
Let's say the total runtime was the upper end of your vague estimate: 8 seconds.
'cxt' took 1.5 seconds for 1e9 operations = 0.0000000015 s/iteration.
'noctx' took 6.5 seconds for 1e9 operations = 0.0000000065 s/iteration.
By anybody's standards, a whole 5 billionths of a second difference is hardly "huge" (which was your assertion).
Irrelevant. Would you recommend eliminating SSE1-4 because the difference between an x87 and an SSE* operation is only a billionth of a second? As an infamous monk on PM likes to say, CPU usage is irrelevant because you did I/O. Not true; a modern PC is not running MS-DOS. If my CPU is loaded to 100% (getting your money's worth from the hardware), I absolutely would like the process, any process, to complete its work in fewer cycles. Every cycle saved means a free cycle for the next process to run in, or less energy used (battery or power bill), because the kernel sends the CPU into a low-power state until the next interrupt. The few times I've played with the kernel debugger, 100% of breaks landed in MS NT Kernel's http://doxygen.reactos.org/d7/d08/arm_2thrdini_8c_source.html#l00153 , which is a good thing.
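To put the quoted timings in cycles rather than seconds, here is a minimal sketch. The 3 GHz clock rate is purely my assumption for illustration (real clocks vary with turbo and power states), and the function names are made up for this post.

```c
#include <assert.h>

/* Average cost of one iteration, in nanoseconds. */
double ns_per_iteration(double total_seconds, double iterations)
{
    assert(iterations > 0.0);
    return total_seconds * 1e9 / iterations;
}

/* Convert nanoseconds to clock cycles at an ASSUMED clock rate in GHz.
   1 GHz is one cycle per nanosecond, so this is a plain multiply. */
double ns_to_cycles(double ns, double assumed_ghz)
{
    return ns * assumed_ghz;
}

/* Extra cycles spent per call by the slower variant. */
double extra_cycles_per_call(double fast_seconds, double slow_seconds,
                             double iterations, double assumed_ghz)
{
    double delta_ns = ns_per_iteration(slow_seconds, iterations)
                    - ns_per_iteration(fast_seconds, iterations);
    return ns_to_cycles(delta_ns, assumed_ghz);
}
```

With the figures quoted above (1.5 s and 6.5 s for 1e9 iterations), `extra_cycles_per_call(1.5, 6.5, 1e9, 3.0)` comes to about 15 cycles per call at the assumed 3 GHz. Small per call, but it is paid on every context lookup.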
In the noctx, you are using the equally flawed Perl_get_context()
Which, as you point out, entirely swamps the call to TlsGetValue(), by bracketing it with (useless*) calls to GetLastError() and SetLastError().
As we discussed before, what last error are they preserving that is important enough to be preserved, but not important enough to be reported straight away?
And, if there is justification for preserving some system errors whilst ignoring others, why preserve them in OS memory, thus requiring every unimportant system call to be bracketed with GLE/SLE? Why not get the error just after the important system call that caused it and put it somewhere local?
That way, you do one GetLastError() call after each (significant) system call that you want to preserve; rather than bracketing every insignificant system call with two other system calls.
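To make the two patterns in the quote concrete, here is a rough simulation. Nothing in it is the Win32 API: every `fake_*` name is a hypothetical stand-in that only counts how often the error accessors run, and the error code 5 is arbitrary. It is a sketch of the call-count argument, not of Perl's actual code.

```c
#include <assert.h>

/* Hypothetical stand-ins for the per-thread "last error" slot. */
unsigned long fake_last_error;
unsigned long accessor_calls;   /* how many get/set accessor calls were made */

unsigned long fake_get_last_error(void) { accessor_calls++; return fake_last_error; }
void fake_set_last_error(unsigned long e) { accessor_calls++; fake_last_error = e; }

/* "Insignificant" call that clobbers the slot as a side effect. Modeled on
   TlsGetValue, which is documented to clear the last error on success. */
void *fake_tls_get(void) { fake_last_error = 0; return &fake_last_error; }

/* "Significant" call that fails and leaves an (arbitrary) error code behind. */
int fake_significant_call(void) { fake_last_error = 5; return -1; }

/* Pattern A: bracket every insignificant call with get/set. */
void *get_context_bracketed(void)
{
    unsigned long saved = fake_get_last_error();
    void *ctx = fake_tls_get();
    fake_set_last_error(saved);
    return ctx;
}

unsigned long run_bracketed(int lookups)
{
    assert(lookups >= 0);
    fake_significant_call();
    for (int i = 0; i < lookups; i++)
        get_context_bracketed();
    return fake_get_last_error();   /* error survived in the shared slot */
}

/* Pattern B: read the error once, right after the call that set it,
   and keep it in a local; later lookups may clobber the slot freely. */
unsigned long run_captured(int lookups)
{
    assert(lookups >= 0);
    fake_significant_call();
    unsigned long err = fake_get_last_error();
    for (int i = 0; i < lookups; i++)
        fake_tls_get();
    return err;
}
```

Both functions report the same error, but pattern A makes 2 accessor calls per lookup plus one final read, while pattern B makes exactly one accessor call no matter how many lookups follow.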
Win32 Perl's architecture emulates various parts of POSIX, either in win32.c or in MS's CRT. This layer is less than ideally designed; it's actually crap, IMO. A design choice was made, probably for budget reasons, not to change the function signatures and not to pass my_perl, but to use dTHX instead (dPERLOBJ = dTHX today). One of many commits that add dTHX everywhere: http://perl5.git.perl.org/perl.git/blobdiff/4f4e629e089f1120f8e94984281df06ac4f885c5..0cb9638729211ea71a75ae8756c03ba21553bd53:/win32/win32.c
Originally dTHX was what you wanted, a plain macro to TlsGetValue (see http://perl5.git.perl.org/perl.git/blob/ea0efc06fdad2019ffceb86d079dd853e9d79cea:/win32/win32thread.h#l81 ), but soon after, in http://perl5.git.perl.org/perl.git/commit/ba869debd80c55cfae8e9d4de0991d62f9efcb9b?f=win32/win32thread.c , LastError was added with no explanation or code comments. You can try asking Jan Dubois about the LastError saving, or should I start a Kickstarter project to hire a medium for Perl? (Not fair, Gurusamy isn't dead, just retired.) IDK if Jan would be able to answer the question using internal records at AS; I couldn't find anything on http://bugs.activestate.com/
My prime suspect for why TlsGetValue() doesn't get inlined is the fact that it is bracketed by those other two calls. I'd love to see you add a 3rd test to your benchmark that calls TlsGetValue() directly. I'm not saying it will be inlined, but even if it isn't, it would reduce the (already nanoscopic) difference quite considerably.
The most likely reason that TlsGetValue is not inlined is that it would break the ABI between releases of Windows. TLS lives in the TEB struct, an undocumented struct. It is different between DOS Windows and the NT kernel, and probably different between versions of the NT kernel.
On the topic of Win32 API calls that should be inlined but are not, InterlockedCompareExchange started being inlined in VS 2005 or VS 2008 (I saw it personally in VS 2008). My VS 2003 does not inline InterlockedCompareExchange and always calls into Kernel32. Per Google, x86 cmpxchg was added in the 486. Windows 95/98 were designed with 386 compatibility; in WinME, disassembly shows InterlockedCompareExchange is a kernel call with a system service table number. I'll guess and say it is NOT implemented with lock cmpxchg. Some trivia: SHInterlockedCompareExchange, new in shlwapi v5 (from IE 5), is implemented as lock cmpxchg. I think this shlwapi v5 was intended to run on both NT and DOS Win, but only on 486 and newer. There is also a Win16 IE 5, but I don't have time to RE it. Perhaps in Win16 you don't even need InterlockedCompareExchange at all, since all context switches are 100% voluntary and there are no threads.
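For anyone who hasn't met the primitive: InterlockedCompareExchange compares the destination with a comparand, stores the new value only if they match, and returns the value the destination originally held. Below is a portable sketch of those semantics using C11 `<stdatomic.h>` rather than the Win32 call; `cas_long` is a hypothetical name, and on x86 I would expect a compiler to lower it to a lock cmpxchg, though that is up to the compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper with InterlockedCompareExchange-like semantics:
   if *dest == comparand, store exchange into *dest.
   Always returns the value *dest held before the operation. */
long cas_long(_Atomic long *dest, long exchange, long comparand)
{
    assert(dest != NULL);
    long expected = comparand;
    /* On success, expected is left untouched (it equals the old value).
       On failure, expected is overwritten with the current value of *dest.
       Either way it ends up holding the original value. */
    atomic_compare_exchange_strong(dest, &expected, exchange);
    return expected;
}
```

A caller detects success by checking whether the return value equals the comparand it passed in, which is the same convention the Win32 function uses.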
I've only written about the first half of your post. I'll analyze the second half and update this post soon with your C code and a no-last-error TlsGetValue test.