http://www.perlmonks.org?node_id=995611


in reply to Re^7: Use perl type without perl
in thread Use perl type without perl

Passing my_perl on the stack/register is a lot faster than the GLR and SLR and TlsGetValue calls. If it were worth it, any C compiler would put my_perl back on the C stack whenever it wasn't being used for a long stretch of code. But every PL_* that is written, plus more PL_*s hidden in various object-like macros, is a dereference of my_perl, so there is almost never such a stretch. It's basically impossible.
And (again guessing) other than to pass it on to perl internal functions they call, 90%+ of user XS functions make no use of the context handle they get passed.
Nope. Every XS function starts with this, but most of these values except for sp and ax are tossed very soon by the C compiler because of liveness analysis. ix is added by ParseXS only for ALIAS: xsubs.
    SV **sp = (my_perl->Istack_sp);
    I32 ax = (*(my_perl->Imarkstack_ptr)--);
    register SV **mark = (my_perl->Istack_base) + ax++;
    I32 items = (I32) (sp - mark);
    I32 ix = ((XPVCV *) ((void *) ((cv)->sv_any)))->xcv_start_u.xcv_xsubany.any_i32;
I took the OP to mean that he would like to use some of the perl data types in a completely non-Perl piece of code. Just as a convenient source of an efficient hash implementation. And to that end, it ought to be possible to use them without perl contexts coming into it at all. The fact that it isn't possible is symptomatic of one (of many) of the anachronisms in the way the perl sources are laid out.
How would you handle DESTROYers, cache invalidators, fixed-bucket memory allocators and the shared string table/shared HEKs (and maybe things that I don't know of), all of which could be called when you get, set or remove a hash slice? It sounds like the OP wants to separate C parts out of Perl and use them without an interp. The other monks made responses that I read as "that's perfect, 1 catch, no threads", when that is not true. Unless I'm wrong about the implementation of a no-threads perl, and upon loading the DLL the interp always initializes itself from DllMain, in which case the perlembed stuff isn't necessary. From what I read a while ago, in a no-threads perl, the perl interp struct (my_perl) is allocated in the .data/RW global data section of the DLL. Through the export table you get my_perl everywhere in your code, since it's an extern C global.
I was completely baffled by your reference to GObject at first, but now I'm guessing that you are suggesting that if the OP wants a ready source of a hash implementation, GLib comes with one he might find more convenient than perl's? That could be more clearly stated.
GObject and Perl's GC systems have many similarities. Someone might say "I use Glib only for reference counting/other things Perl provides, that sucks, GLib is fat, let me try the faster/better/super/less bloated Perl GC system without Perl in my pure C app". I am saying that would be a bad choice. Glib and Perl both have their places. They are not interchangeable. I could have written that and replace the word GObject with COM and compared COM to Perl. Perl is not a standard library for pure C apps. It is not NSPR or GLib.

Re^9: Use perl type without perl
by BrowserUk (Patriarch) on Sep 25, 2012 at 19:10 UTC
    Passing my_perl on the stack/register is a lot faster than the GLR and SLR and TlsGetValue calls.

    Ignoring that I don't know what "GLR and SLR" are -- and you do not bother to explain them -- I'd be interested to see proof of that "much faster". Faster I have no doubts, but much faster?

    See "TlsGetValue was implemented with speed as the primary goal."

    And I counter that assertion with: speed isn't everything.

    Burdening every function, and every programmer, with the need to accommodate a 'pass-through variable' and relying upon the optimiser to make it disappear when not required -- all to save what effectively becomes something like mov rax, GS:[8*rcx+0x2c] -- is short-sighted in the extreme.

    And wrapping it over in a bunch of "trick" macros makes the programmer burden -- via cognitive disconnection -- even worse.

    I took the OP to mean...

    How would you handle ...

    I wouldn't. Just because I took the OP to mean that doesn't mean that I think that it is a good idea, or even possible.

    I only attempted to answer -- at perhaps a superficial level -- the OP's question: "I'm curious why non-threaded perl can do what thread perl can't do.". Naught more.

    Which is why I think your post would have been better directed at the alternative you suggested.

    GObject and Perl's GC systems have many similarities. ... I am saying that would be a bad choice.

    Then why mention it? No one else did.

    Aren't you just as guilty of misdirection by bringing it up and leaving it hanging as the guy that suggested: "you'll be fine so long as you don't use threads"? Which seems to be the focus of your posts.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      Ignoring that I don't know what "GLR and SLR" are -- and you do not bother to explain them -- I'd be interested to see proof of that "much faster". Faster I have no doubts, but much faster?

      See "TlsGetValue was implemented with speed as the primary goal."

      And I counter that assertion with: speed isn't everything.

      My new motto, death by a thousand cuts.

      Win XP's TlsGetValue has 3 branches in its asm, 1 of them on the found-value path. On the found-value path, other than mandatory stack frame maintenance, it does: deref the FS register; deref the index value on the C stack; cmp the index value to a const; cond jump; and [TEB in a regular register + offset] with const 0 (SetLastError = 0); move [TEB in a regular register + index reg + SIB-encoded constant] to eax; return. That is a total of 11 machine opcodes executed, stack frame maintenance included. It also ignores the SetLastError and GetLastError done by Perl before and after TlsGetValue. Now time for some real world numbers.
      void
      CxtSpeed()
      PREINIT:
          LARGE_INTEGER start;
          LARGE_INTEGER end;
          int i;
      PPCODE:
          QueryPerformanceCounter(&start);
          for(i=0; i < 1000000000; i++){
              no_cxt();
          }
          QueryPerformanceCounter(&end);
          printf("no cxt %I64u\n", end.QuadPart-start.QuadPart);
          QueryPerformanceCounter(&start);
          for(i=0; i < 1000000000; i++){
              cxt(aTHX);
          }
          QueryPerformanceCounter(&end);
          printf("cxt %I64u\n", end.QuadPart-start.QuadPart);

      //separate compiland/obj file
      #define PERL_NO_GET_CONTEXT
      #include <EXTERN.h>
      #include <perl.h>
      #include <XSUB.h>

      __declspec(dllexport) int no_cxt(){
          dTHX;
          return ((int) my_perl) >> 1;
      }

      __declspec(dllexport) int cxt(pTHX){
          return ((int) my_perl) >> 1;
      }

      #in makefile.pl hash to WriteMakefile
      FUNCLIST => ['no_cxt', 'cxt'],
      dynamic_lib => {
          OTHERLDFLAGS => ' noopt.obj ',
          INST_DYNAMIC_DEP => 'noopt.obj'
      },
      Make sure to check the disassembly to make sure it wasn't all inlined and optimized away. The cxt() loop was completely removed in my 1st try.
      C:\Documents and Settings\Owner\Desktop\cpan libs\lxs>perl -MLocal::XS -e "Local::XS::CxtSpeed();"
      no cxt 48160819
      cxt 11096124

      C:\Documents and Settings\Owner\Desktop\cpan libs\lxs>
      The whole script took about 5-8 seconds. One Perl_get_context call took 4.3 times as long as passing my_perl on the C stack, and of course everything tested fit in L1 the whole time. I would much rather have my_perl on the C stack or in a register (compiler's choice) than call Perl_get_context half a dozen or more times in every Perl C function. If you want to know why TlsGetValue is never optimized away to inline assembly, ask MS. I didn't write Kernel32.
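For readers without a Win32 build to hand, the two call styles being timed can be sketched portably. This is only an illustration under assumptions: `interp`, `set_context`, `get_context`, `depth_no_cxt` and `depth_cxt` are hypothetical names, the struct is not Perl's real interpreter layout, and a C11 `_Thread_local` variable stands in for the TlsGetValue slot, so it will not reproduce the numbers above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the interpreter struct (not Perl's real layout). */
typedef struct interp {
    long stack_depth;
} interp;

/* One context pointer per thread, playing the role of the TLS slot. */
static _Thread_local interp *current_interp = NULL;

/* Record the calling thread's context (assumed to be done once at startup). */
void set_context(interp *ctx) { current_interp = ctx; }

/* Rough analogue of Perl_get_context(): a lookup made on every use. */
interp *get_context(void) { return current_interp; }

/* Style 1: no context parameter, so the function has to look it up. */
long depth_no_cxt(void)
{
    interp *ctx = get_context();
    assert(ctx != NULL);
    return ctx->stack_depth;
}

/* Style 2: the caller passes the context along, as pTHX/aTHX do. */
long depth_cxt(interp *ctx)
{
    assert(ctx != NULL);
    return ctx->stack_depth;
}
```

Both functions return the same value once `set_context()` has been called on the current thread; the only difference is where the pointer comes from. As with the real benchmark, an optimiser may inline `get_context()`, so any timing built on this needs a look at the disassembly first.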

      I'm surprised it only took 4 times longer. GetLastError is 3 opcodes, SetLastError is 8 opcodes, TlsGetValue is 11 opcodes and Perl_get_context is 13 opcodes with no branches, all with stack frame maintenance included. Add 3 opcodes for no_cxt itself and the no_cxt path totals 38 opcodes, against a total of 3 opcodes for cxt(), stack frame maintenance included. So no_cxt took 12.6 times more opcodes than cxt, yet only 4.3 times more time. Did my superscalar Core 2 collapse all those function calls into one function call with IPO when it recompiled the x86 asm to micro-op asm, or did the branch predictor plus cache dirty flag checking remove the code? IDK, but interesting numbers anyway. In any case my_perl in a register/C stack wins.

      Which is why I think your post would have been better directed at the alternative you suggested.

      Should I delete my post and post it to the other post?
      GObject and Perl's GC systems have many similarities. ... I am saying that would be a bad choice.

      Then why mention it? No one else did.

      The OP was vague and didn't concisely explain anything, so I had to consult with my crystal ball that I got at the pound shop made from lead wiring, chip board and bitumen, to read the OP's mind. My crystal ball said he is using Perl in a DLL that doesn't link with Perl but includes Perl's headers for Perl's GC. Should I use my O'Reilly brand tarot cards in the future?

      Aren't you just as guilty of misdirection by bringing it up and leaving it hanging as the guy that suggested: "you'll be fine so long as you don't use threads"? Which seems to be the focus of your posts.

      I gave an answer
      Someone who didn't read the manual will think they can use Perl C data structures without a "useless" Perl around instead of GObject
      If you read the manual (perlapi/perlguts/illguts/perlembed/perlxs), you will know that using Perl without an initialized interp is not supported by Perl.

        Separating the useful from the non-useful. I reply to the first part of your post separately.

        Should I delete my post and post it to the other post?

        No. But had you aimed more carefully, you might be righting the wrong you perceived by discussing it with the guy that perpetrated it, rather than having this (pointless part of this) discussion with me.

        The OP was vague and didn't concisely explain anything, ...

        So, to correct the wrong of bad information -- that someone else posted -- you supplied some equally bad information, in reply to me?

        Let's call this part of the discussion a misunderstanding and close it.



        First, thank you for arguing with numbers. It is a rare event and most welcomed.

        But -- you knew that was coming, right? -- your benchmark:

        1. Has to use a huge multiplier -- 1 billion iterations -- to make a mountain out of a molehill.

          Let's say the total runtime was the upper of your vague estimate. 8 seconds.

          Which means:

          'cxt' took 1.5 seconds for 1e9 operations = 0.0000000015 s/iteration.
          'noctx' took 6.5 seconds for 1e9 operations = 0.0000000065 s/iteration.

          By anybody's standards, a whole 5 billionths of a second difference is hardly "huge". (Which was your assertion).

          And if the body of the loop did anything useful -- like call one or two of the huge macros or long twisty functions that are the reason for having the context within the sub in the first place -- then those 5 nanoseconds would just disappear into the noise.
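The 1.5 s and 6.5 s figures in point 1 appear to come from splitting the assumed 8 second total in proportion to the two tick counts printed by the benchmark. A small sketch of that arithmetic follows; `share_seconds` and `ns_per_iteration` are hypothetical helper names, and the 8 second total is the estimate used above, not a measurement.

```c
#include <assert.h>

/* Share of a total runtime attributed to one loop, in proportion to its
 * tick count relative to the other loop's tick count. */
double share_seconds(double total_s, double ticks_this, double ticks_other)
{
    assert(ticks_this + ticks_other > 0.0);
    return total_s * ticks_this / (ticks_this + ticks_other);
}

/* Average cost of one iteration, in nanoseconds. */
double ns_per_iteration(double loop_s, double iterations)
{
    assert(iterations > 0.0);
    return loop_s * 1e9 / iterations;
}
```

Fed the tick counts 48160819 and 11096124 and a total of 8.0 seconds, `share_seconds` gives roughly 6.5 s and 1.5 s, and `ns_per_iteration` over 1e9 iterations then gives roughly 6.5 ns and 1.5 ns, a difference of about 5 ns per call.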

        2. In the noctx, you are using the equally flawed Perl_get_context()

          Which, as you point out, entirely swamps the call to TlsGetValue(), by bracketing it with (useless*) calls to GetLastError() and SetLastError().

          As we discussed before, what Last error are they preserving, that is important enough to be preserved, not important enough to be reported straight away?

          And, if there is justification for preserving some system errors whilst ignoring others, why preserve them in OS memory, thus requiring every unimportant system call to be bracketed with GLE/SLE? Why not get the error just after the important system call that caused it and put it somewhere local?

          That way, you do one GetLastError() call after each (significant) system call that you want to preserve; rather than bracketing every insignificant system call with two other system calls.

          My prime suspect for why TlsGetValue() doesn't get inlined is the fact that it is bracketed by those other two calls. I'd love to see you add a 3rd test to your benchmark that calls TlsGetValue() directly. I'm not saying it will be inlined, but even if it isn't, it would reduce the (already nanoscopic) difference quite considerably.
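The "get the error just after the important call and put it somewhere local" idea from point 2 can be sketched with POSIX errno standing in for GetLastError()/SetLastError(). That substitution is an assumption made for portability, and `open_recording_error` and `housekeeping` are hypothetical names; none of this is how Perl does it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Try to open a file. On failure, copy errno into *saved_err immediately,
 * so that later calls are free to overwrite the global error slot. */
FILE *open_recording_error(const char *path, int *saved_err)
{
    FILE *fp;

    assert(saved_err != NULL);
    errno = 0;
    fp = fopen(path, "r");
    *saved_err = (fp == NULL) ? errno : 0;
    return fp;
}

/* An "insignificant" call that may clobber errno; nothing brackets it. */
void housekeeping(void)
{
    errno = 0;   /* simulates a later call overwriting the error slot */
}
```

The caller keeps the saved code for as long as it likes, and the unimportant calls in between need no save/restore pair around them. On a POSIX system, opening a path that does not exist should leave ENOENT in the saved copy even after `housekeeping()` has reset errno.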

        3. Most significantly -- you've tested something quite different from what I was trying to describe.

          The reason functions need to have visibility of the context, is because some of the functions they call, require it be passed to them.

          This requirement is often hidden by wrapping the functions that need it in macros. You know better than I do how grossly unwieldy many of the wrapper macros get.

          There is a common pattern to many of the worst ones, that goes something like this:

          #define SOMETHING1 STMT_START { assert( something ); if(some_complex_condition) wrapped_function1( aTHX_ ... ); assert( something_else ); } STMT_END
          #define SOMETHING2 STMT_START { assert( something ); if(some_complex_condition) wrapped_function2( aTHX_ ... ); assert( something_else ); } STMT_END
          #define SOMETHING3 STMT_START { assert( something ); if(some_complex_condition) wrapped_function3( aTHX_ ... ); assert( something_else ); } STMT_END

          int someFunction( pTHX_ ... ) {
              /* the context arrives via pTHX_, so no dTHX is needed here */
              ...;
              SOMETHING1( ... );
              ...;
              SOMETHING2( ... );
              ...;
              SOMETHING3( ... );
              RETURN;
          }

          The logic being (I assume) that by testing the conditions inline, you prevent the call overhead for the cases where the condition(s) fail.

          But a simple test shows that it isn't the case:

          With x1(), 50% of calls are avoided by an inline conditional test.

          With x2(), that test is moved into the body of the function, which returns immediately if the test fails.

          Compile & run:

          C:\test>cl /Ox calloverhead.c
          Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
          Copyright (C) Microsoft Corporation. All rights reserved.

          calloverhead.c
          Microsoft (R) Incremental Linker Version 9.00.21022.08
          Copyright (C) Microsoft Corporation. All rights reserved.

          /out:calloverhead.exe
          calloverhead.obj

          C:\test>calloverhead 10000000
          Inline condition: 60,068,106
          Inbody condition: 45,064,458

          C:\test>calloverhead 10000000
          Inline condition: 60,037,515
          Inbody condition: 45,084,879

          C:\test>calloverhead 10000000
          Inline condition: 60,048,828
          Inbody condition: 45,057,681

          C:\test>calloverhead 10000000
          Inline condition: 60,032,691
          Inbody condition: 45,032,724

          The inline condition takes 1/3rd more cycles than putting the test inside the body of the function call!

          And if the conditional tests are inside the body of the functions, you no longer need the macro wrappers -- which makes things a lot clearer for the programmer.

          And you also don't need access to the context in all the callers of the wrapped functions, so then the called function can obtain the context internally, thus removing it from visibility at the caller's level.

          And the code size shrinks because the conditional test appears once inside the function rather than at every call site.

          That's a 3 way win, with no downsides.
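The source of calloverhead.c does not appear in the thread, so the following is only a guess at the shape of the x1()/x2() comparison described in point 3. The helper names `work_unguarded`, `work_guarded` and `seconds_for`, the odd/even condition, and the use of `clock()` for timing are all assumptions; the original may well have counted cycles instead.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

static volatile uint64_t sink;   /* stops the optimiser deleting the work */

/* Callee with no test of its own; callers must guard the call (x1 style). */
void work_unguarded(uint64_t i) { sink += i; }

/* Callee that tests the condition itself and returns early (x2 style). */
void work_guarded(uint64_t i)
{
    if ((i & 1) == 0) return;
    sink += i;
}

/* x1: condition inline at the call site, so about half the calls are skipped. */
uint64_t x1(uint64_t n)
{
    sink = 0;
    for (uint64_t i = 0; i < n; i++)
        if (i & 1) work_unguarded(i);
    return sink;
}

/* x2: every iteration makes the call; the test lives inside the callee. */
uint64_t x2(uint64_t n)
{
    sink = 0;
    for (uint64_t i = 0; i < n; i++)
        work_guarded(i);
    return sink;
}

/* CPU seconds taken by one run of fn(n). */
double seconds_for(uint64_t (*fn)(uint64_t), uint64_t n)
{
    assert(fn != NULL);
    clock_t start = clock();
    (void)fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Both loops do the same work and return the same checksum, so `seconds_for(x1, n)` and `seconds_for(x2, n)` can be compared directly. Whether the in-body version comes out ahead will depend on the compiler and on whether it inlines the callees, so the disassembly is worth checking here too.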

        The point is that you cannot take one single aspect of the overall vision, mock it up into a highly artificial benchmark and draw conclusions. You have to consider the entire picture.

        Of course, it is never going to happen, so there is little point in arguing about it; but if you did effect this kind of change throughout the code base; along with all the other stuff we discussed elsewhere; the effects can be significant.

        The hope for using LLVM to compile the Perl runtime is that, by rewriting the macro-infested C sources to IR and combining them with the current compilation unit of Perl code that uses them (also suitably compiled to IR), it can see through both the macros and the disjoint runloop, and find optimisations on a case-by-case basis that cannot be made universally.

        That is to say, (by way of example), a piece of code that uses no magic, and only IVs or UVs, may qualify for optimisations that could not be made statically by a C compiler, because -- given the current structure of the pp_* opcode functions -- it could never possibly see them; as it always has to allow for the possibility of magic; and NVs; and PVs; et al.
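To make the IV-only example concrete, here is a toy contrast between a generic operation that must allow for every case and a specialised one that is only valid once the other cases have been ruled out. The `value` type, `has_magic` flag, `add_generic` and `add_int_only` are hypothetical and far simpler than a real SV or pp_add; magic handling and integer overflow are deliberately left out of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tagged value, far simpler than a real Perl SV. */
typedef enum { VAL_INT, VAL_NUM } val_kind;

typedef struct {
    val_kind kind;
    int      has_magic;               /* stand-in for "hooks attached" */
    union { int64_t i; double n; } u;
} value;

/* Generic add: has to consider magic and every type combination each time. */
value add_generic(value a, value b)
{
    value r = { VAL_NUM, 0, { 0 } };

    assert(!a.has_magic && !b.has_magic);   /* magic path omitted in this sketch */
    if (a.kind == VAL_INT && b.kind == VAL_INT) {
        r.kind = VAL_INT;
        r.u.i = a.u.i + b.u.i;
    } else {
        double x = (a.kind == VAL_INT) ? (double)a.u.i : a.u.n;
        double y = (b.kind == VAL_INT) ? (double)b.u.i : b.u.n;
        r.u.n = x + y;
    }
    return r;
}

/* Specialised add: valid only where analysis has shown that both operands
 * are magic-free integers, so all of the checks above disappear. */
int64_t add_int_only(int64_t a, int64_t b)
{
    return a + b;
}
```

A static C compiler looking at `add_generic` alone cannot drop the checks, because it cannot know what callers will pass. A compiler that sees the calling code as well could, in principle, substitute `add_int_only` at the sites where it can prove the operands are plain integers.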

