Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^9: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?

by BrowserUk (Pope)
on Aug 29, 2012 at 15:51 UTC ( #990498=note: print w/ replies, xml ) Need Help??


in reply to Re^8: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
in thread Perl 5 Optimizing Compiler, Part 4: LLVM Backend?

(And you can get most of the gains on the 'bad' XS code just by using 5.14.0 or later, and using a non-threaded (or non-multiplicity) build.

If the gains are there to be had -- and they are -- with a sufficiently wide field of view, then they can be applied at LTO time. With argument promotion and function re-writing and SSA lifting.

Or, for local subtrees of code, they can be applied at runtime using JIT with a sufficiently deep view.

In other words, non-multiplicity efficiencies selectively applied ot subtrees of code on multiplicity builds running single threaded processing. Or even, subtrees of threaded coded that doesn't access shared data.

Note also that LLVM is very unlikely to be able to able to optimise away any of the get_context() calls in the XS code.

Why? If they can be #defined away, why can they not be optimised away?

I anticipate your answer is: because the compiler cannot see widely enough to know to do it. And that is very definitely true for C compilers doing compilation unit optimisations.

But for a non-C optimiser running whole program analysis and/or SSA-analysis, there is no reason it couldn't lift the get_context up to whatever level it last changed, and make it a Single Static Assignment. And perhaps place it in a register.

If you believe differently, the onus ...
  • It would be the work of minutes for you; and (probably) days for me.

    Besides which, I do not feel either of us should feel any "onus", because ... see next bullet point.

    But as you suggested that seeing it might clarify something ...

  • All it would prove, is that on one compilation, with one configuration, on one platform something did or did not get optimised.

    The beauty of the LLVM approach to optimisation is that it (can) produce optimisations in a platform and compiler independent manner -- at its IF level -- that then get translated into whatever is the best machine-dependent implementation at compile-time or link-time or runtime.

Well, my speculation, based on 11 years experience of working on the perl core, is that improvements with LLVM will come into the "10% better" category rather than the "5x better" or "perl runs at C speed" categories. Which is where this all started. Don't get me wrong, I'm not opposed to things like LLVM; I just haven't been convinced yet that they can offer game-changing improvements.

I only joined this thread belatedly, to counter the negative energy of "it couldn't work", "it wouldn't work", "it cannot be done", and "it would be a waste of energy to try".

And please ignore any possible negative implications of this, but I did so because I saw -- and see -- people thinking of LLVM as simply another C compiler. And citing "failed" projects where it has been used in exactly that way. All I'm trying to do is to get people to stop thinking in C; read a little of the tutorial information, and consider the possibilities.

Your knowledge and experience is -- and would be -- invaluable to such a project, even if only in an advisory capacity if that's all you have the time and interest for. And all I'm trying to do here is persuade you (and others) to consider the possibilities. If I can speculate with my dearth of knowledge of the internals, what might someone like you see, if you were of a mind to understand that LLVm is nothing like any other "compiler"?

>Another mind's eye speculation for you to consider:

Much of the costly code in PP_* routines is there to deal with the 5 or 10% cases. Imagine if -- manually for now -- each SV was born marked (or better unmarked) as carrying magic or not. Say a single bit in the (unused) least-significant 3 or 4 bits of the SVs address, was used to flag when an SV had magic attached to it.

And then imagine a whole duplicate set of pp_* routines that never even considered the possibility of magic, and only called other functions that similarly never considered magic.

And then in the runloop, a one-time inspection was made and the appropriate chain of pp_no_mg_* or normal pp_* routines was dispatched to.

Do you see any gains to be had from running the 90% or 95% (?) of code that doesn't use magic, through a set of opcodes that not only don't test for magic, but also do not carry all the extra variables and conditional branch code that can confuse and overpower the C optimiser and/or the CPU branch predictor?

Please read http://llvm.org/pubs/2004-09-22-LCPCLLVMTutorial.pdf, and see that LLVM was able to do stuff not unlike the above 8 years ago. And things have moved on. A lot!

(There is a fairly detailed walk through of a moderately complex example starting at slide 28, but you would probably need to read the preceding to understand the nomenclature.)

Again, no offense is meant by this, but maybe you have been so close to the Perl sources for so long, that you view them the way a C compiler does. And so cannot imagine the possibilities of looking at them not as C source, or even C compiler generated machine code, but rather with Explicit SSA dataflow; and an explicit control-flow graph; and explicit language-independent type information; and (most importantly in the case of Perl), explicitly typed pointer arithmetic; in conjunction with an infinite SSA register set and loads/stores with typed pointers.

I can see that I do not have the gravitas to convince you. So, please, put aside my speculations and read the .pdf, and allow it to engender your own speculations about the possibilities. Because your speculations would be so much more insightful and therefore useful than mine.

And, if you read it, it will engender your speculations.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Comment on Re^9: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
Re^10: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by dave_the_m (Parson) on Aug 29, 2012 at 17:27 UTC
    (I'll read the pdf later and come back and comment on it further, but for now):

    Imagine if -- manually for now -- each SV was born marked (or better unmarked) as carrying magic or not. Say a single bit in the (unused) least-significant 3 or 4 bits of the SVs address, was used to flag when an SV had magic attached to it.

    But magic is a dynamic, runtime thing, not something that can be determined at the time an SV is created. For example:

    use Devel::Peek; my $x = "abc"; Dump $x; # here $x is not magic $x =~ /./g; Dump $x; # now it is

    So I don't see how that could work.

    Also, in most binary perl ops, there is a macro call at the top of each function, tryAMAGICbin_MG(), that ORs the flags fields of the top two SVs on the stack together, then checks to see if the combined flags word indicates that either SV is magic or is overloaded, and if so, calls a function which handles this. The rest of the pp_ function can continue while assuming that its args aren't special:

    #define tryAMAGICbin_MG(method, flags) STMT_START { \ if ( ((SvFLAGS(TOPm1s)|SvFLAGS(TOPs)) & (SVf_ROK|SVs_GMG)) \ && Perl_try_amagic_bin(aTHX_ method, flags)) \ return NORMAL; \ } STMT_END

    Consider the following trivial code:

    my $x = 1; my $y = 2; my $z; for (1..1000) { $z = $x + $y; }

    cachgrind shows that the tryAMAGICbin_MG() line at the top of pp_add() does 7 instruction reads, 3 data reads and 1 branch per call (with no significant cache or branch predictor fails), while the pp_add() function as a whole does 85 instruction reads. So even if there was some way to know in advance to call a non-magic / non-overload version of pp_add (and I don't think there is) the best improvement you could theoretically achieve is 9%.

    So you've shown me an example that (a) can't work, and (b) even if it did, doesn't show me that much could be gained.

    Don't get me wrong, I want to be inspired; but I just haven't seen anything yet that indicates we could get more than that 10%.

    Dave

      But magic is a dynamic, runtime thing, not something that can be determined at the time an SV is created.

      I realise that -- but don't get hung up on my speculations. They are not proposals to be refuted. I'm making them up as I go, based upon the types of things LLVM can do, and my extremely limited knowledge of the internals.

      That's exactly why it is important to get someone like you, with your depth of knowledge, to look at LLVM in detail. Once you understand the possibilities, your speculations would be so much more targeted than mine.

      cachgrind shows that the tryAMAGICbin_MG() line at the top of pp_add() does 7 instruction reads, 3 data reads and 1 branch per call (with no significant cache or branch predictor fails), while the pp_add() function as a whole does 85 instruction reads. So even if there was some way to know in advance to call a non-magic / non-overload version of pp_add (and I don't think there is) the best improvement you could theoretically achieve is 9%.

      So you've shown me an example that (a) can't work, and (b) even if it did, doesn't show me that much could be gained.

      Yes. But you've only considered one level deep in a simple opcode. In something like pp_substr there is so much conditional branching and test for magic and utf (and other stuff), on multiple parameters, and then calls to upgrades and ... you know what's in there far better than I

      Its not just the single opcodes, but all the calls made from within them that make these types of checks. It is the application of, the wide view that Whole-Program Analysis and LTO and forgetting C language rules, can bring to the table that could be substantial, not beating the human programmer on individual routines.

      Programmer reputedly can remember 5 or 7 or 8 things at any given moment; programs can build entire graphs and find patterns spanning those graphs. They don't forget things.

      want to be inspired; but I just haven't seen anything yet that indicates we could get more than that 10%.

      I think that's because you haven't yet seen the full potential of LLVM, and I'm failing to describe it well enough and my attempts to speculate are clumsy. Please do read the pdf.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong

      pmsig
      I'm excited to see what Dave thinks after he reads the LLVM docs! :)
Re^10: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by dave_the_m (Parson) on Aug 29, 2012 at 17:46 UTC
    I forgot to reply to this specific point:

    Me: Note also that LLVM is very unlikely to be able to able to optimise away any of the get_context() calls in the XS code.

    You: Why? If they can be #defined away, why can they not be optimised away?

    get_context() requires (on threaded builds) to retrieve a value from thread-local storage. This requires (at least the last time I looked in the linux pthreads library) the current stack pointer to be compared against the stack range allocated to each current thread, to determine which thread we are. I'd be amazed if LLVM could determine that the value won't change between calls!

    PERL_NO_GET_CONTEXT causes the value (the address of the current perl interpreter) to be passed as an extra argument to each XS function, rather than calling get_context() every time a function in the perl library needs to be called.

    Dave

      I'd be amazed if LLVM could determine that the value won't change between calls!

      Sorry, but I do not understand this. One thread == one interpreter == one stack. Code cannot call between threads, and threads do not switch stacks (easily).

      If the address of the current interpreter was stored in (say, on Intel) the FS or GS register, when the scheduler preempts one thread and loads another, it would restore that register, just as it does for the SP and IP and all the others. Every time a particular thread/interpreter is running, its context would be available in that register; and (on Intel) all the offsets within that context structure would be available as indexed addressing via that segment register.

      On other processors, mostly register rich compared to Intel, the same thing could be done.

      Basically, the 'reserved' register is loaded with the context when the thread (including the first) is started, and if nothing else is allowed to touch that register -- pretty much guaranteed on Intel -- it value is always available without needing to pass it around to every function.

      LLVM might need a custom addition to its code generation routines (for each platform), to support that, but LLVM allows that to be done.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong

        The Perl stack has nothing to do with the C stack. And a Perl interp has no connection to a particular OS thread, just that it can only be on 1 OS thread at any given time (except when you press Ctrl-C in VC/Mingw Perl, 10% of the time you crash or panic since the interp is running in 2 different threads). A Perl interp can be moved between OS threads whenever and as much as anyone wants aslong as PERL_SET_CONTEXT is called correctly to update the TLS portably. There is no normal way on X86 Windows to have a thread specific global register unfortunately. Changing FS/GS register is quite dangerous since you will crash if you call anything in (kernel/user/gdi)32 and FS/GS isn't the TIB *. A very adventurous compiler might be crazy enough to do things like reuse ESP (I said ESP, not EBP) and FS registers on Windows. That compiler would probably have to have anonymous developers because of all the death threats and malpractice accusations they will get for writing a virus making compiler. Of course a compiler is free to use any calling convention for statics. Perhaps LLVM can be used to create an abstract Perl specific calling convention that is portable to different OSes and CPUs, and on X86 Windows would result in a thread specific global register.
      This requires (at least the last time I looked in the linux pthreads library) the current stack pointer to be compared against the stack range allocated to each current thread, to determine which thread we are.

      As an aside, I don't understand why the context needs to be passed from function to function to function anyway?

      On windows, you allocate a global uint, assign the return from DWORD tlsindex = TlsAlloc() to it at process startup, and then whenever thread needs access to its TLS, it calls TlsSetValue( tlsIndex, pointer ); and struct context p = TlsGetValue( tilsIndex );.

      And I read that pthreads has the equivalent pthread_key_create(), pthread_setspecific(), pthread_getspecific() calls which perate in a similar fashion.

      So why does context have to be passed around, and reasserted in every routine, even those that (apparently) do not make any ised of it or what it points to?

      I really do not understand why the perl context cannot be retrieved from TLS when it is needed, rather than passed around like a baton?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong

        I really do not understand why the perl context cannot be retrieved from TLS when it is needed, rather than passed around like a baton?
        Because on linux pthreads at least, retrieving a value from TLS is very expensive (using the "match the SP to a thread" trick). Passing an extra parameter is a lot quicker.

        As to why they don't reserve a register for it, you'd have to ask them. But I would speculate that it boils down to the fact that you can't assume that all code from all libraries compiled by all the different possible compilers on all the different platforms all know and agree that they're not allowed to touch register X. If this had been agreed on day 1, then maybe; not so much now.

        Dave.

        The perl context is passed around as pTHX and aTHX as a C param. Only legacy/abandoned XS code and the C Lib emulators in Perl call Perl_get_context() nowadays. Not using PERL_NO_GET_CONTEXT for all XS code in the last 10 years is a sin. Then again, most CPAN authors don't care (I'm staring at you Win32:: namespace).

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://990498]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (13)
As of 2014-08-27 18:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (248 votes), past polls