Re^4: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?

by dave_the_m (Parson)
on Aug 28, 2012 at 22:30 UTC ( #990345=note )


in reply to Re^3: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
in thread Perl 5 Optimizing Compiler, Part 4: LLVM Backend?

discounting only the amount of work needed to fully implement Perl5-to-LLVM, would you please tell me which part of BrowserUK's 3-phase idea is not producing the "I see how that might work!" epiphany in your mind?
Well, since the current perl interpreter has neither an AST, nor bytecode, I don't really understand the proposal.

But my basic issue in this case is how do you get the 5X speedup rather than the 10%? I understand in great detail exactly what C-level code the perl interpreter executes while running an ops loop (and the ops within it). No-one has yet explained to me in a way I can understand how all the necessary stuff to carry out those actions will somehow go faster.

For example, to do a subroutine call in perl, you need to do a whole bunch of stuff like

  • set up @_ and copy the stack items to it;
  • set up a context stack entry so that return etc know what to do;
  • create a new scope stack level;
  • bump the reference count of the live CV so that it can't be freed in mid-execution;
  • make the appropriate pad the current one;
  • etc.
All (or at least the vast majority) of this has still to be done, unless the perl interpreter is radically redesigned somehow. At the moment all this is done by a single, tuned C function. I don't see LLVM making it significantly faster.

Dave


Re^5: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by Will_the_Chill (Pilgrim) on Aug 29, 2012 at 04:05 UTC
    Dave,

    If the current Perl 5 interpreter has neither an AST nor bytecode representation, then what is the intermediate form created by the parser and executed by the runtime backend?

    I've heard the term "optree", is that it?

    Are we not considering the B::Bytecode system?

    Thanks,
    ~ Will
      Perl internally parses the code into a tree of OP structures (the "optree"). This is sort of like an AST without the "A" part. But by its nature it's never been designed to be easily manipulable in the way an AST is supposed to be. Also, the structure and specification of the parsed program isn't contained within the optree; it's also spread out among stashes, globs, pads etc.

      When runtime is reached, each OP contains a pointer to the next op in execution sequence (op_next), plus a pointer to a C function that can carry out the action of that op (op_ppaddr). The main execution loop of perl (the "runloop") consists of calling the op_ppaddr function of the current op, then setting the current op to be whatever that function returns; repeat until a NULL is returned.

      OPs (and the pp* functions which implement them) are unlike bytecode: bytecode consists of small, lightweight ops, with the expectation that they can be easily converted into equivalent native machine code via JIT etc. Or to put it another way, they don't contain much switching logic themselves: switching is done by adding in switching bytecodes. A tracing JIT can then follow what paths are taken through the bytecode, pick a certain path (no switching), and convert that sequence of bytecode into a sequence of machine instructions.

      Perl ops are heavyweight: within each one there may be a lot of switching action. For example, the perl "add" op examines its two args: checks if they're overloaded, or tied, or have other magic associated with them, and if so handles that. Then checks whether the args are strings or other non-num things, and if so numifies them. Then adds them. Then checks for overflow: if overflow has occurred, see if the overflow could be avoided by upgrading from integer to float, and if so, do so.

      (update: so what I meant to say at this point is that the perl optree is a bastard hybrid of an AST and bytecode, and has to serve both purposes, not always comfortably)

      B::Bytecode is just a way to serialise the optree (and associated state) into a platform-neutral file. To execute it, the file is read in and used to reconstruct the optree and state, then execution continues as before. It provides no runtime speedup, but was intended to speed startup by skipping the compilation phase; in practice, though, few gains were seen.

      Dave

        Dave,

        I appreciate the optree lesson! :)

        I've used B::Bytecode, so I know that it doesn't always improve startup time (but sometimes it does).

        Now I think I understand a bit more about the importance of moving the magic ("switching"?) from the ops to the SVs, as well as the difficulty of converting from optree to some intermediate form like an actual AST or bytecode or anything useful outside of Perlguts.

        I'm sure this info will help in our development planning.

        Thanks,
        ~ Will
Re^5: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by BrowserUk (Pope) on Aug 29, 2012 at 09:30 UTC

    When, in order to call this:

    void _24to32( SV* packed ) {
        IS_VARS;
        char *pp = SvPVX( packed );
        _4BY6 up;
        int i;
        IS_RESET;
        for( i = 0; i < 24; i += 3 ) {
            up.packed = _byteswap_ulong( *(unsigned long*)&pp[ i ] );
            IS_PUSHUV( up.u.a );
            IS_PUSHUV( up.u.b );
            IS_PUSHUV( up.u.c );
            IS_PUSHUV( up.u.d );
        }
        IS_DONE;
        return;
    }

    You have to go through this lot:

    void XS_main__24to32(register PerlInterpreter* my_perl, CV* cv);
    void XS_main__24to32(register PerlInterpreter* my_perl, CV* cv)
    {
        extern int Perl___notused;
        SV **sp = (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context())));
        I32 ax = (*(*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context())))--);
        register SV **mark = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context()))) + ax++;
        I32 items = (I32)(sp - mark);
    #line 179 "_24to32.c"
        if (items != 1)
            Perl_croak_xs_usage(((PerlInterpreter *)Perl_get_context()), cv, "packed");
        ((void)ax);
        sp -= items;
        {
            SV * packed = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context())))[ax + (0)];
    #line 117 "_24to32.xs"
            I32* temp;
    #line 188 "_24to32.c"
    #line 119 "_24to32.xs"
            temp = (*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context())))++;
            _24to32(packed);
            if ((*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context()))) != temp) {
                (*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context()))) = temp;
                do { do {
                    const IV tmpXSoff = (0);
                    (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context()))) = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context()))) + ax + (tmpXSoff - 1);
                    return;
                } while (0); } while (0);
            }
            return;
    #line 199 "_24to32.c"
            (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context()))) = sp;
            return;
        }
    }

    and this lot:

    void _24to32( SV* packed )
    {
        SV **sp = (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context())));
        I32 ax = (*(*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context())))--);
        register SV **mark = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context()))) + ax++;
        I32 items = (I32)(sp - mark);
        char *pp = ((packed)->sv_u.svu_pv);
        _4BY6 up;
        int i;
        sp = mark;
        for( i = 0; i < 24; i += 3 ) {
            up.packed = _byteswap_ulong( *(unsigned long*)&pp[ i ] );
            do { do { if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) { sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1)); } } while (0); (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()), Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.a)))); } while (0);
            do { do { if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) { sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1)); } } while (0); (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()), Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.b)))); } while (0);
            do { do { if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) { sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1)); } } while (0); (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()), Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.c)))); } while (0);
            do { do { if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) { sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1)); } } while (0); (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()), Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.d)))); } while (0);
        }
        (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context()))) = sp;
        return;
    }

    You really don't see any opportunities for some radical optimisations?

    And remember, that is positively lightweight compared to the unoptimised C code produced for all of Perl's internal opcodes. Existing C compilers may be able to optimise some of that lot away on a function-by-function basis, but how much?

    And now consider the possibilities of allowing the optimiser to look at *all* the internal functions and look for really radical optimisations?

    Consider the possibilities of LTO and whole program analysis to lift whole chunks of that boiler plate above up to the runloop level on a per interpreter basis?

    Then consider the possibilities of JIT noticing that the context doesn't (cannot) change across a whole swath of runtime code and reserving a register, say one of the unused segment registers, for it and using register relative addressing for each interpreter?

    And then consider that LLVM doesn't have to follow C rules. It can invent weird stuff that C compilers (and C programmers) wouldn't even think of -- see my earlier example of it converting (at the IR level) an array of 52 shorts into a single 832-bit integer.

    What might it do with the whole SVt_* type hierarchy? Imagine (for sake of totally speculative example) that it could reduce *all* the SV flag tests & sets to a single, simple bit manipulation at some calculated bit offset into a huge integer. Is that possible? Would it result in substantial savings if used uniformly throughout the code base?

    Does any of this intrigue you? Even if only so that you can say: I told you so?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      +1 Inspiring! XD
      What you have shown is XS code compiled using the most inefficient, backwards-compatible mode.

      If you compile it with #define PERL_NO_GET_CONTEXT at the top of the file, you'll find that all those calls to Perl_get_context() are avoided. Similarly, using perl 5.14.0 or later removes all those Perl_Istack_base_ptr()-style function calls.

      All those checks on stack size could be replaced by using a single EXTEND(24) at the start of the function.

      All those calls to create mortals could be removed by returning an array rather than a list. Etc.

      A lot of the macros designed for use by XS are less efficient (depending on circumstances) than the macros used in the perl core, due to the need to ensure backwards compatibility, or to insulate the XS code from the internal details of the core, etc.

      If you look at the actual code in the hot pp ops, you'll find it's hardly unoptimised. And if a modern C compiler can't cope with a few extra no-op {} scopes added by perl's macros, there's something wrong with it. Show me some actual assembly of a function in pp_hot.c that's being significantly crippled by the design of perl's macros, and I'll start to take it seriously.

      Other than that, everything you've talked about is wild speculation so far.

      Dave.

        What you have shown is XS code compiled using the most inefficient, backwards-compatible mode. If you compile it with ... All those checks ... All those calls ... macros designed for use by XS are less efficient ...

        So, what you are saying is: if every XS-dabbling programmer became an XS expert, learnt all the rules and tricks and techniques, and then they all modified all of their modules and programs, then things would run faster.

        Wouldn't it be nice if we had tools that took care of that?

        And if a modern C compiler can't cope with a a few extra no-op {} scopes added by perl's macros, there's something wrong with it.

        If C compilers were able to optimise away all that stuff, then wouldn't #define PERL_NO_GET_CONTEXT be unnecessary? An effective noop under optimisation?

        Compile-time optimisers cannot optimise across compile unit boundaries. C compilers are beginning to do LTO, but at a very limited level.

        If you used LLVM simply as an alternative C compiler, it couldn't do much more than modern C compilers do, but it is capable of doing so much more.

        Show me some actual assembly of a function in pp_hot.c that's being significantly crippled by the design of perl's macros, and I'll start to take it seriously.

        You could show me ...

        But who decided what was "hot"? On the basis of tracing what code?

        If the functions in pp_hot.c have come in for some special attention that has proven demonstrably worth that effort, isn't it possible that programs that use functions outside of pp_hot.c might benefit if the functions they use came in for similar treatment?

        And isn't it just possible that a radically different (un-C-like) alternative like LLVM might be able to make pp_hot.c type changes elsewhere, in an automated manner?

        And maybe even find other changes that C compilers and programmers wouldn't even consider?

        Other than that, everything you've talked about is wild speculation so far.

        Agreed. Speculation based on some two years (on and off) of looking at what LLVM can, and is, doing, but still speculation. And clearly labeled as such.

        And it will remain that way until someone tries it. (Aren't you in the least bit intrigued?)



