http://www.perlmonks.org?node_id=990335


in reply to Re^2: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
in thread Perl 5 Optimizing Compiler, Part 4: LLVM Backend?

dave_the_m & chromatic,

I think BrowserUK's 3-phase idea (pasted below) and Yuval's comments are both in the same vein. I think this general Perl5-to-LLVM concept is valid, and although it might very well take more than a year to fully implement, I believe we could have some kind of initial demo ready by YAPC::NA 2013 in Austin.

dave_the_m, discounting only the amount of work needed to fully implement Perl5-to-LLVM, would you please tell me which part of BrowserUK's 3-phase idea is not producing the "I see how that might work!" epiphany in your mind?

Thanks,
~ Will


BrowserUK wrote:

1. There is the bit of perl that parses the source code and builds the AST from the Perl program's source. This would need to be left pretty much as is. The whole perl-defines-Perl thing. This would need to be compiled (back to) C and then linked into an executable.

2. Then there is the bit that converts the AST into byte code. I believe that this would need to be separated out and converted to produce IF (LLVM's intermediate form).

3. Then there is the bit of perl that ordinarily runs the bytecode. Not just the runloop, but all the code the runloop dispatches to. I think that should be compiled to IF and built into an IF library (.bc). It would be "combined" with the IF from stage 2, at runtime, and given to the JIT to convert to machine code.

Replies are listed 'Best First'.
Re^4: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by chromatic (Archbishop) on Aug 29, 2012 at 04:25 UTC

    Perl 5 doesn't run slower than C because it lacks a JIT. Perl 5 runs slower than C because it's designed to be more flexible than C and that flexibility is not free. That flexibility includes but is not limited to:

    • reference counting
    • automatic destruction at scope exit
    • flexible typing and coercion
    • eval
    • dynamic dispatch
    • method dispatch
    • overloading
    • autoloading
    • importing
    • pragmas
    • monkeypatching
    • metaprogramming
    • closures
    • symbolic references
    • exceptions
    • dynamic data structures

    Add to that the facts that LLVM was not designed for dynamic languages and that Perl 5 was not designed for static optimizations. Using LLVM from Perl 5 without changing the internals of the Perl 5 VM will probably speed up code that's already written in C, but it certainly won't address most of the reasons why Perl 5 code can run slower than C.

      chromatic,

      I guess I'm a bit confused. Above you say:
      Perl 5 doesn't run slower than C because it lacks a JIT.
      But here you say:
      Add a tracing JIT. Iterate on that for a few years.
      So which is it? JIT good or JIT bad?

      ...

      Other than that potential logical inconsistency, I am in general agreement with your premises of Perl-runs-slow-because-it-is-dynamic and LLVM-was-not-designed-for-dynamic-languages. However, until a better backend target is determined, it looks like LLVM may be one of our best options.

      If we can clean up some of the Perl internals along the way to Perl5-on-LLVM, then we can either change the backend target to a more-dynamic-than-LLVM platform or just upgrade LLVM to support dynamic language features.

      Thanks,
      ~ Will
        So which is it? JIT good or JIT bad?

        That reads to me like a false dilemma.

        In the rest of the post where I said "Add a tracing JIT", I wrote about the other necessary improvements to take advantage of a tracing JIT.

        If we can clean up some of the Perl internals along the way to Perl5-on-LLVM, then we can either change the backend target to a more-dynamic-than-LLVM platform or just upgrade LLVM to support dynamic language features.

        The word "just" in there reads to me like an oversimplification.

        Have you read the Dragon book? Taken a compilers class? Written your own Scheme or Forth? Written an interpreter? I appreciate your enthusiasm, but I think you're committing to a schedule and a lot of engineering decisions before you've done enough research to know what goes into a project like this.

Re^4: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by dave_the_m (Monsignor) on Aug 28, 2012 at 22:30 UTC
    discounting only the amount of work needed to fully implement Perl5-to-LLVM, would you please tell me which part of BrowserUK's 3-phase idea is not producing the "I see how that might work!" epiphany in your mind?
    Well, since the current perl interpreter has neither an AST nor bytecode, I don't really understand the proposal.

    But my basic issue in this case is how do you get the 5X speedup rather than the 10%? I understand in great detail exactly what C-level code the perl interpreter executes while running an ops loop (and the ops within it). No-one has yet explained to me in a way I can understand how all the necessary stuff to carry out those actions will somehow go faster.

    For example, to do a subroutine call in perl, you need to do a whole bunch of stuff like

    • set up @_ and copy the stack items to it;
    • set up a context stack entry so that return etc know what to do;
    • create a new scope stack level;
    • bump the reference count of the live CV so that it can't be freed in mid-execution;
    • make the appropriate pad the current one;
    • etc.
    All (or at least the vast majority) of this has still to be done, unless the perl interpreter is radically redesigned somehow. At the moment all this is done by a single, tuned C function. I don't see LLVM making it significantly faster.

    Dave

      When in order to call this:

      void _24to32( SV* packed )
      {
          IS_VARS;
          char *pp = SvPVX( packed );
          _4BY6 up;
          int i;
          IS_RESET;
          for( i = 0; i < 24; i += 3 ) {
              up.packed = _byteswap_ulong( *(unsigned long*)&pp[ i ] );
              IS_PUSHUV( up.u.a );
              IS_PUSHUV( up.u.b );
              IS_PUSHUV( up.u.c );
              IS_PUSHUV( up.u.d );
          }
          IS_DONE;
          return;
      }

      You have to go through this lot:

      void XS_main__24to32(register PerlInterpreter* my_perl, CV* cv);
      void XS_main__24to32(register PerlInterpreter* my_perl, CV* cv)
      {
          extern int Perl___notused;
          SV **sp = (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context())));
          I32 ax = (*(*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context())))--);
          register SV **mark = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context()))) + ax++;
          I32 items = (I32)(sp - mark);
      #line 179 "_24to32.c"
          if (items != 1)
              Perl_croak_xs_usage(((PerlInterpreter *)Perl_get_context()), cv, "packed");
          ((void)ax);
          sp -= items;
          {
              SV * packed = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context())))[ax + (0)];
      #line 117 "_24to32.xs"
              I32* temp;
      #line 188 "_24to32.c"
      #line 119 "_24to32.xs"
              temp = (*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context())))++;
              _24to32(packed);
              if ((*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context()))) != temp) {
                  (*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context()))) = temp;
                  do { do {
                      const IV tmpXSoff = (0);
                      (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context()))) =
                          (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context()))) + ax + (tmpXSoff - 1);
                      return;
                  } while (0); } while (0);
              }
              return;
      #line 199 "_24to32.c"
              (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context()))) = sp;
              return;
          }
      }

      and this lot:

      void _24to32( SV* packed )
      {
          SV **sp = (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context())));
          I32 ax = (*(*Perl_Imarkstack_ptr_ptr(((PerlInterpreter *)Perl_get_context())))--);
          register SV **mark = (*Perl_Istack_base_ptr(((PerlInterpreter *)Perl_get_context()))) + ax++;
          I32 items = (I32)(sp - mark);
          char *pp = ((packed)->sv_u.svu_pv);
          _4BY6 up;
          int i;
          sp = mark;
          for( i = 0; i < 24; i += 3 ) {
              up.packed = _byteswap_ulong( *(unsigned long*)&pp[ i ] );
              do { do {
                  if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) {
                      sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1));
                  }
              } while (0);
              (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()),
                  Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.a)))); } while (0);
              do { do {
                  if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) {
                      sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1));
                  }
              } while (0);
              (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()),
                  Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.b)))); } while (0);
              do { do {
                  if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) {
                      sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1));
                  }
              } while (0);
              (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()),
                  Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.c)))); } while (0);
              do { do {
                  if ((*Perl_Istack_max_ptr(((PerlInterpreter *)Perl_get_context()))) - sp < (int)(1)) {
                      sp = Perl_stack_grow(((PerlInterpreter *)Perl_get_context()), sp, sp, (int)(1));
                  }
              } while (0);
              (*++sp = (Perl_sv_2mortal(((PerlInterpreter *)Perl_get_context()),
                  Perl_newSVuv(((PerlInterpreter *)Perl_get_context()), up.u.d)))); } while (0);
          }
          (*Perl_Istack_sp_ptr(((PerlInterpreter *)Perl_get_context()))) = sp;
          return;
      }

      You really don't see any opportunities for some radical optimisations?

      And remember, that is positively lightweight compared to the unoptimised C code produced for all of Perl's internal opcodes. Existing C compilers may be able to optimise some of that lot away on a function-by-function basis, but how much?

      And now consider the possibilities of allowing the optimiser to look at *all* the internal functions and look for really radical optimisations?

      Consider the possibilities of LTO and whole program analysis to lift whole chunks of that boiler plate above up to the runloop level on a per interpreter basis?

      Then consider the possibilities of JIT noticing that the context doesn't (cannot) change across a whole swath of runtime code and reserving a register, say one of the unused segment registers, for it and using register relative addressing for each interpreter?

      And then consider that LLVM doesn't have to follow C rules. It can invent weird stuff that C compilers (and C programmers) wouldn't even think of -- see my earlier example of it converting (at the IF level) an array of 52 shorts into a single 832-bit integer.

      What might it do with the whole SVt_* type hierarchy? Imagine (for sake of totally speculative example) that it could reduce *all* the SV flag tests & sets to a single, simple bit manipulation at some calculated bit offset into a huge integer. Is that possible? Would it result in substantial savings if used uniformly throughout the code base?

      Does any of this intrigue you? Even if only so that you can say: I told you so?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong

        What you have shown is XS code compiled using the most inefficient, backwards-compatible mode.

        If you compile it with #define PERL_NO_GET_CONTEXT at the top of the file, you'll find that all those calls to Perl_get_context() are avoided. Similarly, using perl 5.14.0 or later removes all those Perl_Istack_base_ptr()-style function calls.

        All those checks on stack size could be replaced by using a single EXTEND(24) at the start of the function.

        All those calls to create mortals could be removed by returning an array rather than a list. Etc.

        A lot of the macros designed for use by XS are less efficient (depending on circumstances) than the macros used in the perl core, due to the need to ensure backwards compatibility, or to insulate the XS code from the internal details of the core, etc.

        If you look at the actual code in the hot pp ops, you'll find it's hardly unoptimised. And if a modern C compiler can't cope with a few extra no-op {} scopes added by perl's macros, there's something wrong with it. Show me some actual assembly of a function in pp_hot.c that's being significantly crippled by the design of perl's macros, and I'll start to take it seriously.

        Other than that, everything you've talked about is wild speculation so far.

        Dave.

        +1 Inspiring! XD
      Dave,

      If the current Perl 5 interpreter has neither an AST nor bytecode representation, then what is the intermediate form created by the parser and executed by the runtime backend?

      I've heard the term "optree", is that it?

      Are we not considering the B::Bytecode system?

      Thanks,
      ~ Will
        Perl internally parses the code into a tree of OP structures (the "optree"). This is sort of like an AST without the "A" part. But by its nature it's never been designed to be easily manipulable in the way an AST is supposed to be. Also, the structure and specification of the parsed program isn't contained within the optree; it's also spread out among stashes, globs, pads etc.

        When runtime is reached, each OP contains a pointer to the next op in execution sequence (op_next), plus a pointer to a C function that can carry out the action of that op (op_ppaddr). The main execution loop of perl (the "runloop") consists of calling the op_ppaddr function of the current op, then setting the current op to be whatever that function returns; repeat until a NULL is returned.

        OPs (and the pp* functions which implement them) are unlike bytecode: bytecode consists of small, lightweight ops, with the expectation that they can be easily converted into equivalent native machine code via JIT etc. Or to put it another way, they don't contain much switching logic themselves: switching is done by adding in switching bytecodes. A tracing JIT can then follow what paths are taken through the bytecode, pick a certain path (no switching), and convert that sequence of bytecode into a sequence of machine instructions.

        Perl ops are heavyweight: within each one there may be a lot of switching action. For example, the perl "add" op examines its two args: checks if they're overloaded, or tied, or have other magic associated with them, and if so handles that. Then checks whether the args are strings or other non-num things, and if so numifies them. Then adds them. Then checks for overflow: if overflow has occurred, see if the overflow could be avoided by upgrading from integer to float, and if so, do so.

        (update: so what I meant to say at this point is that the perl optree is a bastard hybrid of an AST and bytecode, and has to serve both purposes, not always comfortably)

        B::Bytecode is just a way to serialise the optree (and associated state) into a platform neutral file. To execute it, the file is read in, and used to reconstruct the optree and state, then execution continues as before. It provides no runtime speedup, but was intended to speed startup by skipping the compilation phase: but in practice few gains were seen.

        Dave

Re^4: Perl 5 Optimizing Compiler, Part 4: LLVM Backend?
by Anonymous Monk on Aug 28, 2012 at 22:03 UTC
    You know, you can cut out the italics and use <blockquote></blockquote>; much easier to read.