http://www.perlmonks.org?node_id=990751


in reply to Perl 5 Optimizing Compiler, Part 5: A Vague Outline Emerges

Ok, so I'm going to do the "pouring cold water on the project" thing. Sorry.

My overall conclusion is that this project is completely unrealistic, in terms of the technical challenge, the expected speed improvements, the timescale, and the availability of suitably qualified labour.

12 years ago, perl was at a crossroads. The experimental B:: suite had been written, and the experience gained from that showed that achieving significant performance gains was near impossible using the existing creaking perl core. After some mug throwing, the perl6 and parrot project was born. The Perl language would be redesigned to be more optimiser-friendly (e.g. optional types), while a new parser could target a proper bytecode, executed by a modern new VM. This new infrastructure would then give us the underlying system that facilitates doing all the sexy new compilerly/execution stuff like SSA, JIT, etc.

There was lots of initial enthusiasm, lots of volunteer manpower, and timescales of a year or two bandied about (IIRC). Even at the time the estimates seemed unrealistic to me, but who was I to say? Now, 12 years later (admittedly after personality clashes and fallings out etc), we still don't have a complete, production-ready system. This isn't meant as criticism of the parrot and perl6 teams, but rather to point out just how hard it is to re-implement something as complex and subtle as the perl syntax and semantics. Remember, just the lexer in perl5 is 12,000 lines of C code.

So I think we're about to repeat the same perl6 exercise (with admittedly more modest goals), with the same under-appreciation of its difficulty.

So, that was a general overview. Now onto more specifics. First, LLVM. I like LLVM, I think it's a cool system. If I was going to write perl from scratch now, I would very seriously consider using its IR as the target of the compiler. The issue here though, is how easy it would be to retro-actively make perl use LLVM in some way or another. I think you get a sliding scale, with the achievable bits giving small gains, and large gains only (possibly) possible by huge changes to the existing perl core: basically rewriting large parts of it (i.e. a project of parrot/perl6 scale).

There seem to be three ways suggested of using LLVM. First, there's the trivial sense of using it as a better C compiler. Anecdotal evidence indicates that it might give you 5-10% speedup. But that's not really what we're discussing here.

Secondly, there's Yuval's suggestion, which in its most basic (and thus 'doable' form) is to basically change the API of the pp_* functions to allow the runops loop to to be unrolled, and avoid the overhead of perl stack manipulation. It may also then allow LLVM to better optimise the resulting call tree, e.g. hefting stuff up to a higher level.

Looking at the purely loop unrolling / stack side of things, I did a rough calculation to see what improvement could be expected. I ran the following perl code: $z = $x + $y * 3; ($x and $y contain ints) in a loop under cachegrind, and used the results to estimate approximately what proportion of the execution is spent on loop/stack overhead. The answer was approx 20%. Note that the example above was chosen specifically to use simple perl ops where the overhead is most likely to dominate. Perl has a lot of heavyweight ops like pp_match(), where the relative overhead is going to be much smaller. Also, the LLVM verison will have an overhead of its own for setting up args and the current op, just hopefully less than the current overhead. So that means that overall, the loop/stack stuff is going to give us something less than 20% speedup (and probably a lot less in practice, unless all your perl code does is lots of lightweight stuff like additions). Set against this there might be improvements from LLVM being able to compile the resulting code better, but my speculation is that it won't be significant.

Yuval then listed some more speculative improvements that could then be built upon that basic stuff, but I still think the performance gains will be relatively modest. The advantage of the Yuval approach is that it should be achievable by relatively little effort and by relatively modest changes to the existing perl source.

The third thing to do with LLVM, (which is what I think BrowserUK is advocating, but I may be wrong), is the wholesale replacement of perl's current runtime, making perl's parser convert the op tree to LLVM IR "bytecode", and thus making a perl a first-class Perl-to-LLVM compiler, in the same way that clang is a first-class C-to-LLVM compiler. This would be a massive undertaking, involving understanding all the subtleties and intricacies of 25,000 lines of pp_* functions, and writing code that will emit equivalent IR - even assuming that you kept the rest of perl the same (i.e. still used SVs, HVs, the context stack, pads etc).

If you were to go the full hog and replace the latter with something "better", you're then into full perl6/parrot territory. You're at the point of rewriting most of perl from scratch: the point that I said earlier I might have started from if had to do over perl again from scratch. But I think that unless you wholesale throw away SVs, pads, etc, you're not going to see game-shifting performance gains. So I think this option can be summed up as "yes,I would have been nice if perl had originally been written to target LLVM, but wasn't, and the cost of retro-fitting is prohibitive".

Now we get onto "writing a perl compiler that targets something like javascript, which will be fast, because lots of money is being spent making javascript fast".

I see two main issues with this. Firstly, no one is ever going to write a perl parser that can fully parse perl and handle all it's subtleties. The best you can do is to parse the easy 90% and ignore the edge cases. But "Ah", I hear you think, "90% is good enough for most people. My code doesn't use ties or overloading". The problem is that within the 10% not covered, there will be 1% that your code does in fact use. "Your code uses /(?{})/? That's a shame.". If you're lucky, you'll get a compile-time error telling that feature X is unsupported. If you're unlucky (and you will be unlucky) your code will silently do the wrong thing due to some subtlety of auto-vivification, localization or stash manipulation or whatever.

Secondly, any impedance mismatch between perl and javascript is going to give you agonisingly slow performance. For example, if the full semantics or perl hashes can be provided by javascript dictionaries or objects say, then you can directly translate $foo{bar} into foo.bar or whatever. If however, the javascript facilities aren't suitable, then you might end up having to implement a Perl-style hash using javascript code, which is going to be slow. Also what tends to happen in these sorts of conversions is that the early proof-of-concept work (which uses a javascript dict say) works well and is really, really fast. Then you reach the point where you've done 50% of the work and its going really well, Then you get round to implementing the 'each' op, and suddenly realise that it can't be done using a javascript dict. So you switch to the slow route. NB: the hash is just a hypothetical example, which may or may not be a problem in javascript. The real point is that there are lots and lots of things in perl which have odd semantics, that can be superficially implemented in the "fast" way, but which may have to switch to a slow approach once a wider subset of its semantics are implemented.

That's the end of my rants about specific suggestions. I'll just add a few final general comments.

No matter what fancy tools or rewrites you use, you're always going to have some theoretical constraints on performance due to the nature of the perl language. For example, method dispatch in perl is always going to be cumbersome and un-optimisible, due to the way dispatch is based on point-in-time lookup in a stash. Yes, you can do clever tricks like caching, but perl already does this. There may be further tricks no-one's got round to thinking of yet (or at least not implementing yet), but such tricks are likely to be as applicable to the current perl code as to any LLVM or javascript variant. This really boils down to the fact that to see really significant performance changes, you have to change the perl language itself. Add a new type system, object system, or whatever.

The general experience so far of doing "clever stuff" in other languages like Python, often with considerable funding and resources, has been a catalogue or either failure, abandoned projects, or disappointing speedups. It doesn't necessarily follow from it that no-one should ever attempt such a thing again, but it does indicate that doing this sort of thing (and getting it right) is really hard. And the general impression I get from reading up about what "clever stuff" people have already been trying with perl (e.g. Yuval and LLVM), is that major performance improvements weren't the main consideration for the project.

Note that this is the last contribution I intend to make on this topic for the moment; I've already spent waaaay too much time reading up and discussing it.

PS: For anyone familiar with the "Big Talk" sketch in the UK comedy series "That Mitchell and Webb Look", I feel that this whole discussion has been rather similar to its "Come on, boffins!" ethos.

Replies are listed 'Best First'.
Re^2: Perl 5 Optimizing Compiler, Part 5: A Vague Outline Emerges
by BrowserUk (Patriarch) on Aug 30, 2012 at 15:59 UTC
    The third thing to do with LLVM, (which is what I think BrowserUK is advocating, but I may be wrong), is the wholesale replacement of perl's current runtime, making perl's parser convert the op tree to LLVM IR "bytecode", and thus making a perl a first-class Perl-to-LLVM compiler, in the same way that clang is a first-class C-to-LLVM compiler. This would be a massive undertaking, involving understanding all the subtleties and intricacies of 25,000 lines of pp_* functions, and writing code that will emit equivalent IR - even assuming that you kept the rest of perl the same (i.e. still used SVs, HVs, the context stack, pads etc).

    You're about 1/3rd of the way to what I was trying to suggest as a possibility. I'm going to try again. I hope you have the patience to read it. I'm going to start with an unrealistic scenario for simplicity and try to fill in the gaps later.

    Starting with a syntactically correct perl source that is entirely self-contained -- uses no modules or pragmas; no evals; no runtime code generation of any kind -- there are (notionally, no matter how hard it is linguistically to describe them; or practically to separate them ), three parts involved in the running of that program:

    1. The parsing of the source code and construction of the perl internal form -- call it a tree or graph; bytecode or opcodes -- for want of a term and some short-hand, the Perl Code Graph (PCG).

      Part 1 cannot be changed. it *is* Perl. So segregate it (I know; I know) out into a separately compiled and linked, native code unit.

      A dll (loaded by the minimal perl.exe much as perl5.x.dll is today), that reads the source file and builds exactly whatever it builds now, and then gets the hell out of dodge, leaving the PCG behind in memory.

    2. The interpreter proper -- the runloop -- that processes the PCG and dispatches to the Perl runtime (PRT).

      Moved below, because I need you to understand the context above and below, before the description of this middle bit will make sense.

    3. The Perl runtime -- the functions that do the actual work.

      Part 3 is very hard to re-code, as much of the behavioral semantics of perl is encapsulated entirely within it.

      So, give the whole kit&caboodle -- all the pp_* source code and dependencies -- to LLVM using its C front end, to process into LLVM intermediate form (IF), and then pass that through the various IF optimising stages until it can do no more, and then write it in its optimised IF form to a file (PRT.bc).

      This process is done once (for each release) by "the devs". The optimised PRT.bc file is platform independant and can be distributed as part of the build -- at the risk of the hackles it will raise including mine -- a bit like MSCRT.dll, but platform independent.

      This single binary file contains all the 'dispatched to' functions and their dependencies, pre-optimised as far as that can go, but still in portable IF form.

    Part 2. The only new code that needs to be written. But even this already exists in the form of -MO-Deparse.

    New code is adapted from Deparse. It processes the PCG in the normal way, but instead of (re)generating the Perl source code, it generates a LLVM IF "copy" of the PCG it is given. Let's call that the LLCG.

    The LLCG is now the program we started with, but in a platform independent, optimisible (also platform independent) form that can be

    • Saved to disk in any of LLVMs IF forms -- text or binary -- and passed to other platforms, or reloaded from disk at a later stage.
    • Or it can be passed directly to the LLVMI (IF interpreter) along with PRT.bc, to be interpreted immediately.
    • Or it can be passed, along with PRT.bc, to the LLVM JIT, and it can generate the native platform code, with optimisations, that is then executed.

    I hope that is clearer than my previous attempts at description.

    • All of the above is possible.
    • None of it requires starting from scratch.
    • None of it means changing Perl's syntax or discarding any of perl's features.
    • It doesn't even require the discard of the existing Perl runloop or runtime.

      Distribute the PRT in fully linked binary form (ie. perl5.x.dll/.so, albeit with some of its current contents split out), and you effectively have bog standard perl.

      It would require a command line switch to invoke the LLVM stuff.

      Very little new code is needed. Essentially, just the generation of the LLCG from the PCG, and half of that already exists.

    • It is a very low-risk, low-initial effort strategy.

    I'm fully aware that perl frequently reinvokes the parser and runloop in the process of compiling the source of a program, in order to deal with used modules and pragmas and BEGIN/CHECK/UNITCHECK/INIT/END blocks. Effectively, each alternation or recursion would be processed the same way as the above standalone program. If the module has previously been save in .bc form, the parsing and PCG->LLCG conversion can be skipped.

    The first step, and perhaps the hardest part getting started, would be the re-engineering the existing build process -- and a little tweaking of the source files -- to break apart the code needed for parts 1 & 2 from part 3, so they can be built into separate dlls -- and the latter into PRT.bc. This process may result in some duplication as perl tends to use internally some of the same stuff that it provides to Perl programs as runtime.

    These modifications to the build process and splitting out of the parser/PCG generation from the runtime could be done and used by the next release (or the one after that) of the existing Perl distribution. without compromising it.

    It would not be trivial and it would require some one with excellent knowledge of both the internals and the build process -- ie. YOU! -- but it wouldn't be a huge job, and it needn't be a throwaway if all the rest failed or went nowhere. It might even benefit the existing code base and build system in its own right.

    I'm done. If that fails to clarify or persuade, so be it. I'll respond to direct questions should there be any, but no more attempts to change anyones mind :)

    In the unlikely event you read to here, thank you for your time and courtesy.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      I understand that strategy, and it's probably easier to start to see results than most other approaches, but keep in mind a few drawbacks:

      • You have to compile every part of the system to LLVM bitcode, including the Perl core, any modules you use, and every XS component. This is cacheable, but you still have to do it.
      • Before running a program, you have to link all of the bitcode together into a single image. This image will be large.
      • Before running a program, you really want to run as many of LLVM's optimization stages as possible. This takes time and memory.
      • You may be able to emit native code which represents only that program and execute that. Essentially you're replicating what PAR does, with all of the caveats about program analysis and requirement discoverability.

      I expect the resulting binaries to be huge. I expect the link and optimization steps to take a long time. I expect you'll have to keep the Perl 5 interpreter around anyway because there are things you just can't do without either rewriting the language (to get around the BEGIN/import symbol declaration dance, for example. I don't know if you have to include any LLVM runtime components.

      I can imagine that you can optimize some programs with this approach, but I don't know that the intrinsic overhead in the Perl 5 VM you get is more than 10%. Maybe link-time optimization can cut out another 10%. Part of that is the inherent flexibility of Perl 5's design, and part of that is that LLVM is at heart a really good compiler for languages that act like C++.

        I'm not sure there is a direct question in there for me to respond to, so all I'll say is that it is a minimal up-front effort strategy that puts the hooks in place that would allow some real measurements to be taken.

        And even that minimal effort needn't be discarded if the LLVM experiments fail, because it leaves an intact perl distribution that works just as efficiently and effectively as it does now, with the only difference being that the compiled perl executable comes in 3 parts rather than the current two. (In windows terms: perl.exe dynalinked to perl5.x.part1&2.dll & perl5.x.part3.dll)

        And that split might even benefit the normal distribution in some way.

        Once the hooks were in there, the ability to pass the PCG to some other dll, module or process, might allow other investigations to move forward.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong

        /code
      (Damn, I said i wouldn't be responding further...)

      New code is adapted from Deparse. It processes the PCG in the normal way, but instead of (re)generating the Perl source code, it generates a LLVM IF "copy" of the PCG it is given. Let's call that the LLCG
      This is the bit I don't currently get. take a piece of code like
      $m = ($s =~ /foo/);

      Which is compiled to an optree that looks like:

      7 <@> leave[1 ref] vKP/REFC ->(end) 1 <0> enter ->2 2 <;> nextstate(main 1 p:3) v:{ ->3 6 <2> sassign vKS/2 ->7 4 </> match(/"foo"/) sKPS/RTIME ->5 - <1> ex-rv2sv sK/1 ->4 3 <#> gvsv[*s] s ->4 - <1> ex-rv2sv sKRM*/1 ->6 5 <#> gvsv[*m] s ->6
      In general terms, what would the IR look like that you would convert that into?

      Dave.

        A first reaction (I'm getting punchy at this point in time.). If you use LLVM as just an alternative C compiler, then as a part of the process of compiling perl -- unchanged -- it will compile whatever code/functions/source file(s) that constitute the current "runloop" (ostensibly runops_standard in run.c).

        One of the possible variations of using LLVM in this mode, is that it (clang) can output .bc (bitcode) files, that can later be linked together -- using the LLVM linker -- to produce a native executable. What's more, is that the LLVM linker is quite happy to accept some "object files" that are in .bc format, and some that are in the normal .obj/.o format, and link them together and produce a (normal) platform dependent executable.

        It is also possible, to have clang produce LLVM IF in text form.

        So, in theory, if we ran (something like; there's a lot of documentation) clang --emit-text run.c -O run.o and then inspected run.o in a text editor, it would tell us exactly what the IF looks like for that source file.

        And that IF, would (in its binary form), be combinable -- with all the other normal object files produced using gcc or cl.exe -- using the LLVM linker, to produce a working, native compiled executable.

        That is not a direct answer to your question, but the point is that (as a starting point), it is possible to build a working executable, by substituting any individual clang-compiled-to-bitcode-source-file, for the native compiled objct file from that source, and combine it with all the other GGC/CL produced object files, and the LLVM linker will happily combine them into into a native executable.

        Thus, to see what the LLVM IF look like, for any given source file, you only need to use clang to compile that individual source file to its text representation. You don't gain any performance, but you do get to see what LLVM IF looks like.

        I'll attempt to get back to you with a specific answer to your question, but given that my LLVM installation if 2 years old, and my primary perl installation about the same, It'll take a couple of days to get caught up.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong

        Okay. I compiled run.c -> run.e using the cl /E. It produced 30,000 lines of post-precprocessed C. Most of which is unrelated to Perl having been pulled in from a crap load of OS header files. (I won't post it here, it's too big and entirely uninteresting anyway.)

        I then threw that file at clang -- all 30,000 lines of it -- and asked it to convert it to LLVM assembler with no optimisation. It took most of the day cleaning up stuff that clang is really pedantic about -- it deosn't allow duplicate typedefs, even if they are identical except whitespace; it doesn't like MSC source code annotations or accept most of their pragmas; and it doesn't like prototypes without ; on the end ... and there were 1000s of them produced from the perl headers -- but 9 hours work and I got there.

        It produced these 572 lines:

        I then asked it to optimise it. This time it produced just these 363 lines:

        Now looking at that, I can see that it has built data descriptions for a crapload of Windows internal data structures, so I've manually (and conservatively) removed anything that I don't think are used by Perl. (I could have done this at the /e stage, but it was *much* easier reading 700 lines that 30000 lines :)

        What I've ended up with is these 112 lines of IR:

        ; ModuleID = '/tmp/webcompile/_9304_0.bc' target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i6 +4:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f +80:128:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" %struct._TP_CALLBACK_ENVIRON = type { i64, %struct._TP_POOL*, %struct. +_TP_CLEANUP_GROUP*, void (i8*, i8*)*, i8*, %struct._ACTIVATION_CONTEX +T*, void (%struct._TP_CALLBACK_INSTANCE*, i8*)*, %union.anon } %struct._TP_POOL = type opaque %struct._TP_CLEANUP_GROUP = type opaque %struct._ACTIVATION_CONTEXT = type opaque %struct._TP_CALLBACK_INSTANCE = type opaque %union.anon = type { i64 } %struct.interpreter = type { %struct.sv**, %struct.op*, %struct.sv**, +%struct.sv**, %struct.sv**, i64*, i8**, i64, i64, %union.any*, i64, i +64, %struct.sv**, i64, i64, i64, i64, i64*, i64*, i64*, %struct.sv*, +%struct.xpv*, i64, %struct._stat64, %struct._stat64, %struct.gv*, %st +ruct.sv*, %struct.tms, %struct.pmop*, %struct.sv*, %struct.gv*, %stru +ct.gv*, %struct.gv*, i8*, %struct.sv*, %struct.sv*, %struct.sv*, %str +uct.hv*, %struct.hv*, %struct.op*, %struct.jmpenv*, %struct.cop*, %st +ruct.av*, %struct.stackinfo*, %struct.av*, %struct.jmpenv*, %struct.j +mpenv, %struct.sv*, %struct.he*, %struct.op*, %struct.op*, %struct.hv +*, %struct.gv*, %struct.gv*, i8*, i64, i64*, i64*, %struct.sv*, %stru +ct.re_save_state, %struct.regnode, i16, i8, i8, [6 x i8*], void (%str +uct.interpreter*, %struct.op*)*, void (%struct.interpreter*, %struct. +op*)*, void (%struct.interpreter*, %struct.op*)*, i64, i64, i8**, i8* +, %struct.regmatch_slab*, %struct.regmatch_state*, i16, i8, i8, i8, i +8, i32, i8, i64, i32, i8**, %struct.gv*, %struct.gv*, %struct.gv*, i8 +*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, i8**, i8*, i8, + i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8*, %struct +.sv*, i64, %struct.sv*, i64, i64, i64, i32, i32*, %struct.gv*, %struc +t.gv*, %struct.gv*, %struct.gv*, %struct.gv*, %struct.av*, %struct.gv +*, %struct.gv*, %struct.gv*, %struct.gv*, %struct.gv*, %struct.sv*, % +struct.sv*, %struct.sv*, %struct.av*, %struct.hv*, %struct.hv*, %stru +ct.sv*, %struct.av*, %struct.av*, %struct.av*, %struct.av*, %struct.a +v*, %struct.hv*, i64, i32, i64, i64, %struct.sv*, %struct.sv*, %struc +t.av*, i8*, %struct.cv*, %struct.op*, %struct.op*, %struct.op*, %stru +ct.op*, %struct.cop*, i32, i32, i8*, i8**, i8*, %struct.av*, %struct. +sv*, %struct.sv*, i64, i8, i8, i16, i32, i64, %struct.exitlistentry*, + %struct.hv*, i64*, %struct.cop, %struct.cv*, %struct.av*, %struct.av +*, i64, i64, %struct.interp_intern, %struct.cv*, i32, i8, i8, i8, i8, + i64, i64, i64, i64, i64, i64, i64, i64, i8**, i8*, void (i32)*, [16 +x i8*], i64, i32, {}*, %struct.sv, %struct.sv, %struct.sv, %struct.sv +*, i64, i64, i64, i64, i64, i64, i64, i64, i64, i8*, i64, i64, i64, i +8, i8, i8, i8, i8*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv +*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, % +struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %stru +ct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.s +v*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, +%struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %struct.sv*, %str +uct.hv*, i8*, i64, [10 x i8], i8, i8, i32, %struct.yy_parser*, %struc +t.sv**, %struct.sv**, %struct.ptr_tbl*, %struct.av*, i8*, %struct.sv* +, %struct.sv**, %struct.av*, %struct.REENTR*, %struct.hv*, %struct.hv +*, %struct._PerlIO*, %struct.PerlIO_list_s*, %struct.PerlIO_list_s*, +%struct.sv*, %struct.perl_debug_pad, %struct.sv*, %struct.sv*, %struc +t.sv*, %struct.sv*, i64 (%struct.interpreter*, %struct.sv*, %struct.s +v*)*, %struct.av*, %struct.av*, i64, i64, i32, %struct.hv*, void (%st +ruct.interpreter*, %struct.sv*)*, void (%struct.interpreter*, %struct +.sv*)*, void (%struct.interpreter*, %struct.sv*)*, {}*, void (%struct +.interpreter*)*, i64, i64, %struct.hv*, i32, i8**, i8 (%struct.interp +reter*, %struct.sv*)*, %struct.hv*, %struct.av*, %struct.hv*, %struct +.hv*, %struct.hv* } %struct.sv = type { i8*, i64, i64, %union.anon.0 } %union.anon.0 = type { i8* } %struct.op = type { %struct.op*, %struct.op*, %struct.op* (%struct.int +erpreter*)*, i64, [2 x i8], i8, i8 } %union.any = type { i8* } %struct.xpv = type { %struct.hv*, %union._xmgu, i64, i64 } %struct.hv = type { %struct.xpvhv*, i64, i64, %union.anon.3 } %struct.xpvhv = type { %struct.hv*, %union._xmgu, i64, i64 } %union._xmgu = type { %struct.magic* } %struct.magic = type { %struct.magic*, %struct.mgvtbl*, i16, i8, i8, i +64, %struct.sv*, i8* } %struct.mgvtbl = type { i32 (%struct.interpreter*, %struct.sv*, %struc +t.magic*)*, i32 (%struct.interpreter*, %struct.sv*, %struct.magic*)*, + i64 (%struct.interpreter*, %struct.sv*, %struct.magic*)*, i32 (%stru +ct.interpreter*, %struct.sv*, %struct.magic*)*, i32 (%struct.interpre +ter*, %struct.sv*, %struct.magic*)*, i32 (%struct.interpreter*, %stru +ct.sv*, %struct.magic*, %struct.sv*, i8*, i64)*, i32 (%struct.interpr +eter*, %struct.magic*, %struct.clone_params*)*, i32 (%struct.interpre +ter*, %struct.sv*, %struct.magic*)* } %struct.clone_params = type { %struct.av*, i64, %struct.interpreter*, +%struct.interpreter*, %struct.av* } %struct.av = type { %struct.xpvav*, i64, i64, %union.anon.2 } %struct.xpvav = type { %struct.hv*, %union._xmgu, i64, i64, %struct.sv +** } %union.anon.2 = type { i8* } %union.anon.3 = type { i8* } %struct._stat64 = type { i32, i16, i16, i16, i16, i16, i32, i64, i64, +i64, i64 } %struct.gv = type { %struct.xpvgv*, i64, i64, %union.anon.7 } %struct.xpvgv = type { %struct.hv*, %union._xmgu, i64, i64, %union._xi +vu, %union._xnvu } %union._xivu = type { i64 } %union._xnvu = type { %struct.anon.5 } %struct.anon.5 = type { i64, i64 } %union.anon.7 = type { i8* } %struct.tms = type { i64, i64, i64, i64 } %struct.pmop = type { %struct.op*, %struct.op*, %struct.op* (%struct.i +nterpreter*)*, i64, [2 x i8], i8, i8, %struct.op*, %struct.op*, i64, +i64, %union.anon.12, %union.anon.13 } %union.anon.12 = type { %struct.op* } %union.anon.13 = type { %struct.op* } %struct.jmpenv = type { %struct.jmpenv*, [16 x i32], i32, i8 } %struct.cop = type { %struct.op*, %struct.op*, %struct.op* (%struct.in +terpreter*)*, i64, [2 x i8], i8, i8, i64, i8*, i8*, i64, i64, i64*, % +struct.refcounted_he* } %struct.refcounted_he = type opaque %struct.stackinfo = type { %struct.av*, %struct.context*, %struct.stac +kinfo*, %struct.stackinfo*, i64, i64, i64, i64 } %struct.context = type { %union.anon.14 } %union.anon.14 = type { %struct.block } %struct.block = type { i8, i8, i16, i64, %struct.cop*, i64, i64, %stru +ct.pmop*, %union.anon.15 } %union.anon.15 = type { %struct.block_sub } %struct.block_sub = type { %struct.op*, %struct.cv*, %struct.av*, %str +uct.av*, i64, %struct.av* } %struct.cv = type { %struct.xpvcv*, i64, i64, %union.anon.11 } %struct.xpvcv = type { %struct.hv*, %union._xmgu, i64, i64, %struct.hv +*, %union.anon.9, %union.anon.10, %struct.gv*, i8*, %struct.av*, %str +uct.cv*, i64, i16, i64 } %union.anon.9 = type { %struct.op* } %union.anon.10 = type { %struct.op* } %union.anon.11 = type { i8* } %struct.he = type { %struct.he*, %struct.hek*, %union.anon.1 } %struct.hek = type { i64, i64, [1 x i8] } %union.anon.1 = type { %struct.sv* } %struct.re_save_state = type { i64, i64, i64, i8, i8*, i8*, i8*, %stru +ct.regexp_paren_pair*, i64*, i64*, i8**, %struct.magic*, %struct.pmop +*, %struct.pmop*, i8*, i64, i64, i64, i64, i64, i64, i8*, i8* } %struct.regexp_paren_pair = type { i64, i64 } %struct.regnode = type { i8, i8, i16 } %struct.regmatch_slab = type { [42 x %struct.regmatch_state], %struct. +regmatch_slab*, %struct.regmatch_slab* } %struct.regmatch_state = type { i32, i8*, %union.anon.22 } %union.anon.22 = type { %struct.anon.26 } %struct.anon.26 = type { %struct.regmatch_state*, i64, i64, i64, i16*, + %struct.regnode*, %struct.regnode*, i8*, i64, i16, i16, i8 } %struct.exitlistentry = type { void (%struct.interpreter*, i8*)*, i8* +} %struct.interp_intern = type { i8*, i8**, i64, %struct.av*, %struct.ch +ild_tab*, i64, %struct.pseudo_child_tab*, i8*, %struct.thread_intern, + %struct.HWND__*, i32, i32, [27 x void (i32)*] } %struct.child_tab = type { i64, [64 x i64], [64 x i8*] } %struct.pseudo_child_tab = type { i64, [64 x i64], [64 x i8*], [64 x % +struct.HWND__*], [64 x i8] } %struct.HWND__ = type { i32 } %struct.thread_intern = type { [512 x i8], %struct.servent, [128 x i8] +, i32, [30 x i8], i32, i16 } %struct.servent = type { i8*, i8**, i16, i8* } %struct.yy_parser = type { %struct.yy_parser*, %union.YYSTYPE, i32, i3 +2, i32, i32, %struct.yy_stack_frame*, %struct.yy_stack_frame*, i64, i +64, i8*, i8*, i8, i8, i8, i8, i64, %struct.op*, %struct.op*, %struct. +sv*, i16, i16, i64, %struct.sv*, i64, i64, i8, i8, i8, i8, i64, %stru +ct._sublex_info, %struct.sv*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i64, + i16, i8, i8, %struct.hv*, %struct._PerlIO**, %struct.av*, [5 x %unio +n.YYSTYPE], [5 x i64], i64, %struct.cop*, [256 x i8], i8, i8 } %union.YYSTYPE = type { i64 } %struct.yy_stack_frame = type { %union.YYSTYPE, i16, i64, %struct.cv* +} %struct._sublex_info = type { i8, i16, %struct.op*, i8*, i8* } %struct._PerlIO = type opaque %struct.ptr_tbl = type { %struct.ptr_tbl_ent**, i64, i64, %struct.ptr_ +tbl_arena*, %struct.ptr_tbl_ent*, %struct.ptr_tbl_ent* } %struct.ptr_tbl_ent = type { %struct.ptr_tbl_ent*, i8*, i8* } %struct.ptr_tbl_arena = type opaque %struct.REENTR = type { i32 } %struct.PerlIO_list_s = type opaque %struct.perl_debug_pad = type { [3 x %struct.sv] } define i64 @HEAP_MAKE_TAG_FLAGS(i64 %TagBase, i64 %Tag) nounwind uwtab +le readnone { %1 = shl i64 %Tag, 18 %2 = add i64 %1, %TagBase ret i64 %2 } define i32 @Perl_runops_standard(%struct.interpreter* %my_perl) nounwi +nd uwtable { %1 = getelementptr inbounds %struct.interpreter* %my_perl, i64 0, i3 +2 1 %2 = load %struct.op** %1, align 8, !tbaa !3 br label %3 ; <label>:3 ; preds = %3, %0 %op.0 = phi %struct.op* [ %2, %0 ], [ %6, %3 ] %4 = getelementptr inbounds %struct.op* %op.0, i64 0, i32 2 %5 = load %struct.op* (%struct.interpreter*)** %4, align 8, !tbaa !3 %6 = tail call %struct.op* %5(%struct.interpreter* %my_perl) nounwin +d store %struct.op* %6, %struct.op** %1, align 8, !tbaa !3 %7 = icmp eq %struct.op* %6, null br i1 %7, label %8, label %3 ; <label>:8 ; preds = %3 %9 = getelementptr inbounds %struct.interpreter* %my_perl, i64 0, i3 +2 78 store i8 0, i8* %9, align 1, !tbaa !0 ret i32 0 } !0 = metadata !{metadata !"omnipotent char", metadata !1} !1 = metadata !{metadata !"Simple C/C++ TBAA", null} !2 = metadata !{metadata !"long", metadata !0} !3 = metadata !{metadata !"any pointer", metadata !0} !4 = metadata !{metadata !"long long", metadata !0}

        Take a close look at the data definitions for Interpreter, SV_any, HEK, COP etc. Isn't that the most concise (and thorough) description of the entire perl internals you've ever seen?

        Doesn't it nake you wonder (just a little), what could it do with all that information?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        [http://thebottomline.cpaaustra

        Oh. And finally, I asked it to convert the optimised form and output it as C++. The interesting bits are at the top and the very bottom; there is a load of Windows API stuff that I can't be bothered to delete: It seems at 3000+ lines, it is too big to post here, but man is it ever interesting.

        Not that I'd want to maintain it in that form; but can you imagine converting all the Perl sources into C++, loading it up into a new respository and annotating it and then using that to construct a properly object-oriented Perl5 source tree?

        Proper inheritance of the SVt_types; a full cleanup of all the cruft that has built up over the years; starting anew -- from a clean, compilable and working, if weirdly structured and named -- codebase and going forward from there?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong

        .
Re^2: Perl 5 Optimizing Compiler, Part 5: A Vague Outline Emerges
by moritz (Cardinal) on Aug 31, 2012 at 11:16 UTC
    Secondly, any impedance mismatch between perl and javascript is going to give you agonisingly slow performance. For example, if the full semantics or perl hashes can be provided by javascript dictionaries or objects say, then you can directly translate $foo{bar} into foo.bar or whatever. If however, the javascript facilities aren't suitable, then you might end up having to implement a Perl-style hash using javascript code, which is going to be slow. Also what tends to happen in these sorts of conversions is that the early proof-of-concept work (which uses a javascript dict say) works well and is really, really fast. Then you reach the point where you've done 50% of the work and its going really well, Then you get round to implementing the 'each' op, and suddenly realise that it can't be done using a javascript dict. So you switch to the slow route

    This rings so true.

    I've seen the same thing several times in Perl 6 compilers that targeted other high-level languages (for example kp6 comes to mind, which had a Perl 5 backend). It became so slow that hacking on it wasn't fun anymore, so people stopped hacking on it. (I think perlito descended from kp6 though).

    Rakudo had the same problem, parrot's object system didn't fit it. So it had a huge rewrite switching to a custom object system, which only worked because much of it could be written in C. If it had to be done on top of parrot primitives, it could have never worked with decent speed.

    And the problem is, there are so many fundamental operations that have subtle differences between both Perl 5 and Perl 6 and possible target languages like Javascript: routine invocation (think of @_ elements being aliases), method dispatch, hash and array access (think of autovivification, or what hashes return in scalar context in Perl 5). You might be able to take the speed hit from adapting one of them the right semantics, but all of them put together will kill you.

    I am now convinced that targeting a high-level language like javascript isn't going to result in a speedup. I hope somebody proves me wrong eventually.

      I am now convinced that targeting a high-level language like javascript isn't going to result in a speedup.

      FWIW: I reached a similar conclusion.

      Which is why I think the only game in town is (a) Low Level Virtual Machine.

      I'm not (yet) convinced that it can work its magic on Perl5; but I think it would be (have been?) the best possible underpinning for Perl6.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong

      p align=right
      moritz: flavio glock's perlito already was able to parse and compile a fast subset of perl and execute it on nodejs (javascript) 3x faster than perl itself, and also with sbcl (a fast common lisp).
        ... a fast subset of perl...

        The same thing came up in Perl 6 development more times than I could remember. Someone would show up and say "I have a very basic Perl 6 implementation on $virtual_machine that's n times faster than $other_implementation!" and it turns out they had implemented a calculator which did integer math.

        No disrespect to Flavio or Perlito, but "fast subset" only starts to get interesting when that "subset" is more set than sub.

Re^2: Perl 5 Optimizing Compiler, Part 5: A Vague Outline Emerges
by bulk88 (Priest) on Aug 30, 2012 at 20:26 UTC
    Looking at the purely loop unrolling / stack side of things, I did a rough calculation to see what improvement could be expected. I ran the following perl code: $z = $x + $y * 3; ($x and $y contain ints) in a loop under cachegrind, and used the results to estimate approximately what proportion of the execution is spent on loop/stack overhead. The answer was approx 20%. Note that the example above was chosen specifically to use simple perl ops where the overhead is most likely to dominate. Perl has a lot of heavyweight ops like pp_match(), where the relative overhead is going to be much smaller. Also, the LLVM verison will have an overhead of its own for setting up args and the current op, just hopefully less than the current overhead. So that means that overall, the loop/stack stuff is going to give us something less than 20% speedup (and probably a lot less in practice, unless all your perl code does is lots of lightweight stuff like additions). Set against this there might be improvements from LLVM being able to compile the resulting code better, but my speculation is that it won't be significant.
    You are comparing the legacy Perl 5's performance with legacy Perl 5. I dont think Will wants to just recompile Perl 5 the C program with Clang or write yet another B::C.

    The third thing to do with LLVM, (which is what I think BrowserUK is advocating, but I may be wrong), is the wholesale replacement of perl's current runtime, making perl's parser convert the op tree to LLVM IR "bytecode", and thus making a perl a first-class Perl-to-LLVM compiler, in the same way that clang is a first-class C-to-LLVM compiler. This would be a massive undertaking, involving understanding all the subtleties and intricacies of 25,000 lines of pp_* functions, and writing code that will emit equivalent IR - even assuming that you kept the rest of perl the same (i.e. still used SVs, HVs, the context stack, pads etc).

    Why keep the pp_* and SV/contexts/pads if they aren't needed? Some CVs can be determined to be optimizable by SCA in "LLVM Perl". If a CV can't be optimized (eval string exists), the CV stays as slow "legacy" CVs. Easy things get faster, hard things stay slow. The point is to optimize "typical" Perl code, not optimize the performance of JAPHs.
      You are comparing the legacy Perl 5's performance with legacy Perl 5. I dont think Will wants to just recompile Perl 5 the C program with Clang or write yet another B::C.
      No, I am specifically trying to evaluate the effect of the "basic" Yuval approach of JIT converting a CV's ops tree into a list of calls to modified pp_* functions, that then get compiled to IR. That approach seemed to be what Will was advocating. Certainly it was the starting point for the LLVM discussion.

      Dave

Re^2: Perl 5 Optimizing Compiler, Part 5: A Vague Outline Emerges
by Anonymous Monk on Aug 30, 2012 at 13:58 UTC

    Ok, so I'm going to do the "pouring cold water on the project" thing. Sorry.

    Please put on pants!