http://www.perlmonks.org?node_id=1066499


in reply to sourcefilter with complete parser?

LanX:

I've worked on several translation systems. The first converted programs from Z80 assembler on a Zilog development system to 8086 code, including the operating system calls. Most of that was trivial, as they're very compatible at the source code level. We also had translations for operating system calls, and recognition for various features. We fully automated it so that development/maintenance could continue on the original MC/Z system.

It's tremendously fun to work on a project like that. It's amazing how simple and straightforward the majority of the translation task is. Then as you get into the dusty corners, you find yourself doing much more work to get the next little bit done. In a project like this, I've found that the most common 80% of the syntax takes about 50% of the time to implement. The next 10% takes about 75% of the project time. Then the next 5% takes about 75% more of the project time, and the final 5% takes the remaining 100% of the project time. That's why the second translator, a COBOL to C translation system, wasn't completed. It looked like it was going to take well over the 330% of time allotted to the project. ;^) We did a proof-of-concept system pretty quickly. COBOL's variables were simple enough that variable management wasn't terribly difficult.

So the project started, and we moved along quickly, for a while. But the PERFORM statement (function/subroutine call) in COBOL is an ugly beast. For example, one variation is "PERFORM <A> THROUGH <X>", which executes paragraphs <A> through <X>, including all the intermediates, and then returns. So we backed out of the paragraph <==> function mapping, and instead coded things into a large straight-line function. Then we had to add a funky data structure to the top of the stack telling us what was expected. At the end of each paragraph, if it was a simple "PERFORM <A>" statement, it would return when we reached the end of that paragraph. If it was the "PERFORM <A> THROUGH <X>" version, it would ignore the paragraph boundaries until it reached the end of paragraph <X>.
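To make that concrete, here's a minimal sketch (invented for this reply, in JavaScript rather than the translator's actual C output; all names are made up) of flattening paragraphs into one straight-line table, with a "stop after" marker standing in for the stack entry described above:

```javascript
// Hypothetical sketch: paragraphs become entries in an ordered table,
// and a stop marker tells the dispatcher where to "return" to the caller.
const out = [];
const paragraphs = [
  ["A", () => out.push("A")],
  ["B", () => out.push("B")],
  ["X", () => out.push("X")],
];

// PERFORM <from> [THROUGH <thru>]: run paragraphs from..thru inclusive,
// ignoring paragraph boundaries until the end of <thru> is reached.
function perform(from, thru = from) {
  let i = paragraphs.findIndex(([name]) => name === from);
  for (; i < paragraphs.length; i++) {
    const [name, body] = paragraphs[i];
    body();
    if (name === thru) return; // hit the stop marker: return to caller
  }
}

perform("A");      // simple PERFORM A    -> runs A only
perform("A", "X"); // PERFORM A THROUGH X -> runs A, B, X
```

Here the JS call stack carries the marker implicitly; the real translator needed its own stack structure, since paragraphs can PERFORM each other.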

Of course, we also had to add more data structures to handle the TIMES, UNTIL, VARYING, etc. clauses.

I think one of the difficulties with that job came from producing a proof-of-concept so quickly. It led to expectations that it would be a much simpler project than it actually was. It was also the project that made me start looking for a simple example of the most difficult construction when approaching a proof-of-concept. When the project was cancelled, we had pretty good coverage over most of the code base. Management asked for a "hands free" translator, and at the time, I didn't realize that we might have been able to renegotiate them into a 99+% translator with some manual intervention required.

The third one, translating one proprietary robot control language to another, was much simpler. The language wasn't general purpose, so it was much less trouble.

In general, it seems that it really is pretty simple to get a good chunk of code handled automatically. If you insist on fully automatic conversion with the entire syntax of the language enabled, though, things can get sticky. One problem is that handling the general case of some peculiar bit of syntax may complicate other bits of code. If you can recognize the constructs that introduce sticky problems, then you can use the simpler code most of the time, and revert to the more complex code only when you detect those situations.

These jobs are high on my list of the most fun. (#1=robotics control, #2=custom industrial automation applications, #3=language translation/compilation, .... #65472=financial applications.)

It was quite a lot of fun, so if you're wanting to chat about some of those sorts of things, PM me and we can chat. Also, if you haven't yet, find a good compiler book and read it. I read the old Dragon book (Aho, Sethi and Ullman), and would recommend it--it was a fun read--though the optimization section was tough to read through, IIRC. There are probably a mess of more contemporary books, but I haven't read them.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: sourcefilter with complete parser?
by LanX (Saint) on Dec 10, 2013 at 23:06 UTC
    Hey Roboticus,

    Talking about translation

    Thanks for your reply, and I know perfectly well what you are talking about. Little details can produce an unmanageable overhead.

    I've spent much time meditating over JS, and there are indeed many problems that aren't easily solved.

    BUT I learned not only a lot about JS but also about Perl.

    That said, I have a personal toy project patching B::Deparse to generate JS and eLisp from a limited Perl dialect. Avoiding the mental overhead of constantly switching between different keywords and syntax already pays off for me (like sub {} <-> function () {...} for lambdas, or my <-> var for lexicals).
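    For instance, this is the kind of mechanical mapping I mean (a made-up illustration, not actual B::Deparse output):

```javascript
// Made-up illustration of the Perl -> JS keyword/syntax mapping:
//
//   Perl: my $inc = sub { return $_[0] + 1 };        # my / sub
//   JS:   var inc = function (x) { return x + 1; };  // var / function
var inc = function (x) { return x + 1; };

//   Perl: my @doubled = map { $_ * 2 } (1, 2, 3);
//   JS:   var doubled = [1, 2, 3].map(function (x) { return x * 2; });
var doubled = [1, 2, 3].map(function (x) { return x * 2; });
```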

    This is far away from a fully automated translation of the whole language, but it's good enough for me, since I am aware of the different scoping rules of lexicals and I can check for possible conflicts. =)

    But the intention of my question was different

    When digging into Ruby, I'm always surprised to be surrounded by Perl idioms hiding behind a pretty syntax, to the extent that Ruby feels like a Perl dialect plus a prebuilt object system.

    (It's somehow fascinating and disturbing to hear people praising Ruby idioms taken from Perl while simultaneously bashing the source of those features.)

    This has led me to the question of whether anyone has ever tried to use source filters to completely tokenize and translate a different language (not necessarily more than 90% compatible with an existing one) and then evaluate the generated Perl code.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re^2: sourcefilter with complete parser?
by Laurent_R (Canon) on Dec 11, 2013 at 23:13 UTC

    Hi,

    Rolf, to start with, I should say that I have never done anything with source filters and that I barely know what it is all about (having read just a couple of articles on the subject). So my answer might be off-topic, sorry for that if it is.

    I just want to say that I have had an experience similar to Roboticus's, although certainly much less extensive. We have a major application (35 million customers, many thousands of programs) running under VMS. We were studying the possibility of migrating it to Unix/Oracle (because support for VMS is likely to end within a few years, perhaps before 2020). The main language used for this application (especially for all the functional parts and the database access) runs under both VMS and Unix, so most of that work would be recompiling the sources and making any adaptations needed: a large project, but it looks feasible. But probably a quarter to a third of the programs are scripts written in DCL, the VMS scripting language, more or less the equivalent of shell scripting under Unix. These are used to launch multiple processes in parallel, synchronize processes, transfer, copy or sort files, etc. The immense majority of these DCL scripts would have to be translated into shell scripts (or possibly Perl scripts in some cases).

    I participated in a "phase-0" pilot proof-of-concept automatic DCL-to-shell translating effort, and we were able to automatically produce shell equivalents of our DCL programs within a few weeks. But we knew that we had selected relatively easy cases. Therefore, a second phase (still proof-of-concept) was launched to get into the more complicated things. I was not directly involved in this second phase, so I can only relay what was reported to me: once you get into the gory details, it gets really very complicated. There are a number of things that just can't be processed automatically and need complete refactoring. And that was only phase 2 of the proof-of-concept study.

    The cost of the project, if it were to be launched, was estimated to be in the order of 15 to 20 million euros. A very big amount, indeed, but probably much less than migrating to a completely different system (that would probably cost 3 to 5 times as much). We have some extra time before deciding to go for it or not, but at least we have an idea on how difficult and costly it would be.

    My point was just to broadly confirm the general idea of Roboticus's post: translating 80% of the code is relatively easy, the next 15% are getting really hairy, and the last 5% might take more time than all the rest together.

    OK, I am not talking about a simple program, but about a very complex application with thousands of programs. What you are trying to do might be simpler (hopefully it is), but it is definitely not a simple task.

      Thanks, even if you replied to the wrong person. =)

      Please let me point out that 100% translation is always possible if you don't care about performance.

      The most brute-force way is to emulate the CPU of a processor that supports the language.

      I'm not interested in emulating or translating more than 80% of a language, even though it might be quite easy with some LISP dialects, where most of the complicated constructs are just implemented on top of a small core.

      Let me give you an example: there is a very subtle difference in how JS and Perl numify strings.

      Frankly I don't care to support software which relies on such differences. It even throws warnings.

        DB<116> use warnings; 0+"2"
        => 2
        DB<117> 0+"ss2"
        => 0
        DB<118> 0+"3ss2"
        => 3
        DB<119> use warnings; 0+"3ss2"
        Argument "3ss2" isn't numeric in addition

        >>> a="ss2"; 0+parseInt(a)
        NaN
        >>> a="3ss2"; 0+parseInt(a)
        3

      Of course it's possible to wrap every variable in numeric context with a function which numifies the Perl way.

      Such code would be incredibly slow.
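      For instance, a hypothetical perlNumify() helper (the name is invented) approximating Perl's take-the-leading-numeric-prefix rule might look like this:

```javascript
// Hypothetical helper (name invented) approximating Perl's numification:
// use the longest leading numeric prefix, or 0 if there is none.
// (Only approximate: Perl has further cases, e.g. "Inf" and hex strings.)
function perlNumify(v) {
  const m = String(v).match(/^\s*[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?/);
  return m ? parseFloat(m[0]) : 0;
}

// The translator would then have to rewrite every numeric context:
//   0 + a   becomes   0 + perlNumify(a)
```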

      But only supporting the "normal" scalar case, where strings are to be treated like numbers, is far faster.

      Just subtract zero:

        >>> a="2"; 0+a
        "02"
        >>> a="2"; 0+a-0
        2

      My theory is that it's pretty feasible to create a new language which is a subset of Perl5 or Perl6 and creates acceptable JS.

      "Acceptable" doesn't mean 100% compatible. Neither different versions of Perl nor JS are ever 100% compatible.

      Perlito is already quite good at this.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        My theory is that it's pretty feasible to create a new language which is a subset of Perl5 or Perl6 and creates acceptable JS.

        Have you considered helping Pawel Murias, Jimmy Zhuo, Reini Urban, Zaki Mughal, Tokuhiro Matsuno et al progress the JS backend for NQP? Note that while that project's main goal is to compile Rakudo Perl 6 to JS (which is why the repo is called rakudo-js), the interim goal is to compile the much simpler NQP to JS and (based on comments by pmurias) the project is quite close to achieving this lesser goal.

        Achieving this lesser goal would mean a few things potentially relevant to your quest:

        • There'd be a Perlish lang (NQP -- Not Quite Perl, a small subset of P6) that targets JS (in addition to PIR/Parrot, Java/JVM, and MoarVM).
        • There'd be a Perlish toolkit (also called NQP) for easily creating langs, dialects, and compilers that automatically target JS (and NQP's other backends).
        • Anything you did to improve JS codegen would be leveraged by all langs/compilers in the NQP ecosystem (NQP itself, P6, phpish, rubyish, yourlang, et al).