Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

Clunky parsing problem, looking for simple solution

by clintp (Curate)
on Jun 16, 2002 at 01:42 UTC ( #174889=perlquestion: print w/ replies, xml ) Need Help??
clintp has asked for the wisdom of the Perl Monks concerning the following question:

Well, I've been beating my head into the ground for most of the afternoon trying to solve this particular problem. I've got Parrot BASIC nearly compatable with Microsoft's GW-BASIC. Some things are missing of course (SOUND, POKE, COLOR, etc..) but for the most part it works.

Simple scripts it runs fine. However native Parrot BASIC lacks two things that GW BASIC has: mutiple statements on a line and "ELSE". So the larger scripts I run through a Perl filter that "ports" them to Parrot BASIC clean syntax. It's only about 30 lines of Perl now (most of the work is pouring through thousands of lines of BASIC to make sure I got it right).

The last hurdle is IF-THEN-ELSE. What I need to do is take those statements and convert them to Parrot BASIC's simpler syntax. I can do this by hand just fine. Observe:

If you have an "ELSE" statement, things of the form:
Can generally be re-written as:


Where A, B, C are fractional line numbers (assume they're allowed for a moment). Of course there's other ways to re-write that.

What I'm looking for is a bit of help. I've burned out my brain and can't quite seem to get the code correct. You see, the problem isn't just IF..THEN..ELSE, look at the following examples:

7130 IF B3<>0 THEN PRINT "FROM E TO S":W1=B4:X=B5:GOTO 6920 9810 IF K$="Y" AND RND(1) <.5 THEN GOTO 9820 ELSE GOTO 9770 8020 IF SO=0 THEN SO=1 ELSE SO=0 4835 IF V$="K" THEN A$="+K+" ELSE IF V$="M" THEN A$="!M!" ELSE IF V$=" +R" THEN A$="?R?" 1850 IF K3=0 AND EX(Q1,Q2)=0 THEN GOTO 8500 ELSE GOSUB 6000 1800 IF V$="K" THEN A$="+K+" ELSE IF V$="R" THEN A$="?R?" ELSE IF V$=" +M" THEN A$="!M!":Z1=R1:Z2=R2
My favorite is the last one of course. Note that the ELSE is an ELSE IF, there's two of them, and the final statement is actually a compound statement. WHEE. (And people call Perl write-only!).

If you've got some spare cycles, and want to help a fellow Perl Hacker whose lost his mind -- please do. :) Ultimately, a good test is if you can get the last line (1800) to look something like:

1800 IF V$="K" THEN GOTO 1800.2 1800.1 GOTO 1800.4 1800.2 A$="+K+" 1800.3 GOTO 1800.d 1800.4 IF V$="R" THEN GOTO 1800.6 1800.5 GOTO 1800.8 1800.6 A$="?R?" 1800.7 GOTO 1800.d 1800.8 IF V$="M" THEN GOTO 1800.a 1800.9 GOTO 1800.d 1800.a A$="!M!" 1800.b Z1=R1 1800.c Z2=R2 1800.d REM
Without chucking your sanity. Of course the other lines listed above should do something similar. Don't worry about the structure of *my* program: can you take a text line as shown above, and break it down properly? You don't have to do the compound-statement part (A:B:C). That's handled elsewhere in the code (but don't add more please!).

The only requirement for a solution is that I need to distribute this with a package that I can't alter significantly. So if you need a module, Reality says I'm gonna have to cut-and-paste the relevant bits or the module back into your snippet.

Replies are listed 'Best First'.
Re: Clunky parsing problem, looking for simple solution
by kvale (Monsignor) on Jun 16, 2002 at 03:26 UTC
    I don't have a snippet to drop into your compiler, but the general way handle this recursive complexity is with grammar. The following grammar (couresty of the Dragon book) describes the structure of your conditionals, and associates each ELSE with the closest unmatched THEN, eliminating potential ambiguity:
    <stmt> := <matched-stmt> | <unmatched-stmt> <matched-stmt> := "IF" <expr> "THEN" <matched-stmt> "ELSE" <matched-s +tmt> | <other-stmt> <unmatched-stmt> := "IF" <expr> "THEN" <stmt> | "IF" <expr> "THEN" <matched-stmt> "ELSE" <unmatch +ed-stmt>
    other-stmt is any kind of statement other than a conditional.

    Given this grammar, one may create a recursive descent parser that implements this grammar: each nonterminal becomes a subroutine, with stmt at the top of the tree (untested code):
    sub stmt { my $buf = shift; if (matched_stmt($buf)) { return 1; } elsif (unmatched_stmt($buf)) { return 1; } return 0; } # etc.
    A pity that we don't have Perl6 yet, this grammar could be dropped in straightaway :)

Re: Clunky parsing problem, looking for simple solution
by dws (Chancellor) on Jun 16, 2002 at 06:38 UTC
    Interesting problem. A simple regexp-based solution that doesn't take lexemes into account will fail, at least on pathological cases like 1800 IF V$="THEN FOO ELSE IF BAR" THEN $K="FOO"
    I don't have the brain cells left tonight to work through a complete solution, though I have a dim recollection of having done something like this many years back using "fixups". 1800 IF V$="K" THEN A$="+K+" ELSE IF V$="R" THEN A$="?R?" ELSE IF V$="M" THEN A$="!M!":Z1=R1:Z2=R2
    would translate first into
    1800 IF V$="K" THEN GOTO {fixup:skip} 1800.1 GOTO {fixup:after-next-goto} 1800.2 A$="+K+" 1800.3 GOTO {fixup:end} 1800.4 IF V$="R" THEN GOTO {fixup:skip} 1800.5 GOTO {fixup:after-next-goto} 1800.6 A$="?R?" 1800.7 GOTO {fixup:end} 1800.8 IF V$="M" THEN GOTO {fixup:skip} 1800.9 GOTO {fixup:end} 1800.a A$="!M!" 1800.b Z1=R1 1800.c Z2=R2 1800.d REM
    The second pass would peform the fixups.

    • {fixup:skip} becomes the number of the 2nd following line in the sequence
    • {fixup:end} becomes the final line number of the sequence
    • {fixup:after-next-goto} becomes the line number after the next goto in the sequence. (This works because conditionals cannot be nested within a line.)

    By using a {fixup:skip} fixup (rather than calculating the target line number as you're generating the sequence, you allow for the possibility of doing peephole optimizations. In the sequence above,

    1800.8 IF V$="M" THEN GOTO {fixup:skip} 1800.9 GOTO {fixup:end} 1800.a
    could be optimized to
    1800.8 IF V$<>"M" THEN GOTO {fixup:end} 1800.9
    This "renumbers" lines within the sequence, but since target line number calculation/assignment has been deferred, not GOTOs are broken.

      If it were a simple problem, I don't think I'd have posted it here. :)

      For now I suppose you can ignore the truly pathological case where there's a IF-like statement within double-quotes. I'm looking for a 90% solution and if some hand-porting is necessary then so be it.

      Note: I found just such a statement in my test suite. It was in a tutorial on GW-BASIC written in GW-BASIC.

        I don't think an IF-statement within double quotes should be looked at at all by your parser. It's just a string, after all. It's not a statement.

        Lur: "But if this cape shrinks, consider your species extinct!"

Re: Clunky parsing problem, looking for simple solution
by jepri (Parson) on Jun 17, 2002 at 05:38 UTC
    I have to agree with our AnoniMonk here, although a little more politely. It really depends how you are parsing it. Since this would indeed be a non-problem in parse-recdescent, I can assume you are not using it.

    Your problem may be that you are not following the informal rules closely enough. You need some kind of 'scope', so that when you hit a THEN or an ELSE statement, your program goes 'new scope, start evaluating expressions again'. That way your program is compartmentalised and the routines don't have to know if they are in a nested structure or not.

    e.g. if you hit an ELSE IF, your program should think to itself "got and ELSE, mark a label here, start evaluating normally, here's an if, call the if routine'. There's no need for it to know that it is inside an if routine already.

    If you stick with the rules then issues like this become trivial (just remember to ascend from your recursive routines when you get to the end of the line).

    And like our AnoniMonk says, Parse::RecDescent is fantastic for this kind of problem, but I find it makes register allocation much harder than it should be (in fact it beat me totally). With IMCC it should all be easy, though.

    I didn't believe in evil until I dated it.

      (background aside for other's listening, AM's deserve no replies)

      First let me clarify a few things. Firstly, within the BASIC interpreter itself, there's no formal "grammar" at all. Lines are dissected to find the statement being executed (the first token following the line number) and each statement's execution block has hard-wired into it a series of possible templates for that statement.

      As one rude monk asked, "The art of language design hasn't been around for 25+ years without having come up with solutions to this kind of thing (...) did you do any research into language design before embarking on this project?" As a matter of fact I did. 25 years ago when I first saw this language, and shortly thereafer obtained assembler source code for implementations of the language, I learned how it was done. When I thought to recreate it for the Parrot CPU this is how it happened. The total number of assembly instructions is about 2k (for interactive mode, runtime, parser, everything).

      When I get around to porting a pig like QuickBasic, I'll think about using a more open design with a formal grammar.

      So discounting comments directed towards the interpreter itself, I'm left with the GW-BASIC to Parrot BASIC Perl script. This is (so far, and can remain) a simple filter. I don't want to teach it BASIC. The only statement's that's "special" is IF..THEN..ELSE. If the parser gets confused, it gets confused. This is what GW-BASIC was like.

      (specifically to jepri)

      Looking at talexb's solution and your description, I think I'm going to write a small state machine to process tokens on an IF..THEN line and arrange them accordingly. kvale's notes triggered a few light-bulbs as well. I'll post something shortly...

        I think the academic computer world now calls small state machines Discrete Finite Automations or something. I read about how to do it in a lecture series I pulled of the net. Available upon request, or google for it. The upshot was that it was an interesting way of approaching the problem. To me it sounds like you are flogging yourself unnecessarily, but each to his own :)

        I didn't believe in evil until I dated it.

Re: Clunky parsing problem, looking for simple solution
by talexb (Canon) on Jun 17, 2002 at 03:56 UTC
    This is a little late, but I had so much fun writing it I thought I'd post it anyway.
    #!/usr/bin/perl -w # Something to decompose BASIC If-Then-Else statements into simpler # statements, expanding IF .. THEN .. ELSE and IF .. THEN statements # recursively as necessary, and also expanding sub-statments separate +d by # ':' characters. # # In response to node 174889 on PerlMonks. use strict; my @Block = (); # Block to hold output statments in reverse or +der. my $Depth = 0; # Variable to indicate out depth of recursion. # Read DATA in from in-line data section, parse line and output resul +t. while (<DATA>) { print $_ . "expands to:\n"; my ( $LineNumber, $LineSource ) = m/^(\d+) (.+)$/; $Depth = 0; ParseStatement ( $LineSource ); DumpSimpleEquivalent ( $LineNumber ); } # Parse the BASIC statment, recursively if necessary. sub ParseStatement { my ( $ThisLine ) = @_; # This variable will reflect any additional lines added to the bloc +k # of statements. my $StatementCount = 0; # Check for the more difficult case of an IF .. THEN .. ELSE if ( $ThisLine =~ /^\s*IF(.+?)THEN(.+?)ELSE(.+?)$/ ) { my ( @Tokens ) = ( $1, $2, $3 ); # print "3 part statement$: $Tokens[0]-$Tokens[1]-$Tokens[2].\n"; # Only if we're at the first level do we need to output an end bl +ock. if ( $Depth == 0 ) { push ( @Block, "REM End of If Then Else block\n\n" ); } # If there's another level of IF statement within, call self recu +rsively. if ( $3 =~ /IF/ ) { my $LocalCount = 4; $Depth++; $LocalCount += ParseStatement ( $Tokens[2] ); push ( @Block, "ELSE" ); push ( @Block, "GOTO +$LocalCount" ); # To be fixed up later } else # Otherwise, handle normally: Split statement on ':', add then, t +he ELSE # statement and the GOTO to go around this block. { my $LocalCount = 3; my @SubStatements = split ( /:/, $Tokens[2] ); $LocalCount += @SubStatements - 1; push ( @Block, @SubStatements ); push ( @Block, "ELSE" ); push ( @Block, "GOTO +$LocalCount" ); # To be fixed up later } push ( @Block, $Tokens[1] ); push ( @Block, "IF$Tokens[0]THEN" ); } # OK, it's not an IF .. THEN .. ELSE; try just an IF .. THEN. We do +n't need # to worry about recursion. elsif ( $ThisLine =~ /^\s*IF(.+?)THEN(.+?)$/ ) { my ( @Tokens ) = ( $1, $2 ); # print "2 part statement: $Tokens[0]-$Tokens[1].\n"; if ( $Depth == 0 ) { push ( @Block, "REM End of If Then Else block\n\n" ); } my @SubStatements = split ( /:/, $Tokens[1] ); $StatementCount = @SubStatements - 1; push ( @Block, @SubStatements ); push ( @Block, "IF$Tokens[0]THEN" ); } return ( $StatementCount ); } # Routine to dump the expanded (simplified) version of BASIC code. sub DumpSimpleEquivalent { my $LineNumber = shift; my @NewBlock = reverse @Block; my @SubLine = ( (0..9), ('a'..'z') ); my $Index = 0; foreach ( @NewBlock ) { # Update line numbering, delete leading space. $NewBlock[ $Index ] =~ s/\+(\d)/$LineNumber.".".$SubLine[ $Index+$ +1 ]/e; $NewBlock[ $Index ] =~ s/^\s+//; print "$LineNumber.$SubLine[ $Index ] $NewBlock[ $Index++ ]\n"; } @Block = (); } __DATA__ 7130 IF B3<>0 THEN PRINT "FROM E TO S":W1=B4:X=B5:GOTO 6920 9810 IF K$="Y" AND RND(1) <.5 THEN GOTO 9820 ELSE GOTO 9770 8020 IF SO=0 THEN SO=1 ELSE SO=0 4835 IF V$="K" THEN A$="+K+" ELSE IF V$="M" THEN A$="!M!" ELSE IF V$=" +R" THEN A$="?R?" 1850 IF K3=0 AND EX(Q1,Q2)=0 THEN GOTO 8500 ELSE GOSUB 6000 1800 IF V$="K" THEN A$="+K+" ELSE IF V$="R" THEN A$="?R?" ELSE IF V$=" +M" THEN A$="!M!":Z1=R1:Z2=R2

    --t. alex

    "Mud, mud, glorious mud. Nothing quite like it for cooling the blood!" --Michael Flanders and Donald Swann

    Update: After reading the other posts (I couldn't bear to read anything else till I was done my solution) I acknowledge that there are shortcomings in my solution..

    • No optimizations like deleting the second of two GOTO statements
    • A colon inside a string will muck up my statement separation operation

    Update 2: Well, of course it goes without saying, but jepri mentioned that Parse::RecDescent could be used for parsing your BASIC syntax .. but that could be likened to using a transport to carry a single ream of paper.

    I could also have implemented a b-tree structure to store the IF .. THEN .. ELSE statement pieces, but I decided just to write a quick and dirty solution, an array, instead. My solution ain't complete, but it will most likely take care of 90% of the job, and that sounded like what you needed.

Re: Clunky parsing problem, looking for simple solution
by Anonymous Monk on Jun 17, 2002 at 05:14 UTC
    *looks dumbfounded*

    Erm. The art of language design hasn't been around for 25+ years without having come up with solutions to this kind of thing. I don't want to sound negative but did you do any research into language design before embarking on this project?

    I completely agree with the concept of Doing for Oneself before simply grabbing other peoples solutions when doing your own projects, however usually at the point where you start needing to ask for help, it is usually best to go read up on general industry practice first.

    In this case, the very first thing you should have is a grammar based parser. As mentioned above, grammars are extremely effective at solving this particular type of problem. Indeed perl5 has a module called Parse::RecDecent or something which contains some excellent documentation on how to develop a proper grammar using it, and it plugs right into perl beautifully.

    The alternative course of action is, of course, to simply translate the basic language into valid perl and make perl do the hard work. I'm not sure why you chose not to follow this path but I assume you have your reasons.

    Anyway, my recommendation would be to consider re-writing the interpreter using RecDecent. Once the initial learning curve is over you should find it almost trivial to implement the language. I understand that you said "Don't worry about the structure of *my* program" but the essence of a good solution is understanding where the problem lies, and it lies in your program structure :)

    Good luck!

      My apologies, I didn't have time to clean it up, I was in a hurry:

      Missing all of the functions
      Missing some operators
      Missing post-processor that puts the lines back together

      its essentially a recursive decent parser for basic, which does a translation to write out an if-then-else construct as a series of if-then and gotos using labels. It prefixes all the lines of the result with the original line number, with the aim of helping a post-processor then go through and correct the original GOTO lines and labels to point to the correct places.

      it isn't well tested, but it will handle nested if/then/elses correctly. It could be expanded to handle compound statements with one line of code (mind you, in perl, thats not saying much).

      The penalty is that it has to know everything about the language. Its basically half way to a basic compiler for Parrot once you add the functions and ops.

      #!/usr/bin/perl use Data::Dumper; sub generic_node { my ($re, $tree) = @_; $current =~ s/^\s*//; if ($current !~ /$re/i) { return 0; } $current =~ s/$re//i; push @{$tree}, $1; return 1; }; sub generic_node_create { my ($re, $tree, $name) = @_; $current =~ s/^\s*//; if ($current !~ /$re/i) { return 0; } $current =~ s/$re//i; push @{$tree}, [$name, $1]; return 1; }; sub generic_check { my ($re, $tree) = @_; $current =~ s/^\s*//; return $current =~ s/$re//i; }; sub operator { generic_node_create('^([+-=*\/])',shift,"OPERATOR"); }; sub number { generic_node_create('^(\d+)',shift,"NUMBER"); }; sub string { generic_node_create('^\"([^\"]+)\"',shift,"STRING"); }; sub numeric_variable { generic_node_create('^([A-Z]\d)',shift,"NUMERIC_VAR"); }; sub string_variable { generic_node_create('^([A-Z]\$)',shift,"STRING_VAR"); }; sub equals { generic_node_create('^([=])',shift,"EQUALS"); }; sub expression { my $tree = shift; my $t = [EXPRESSION]; string_variable($t) || numeric_variable($t) || string($t) || number($t) || (generic_check('^[\(]',$t) && expression($t) && generic_check( +'^[\)]',$t)) || return 0; if (operator($t)) { expression($t) || return 0; } push @{$tree},$t; return 1; }; sub assignment { my $tree = shift; my $t = [ASSIGNMENT]; numeric_variable($t) || string_variable($t) || return 0; equals($t) || return 0; expression($t) || return 0; push @{$tree},$t; return 1; }; sub comment { if (generic_check('^REM')) { return 1; } }; sub if_condition { my $tree = shift; my $t = [IF_CONDITION]; generic_check('^IF',$t) || return 0; expression($t) || die "Invalid IF statement, expected expressi +on"; generic_check('^THEN',$t) || die "Invalid IF statement, expect +ed THEN"; statement($t) || die "Invalid IF construct, expected statement + after THEN"; if (generic_check('^ELSE',$t)) { statement($t) || die "Invalid IF/ELSE construct, expec +ted statement after ELSE"; } push @{$tree},$t; return 1; }; sub goto_statement { my $tree = shift; my $t = [GOTO]; generic_check('^GOTO',$t) || return 0; number($t) || die 'GOTO must be followed by a number'; push @{$tree},$t; }; sub statement { my $tree = shift; my $t = [STATEMENT]; assignment($t) || comment($t) || if_condition($t) || goto_statement($t) || return 0; push @{$tree},$t; }; # # Pretty-print routines # sub print_expression { my ($subelement) = @_; if ($subelement->[0] eq "EXPRESSION") { if ($subelement->[2][0] ne "OPERATOR") { return "(".print_expression($subelement->[1]). +")"; } else { return "(".print_expression($subelement->[1]). +" ".$subelement->[2][1]." ".prin t_expression($subelement->[3]).")"; } } if ($subelement->[0] eq "NUMBER") { return $subelement->[1]; } if ($subelement->[0] eq "STRING") { return '"'.$subelement->[1].'"'; } if ($subelement->[0] eq "STRING_VAR") { return $subelement->[1]; } if ($subelement->[0] eq "NUMBER_VAR") { return $subelement->[1]; } }; sub print_statement { my ($statement) = @_; my $s = ""; $level++; my $subelement = $statement->[1]; if ($subelement->[0] eq "IF_CONDITION") { add_line("IF ".print_expression($subelement->[1])." TH +EN GOTO THEN_$level"); if ($subelement->[3]) { print_statement($subelement->[3]); } add_line("GOTO END_$level"); add_line("THEN_$level:"); print_statement($subelement->[2]); + add_line("END_$level:"); } if ($subelement->[0] eq "GOTO") { add_line("GOTO ".$subelement->[1][1]); } if ($subelement->[0] eq "ASSIGNMENT") { add_line($subelement->[1][1]." = ".print_expression($s +ubelement->[3])); } if ($subelement->[0] eq "COMMENT") { add_line("REM - comment here"); } $level--; }; sub add_line { push @lines, [$current_line, shift]; }; sub pretty_print { my $tree = shift; $ln = shift(@{$tree}); print_statement($tree->[0]); }; while (<>) { chomp($_); $current = $_; $tree = []; generic_node('^(\d+)', $tree) || die "Line number invalid"; $current_line = $tree->[0]; statement($tree) || die "Invalid statement"; print Dumper($tree); $level = 0; pretty_print($tree); }; for (@lines) { print $_->[0]." ".$_->[1]."\n"; }
        Wow. If you put that much work in, you should make yourself an account so you can feel the goodness of peer appreciation. That and you can give clintp a name to hate, so he doesn't have to keep referring to an anonymous monk :)

        I didn't believe in evil until I dated it.

Re: Clunky parsing problem, looking for simple solution
by gumby (Scribe) on Jun 17, 2002 at 12:06 UTC
    Have you noticed in parrot/languages/scheme or /miniperl the modules; /,, and They all use a basic regexp approach coupled with a tree structure for parsing, and another for the ops.

    Update: I really think it would be a good idea to share some of the details of you're implementation.

      Stay tuned, a detailed description of the guts will be in a public forum near you soon. Of course, this will represent the state-of-the-art in BASIC interpreters c. 1975.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://174889]
Approved by Kanji
Front-paged by ajt
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (6)
As of 2016-06-27 11:14 GMT
Find Nodes?
    Voting Booth?
    My preferred method of making French fries (chips) is in a ...

    Results (337 votes). Check out past polls.