http://www.perlmonks.org?node_id=384588

castaway has asked for the wisdom of the Perl Monks concerning the following question:

I'm not sure if anyone has attempted this, or if we're just crazy, but.. We've been attempting to parse ASCII-art diagrams whichare used to define BNF. Why? Well DB2 doesn't see to fit to actually provide some sort of useful BNF for its SQL syntax, so we're trying to cheat. Without further ado..

Diagram examples:

>>-ACTIVATE--+-DATABASE-+--database-alias-----------------------> '-DB-------' >--+-------------------------------------+--------------------->< '-USER--username--+-----------------+-' '-USING--password-'
>>-select-clause--from-clause--+--------------+-----------------> '-where-clause-' >--+-----------------+--+---------------+---------------------->< '-group-by-clause-' '-having-clause-'
.-ALL------. >>-SELECT--+----------+-----------------------------------------> '-DISTINCT-' >--+-*-----------------------------------------------+--------->< | .-,-------------------------------------------. | | V | | '---+-expression--+-------------------------+-+-+-' | | .-AS-. | | | '-+----+--new-column-name-' | '-exposed-name.*--------------------------'

Mini-explanation of syntax/symbols:
1) Lower case strings containing dashes and colons - these are variables or sub rules.
2) Upper case strings, asterisks - Literal text.
3) '>>' - the start of a rule definition.
4) '->' - definition continues on next line.
5) '>-' - continuation of definition.
6) '><' - end of definition.

The main line of the definition runs along the line indicated by the '>>' symbol, anything on that line is compulsory. Anything above/below the line indicates an option that can appear at that point. The construct that looks like an arrow indicates repeated sections. A more comprehensive explanation can be found in the db2 docs.

The aim of the exercise is to produce a BNF grammar, eg:

rule1: 'ACTIVATE' ('DATABASE' | 'DB') database-alias ('' | 'USER' user +name ( '' | 'USING' password ) ) rule2: select-clause from-clause ('' | where-clause) ('' | group-by-cl +ause) ('' | having-clause) rule3: 'SELECT' ('ALL' | '' | 'DISTINCT') ( '*' | ( expression ( '' | +( ('' | 'AS' ) new-column-name ) ) | exposed-name '.*' )(/,/ *) )
or even:
rule2: select-clause from-clause where-clause(?) group-by-clause(?) ha +ving-clause(?) ..
.. and so on and so forth..

Got any ideas yet? We've taken several shots at it.. (that is, theorbtwo has, I've mostly boggled..) This works in some variations, just not in all..

#!/usr/bin/perl use warnings; use strict; use Data::Dump::Streamer 'Dumper', 'Dump'; use List::Util 'max'; $|++; my $DEBUG = 1; # $x -- virtual line number. # $y -- position along the virtual line of the leftmost point in the p +hysical line. my ($x, $y)=(0, 0); # The position along the virtual line of the last EOL-continuation. my $y_saved; # The line that has the start/end markers on it. my $mainline; # NOTE: (y, x) order, because it makes it easier to iterate over. my @info; while (<>) { chomp; s/\cM//; # Deal with DOSish newlines. print "Init : $_\n"; next if /^#/; if ($_ =~ /^\s*$/) { print "Empty line\n"; $x=0; if (not defined $y_saved) { # warn "Blank line with no continuation"; } $y=$y_saved; $y_saved=undef; next; } if (not defined $y) { warn "Nonempty line at undefined location?"; warn "$_\n"; last; } s(^>>){ my $y=$y+pos($_); $mainline=$x; $info[$x][$y] = ['START', $x, $y]; '--'; }gex; s(^>){ if ($x != $mainline) { print "???: $_\n" if $DEBUG; die "Line mismatch at >--. x=$x, mainline=$mainline"; } # if (not defined $y_saved) { # die ">- with no matching ->"; # } # $y=$y_saved; '-'; }gex; print "($x, $y): $_\n" if $DEBUG; s(>$){ $y_saved=$y+pos($_); '-'; }gex; s(><$){ my $y=$y+pos($_); $info[$x][$y] = ['END', $x, $y]; '--'; }gex; # Quoted literals. Parse these early, but not before we've # determined our position. s( \xA0(.*?)\xA0 ){ my $y=$y+pos($_); # print "Normal literal: $1\n"; $info[$x][$y] = ['LITERAL', $x, $y, $1]; '-' x (length($1)+2); }gex; # s(V([ ]*)\|){ # my $y=$y+pos($_); # $info[$x][$y] = ['REPEAT', $x, $y, length($1)]; # }gex; # # Ugly as fuck, but neccessary -- check if there's a repeat on the # # previous line at this position, and take care of the + at it's # # right-hand-side, to avoid the choice processor becomming confused +. # if ($info[$x-1][$y][0] eq 'REPEAT') { # substr($_, # } # 1 while because it's vital that we get all of these before # continuing, and they can nest. # Collect all possible combinations (eg 1st '+' to 2nd '+' AND 1st ' ++' # to 3rd '+' AND 2nd '+' to 3rd etc, where contents are not \s # and throw away the ones that dont make sense later? 1 while (s( ([+\'.]) ([^\s]*) \1 ) { my $inside = $2; my $y=$y+pos($_); $info[$x][$y] = ['CHOICE', $x, $y, length($inside)]; # print "Choice, inside=$inside\n"; if ($inside =~ /^-+$/) { # print "Emptystring choice\n"; $info[$x][$y+1] = ['LITERAL', $x, $y+1, '']; } '-' . $inside . '-'; }gex); # Quoted variables. These are actually "blocks" that are supposed t +o be # defined later in the same file, but we treat them the same. s(\| ([^ |]*) \|){ my $y=$y+pos($_); $info[$x][$y] = ['VARIABLE', $x, $y, $1]; '--'.('-' x length($1)).'--'; }ge; # Footnotes. I'm really not sure how to deal with them, but parse t +hem # here to prevent them showing up later as literals and parse failur +es. # Note that we can't use () delimiters here, or \) is just an end-pa +ren, # not a literal end-paren. s{\((\d+)\)}{ my $y=$y+pos($_); $info[$x][$y] = ['FOOTNOTE', $x, $y, $1]; '-' x (length($1)+2); }gex; # Thanks, [tye], for the regex. # We include . and * as valid chars for the inside of variables, bec +ause of # "exposed-name.*", in select-clause. Uck. # : is used as part of a variable in ch2host, but only at the beginn +ing, so # religate it to that position until we see more examples. # More examples: psmbod s( ([a-z:0-9\[\]]+(?:-[:a-z.*0-9\[\]]+)*) ){ my $y=$y+pos($_); $info[$x][$y] = ['VARIABLE', $x, $y, $1]; '-' x length ($1) }gex; # Variable then literal, since both can include *s. # = -- psmasn s( ([A-Z*,()_=]+) ){ my $y=$y+pos($_); # print "Normal literal: $1\n"; $info[$x][$y] = ['LITERAL', $x, $y, $1]; '-' x length($1); }gex; die "Unparsed bits? Final: $_" if (m/[^- |]/); # print "Final: $_\n"; $x++; } exit if !@info; Dump \@info if $DEBUG>=2; sub item2atom { my $item = shift; my $linenumber = shift; my $world = shift; return '' if !$item; my $type = $item->[0]; die "Line number passed wrong $linenumber / $item->[1]" if $linenumber != $item->[1]; # print "item2atom(".Dumper($item).")"; if ($type eq 'START' or $type eq 'END') { return ''; } elsif ($type eq 'LITERAL') { return '"' . quotemeta($item->[3]) . '"'; } elsif ($type eq 'CHOICE') { my @subrules; # Fixme: Should look up and down from the source line, until it fi +nds a position # which is not part of the choice. # But watch out for: # ---+--A--+--- # | | # '--B--' my $choicesfound=0; foreach my $linenumber (0..$#{$world}) { my $line = $world->[$linenumber]; print "Choice, checking $linenumber\n" if $DEBUG>=2; # Keep going unless this line also begins a choice here. next if not defined $line->[$item->[2]]; next unless $line->[$item->[2]][0] eq 'CHOICE'; # Make sure it has the same length. next unless $line->[$item->[2]][3] == $item->[3]; print "$linenumber is choice, following...\n" if $DEBUG>=2; $choicesfound++; push @subrules, line2rule($linenumber, $item->[2]+1, $item->[3]- +1, $world); # Replace bit we just iterated over with an equal number of # undefs, so we don't see it twice. splice(@$line, $item->[2]+1, $item->[3]-1, (undef)x($item->[3]-1 +)); } if ($choicesfound < 2) { warn "Choice with one ($choicesfound) option at ($item->[1], $ +item->[2])"; $world->[$linenumber]->[$item->[2]] = undef; } return "(".join(' | ', @subrules).")"; } elsif ($type eq 'VARIABLE') { return $item->[3]; } else { die "Unknown item type: $type" . Dumper($item); } } sub line2rule { my ($linenumber, $colnumber, $length, $world) = @_; my @result; for my $i ( $colnumber .. $colnumber+$length-1 ) { # print "line2rule: ", Dumper $world->[$linenumber][$i]; my $result = item2atom($world->[$linenumber][$i], $linenumber, $wo +rld); next unless $result; push @result, $result; } return join ' ', @result; } # print "Mainline: $mainline\n"; print line2rule($mainline, 0, 0+@{$info[$mainline]}, \@info), "\n";
The code above has problems with the 'option list' parts, which we are calling choices. In some instances, a choice begins at a + and ends at the next +, and in some instances it begins at the first + and ends at the last one, with sub choices in between. My idea to fix that was to collect all possible combinations of start/end, and throw away any that turn out not to have any actual values attached.. But surely theres a better way.. As for the repeat bits, well, we havent gotten that far yet.

Any help is appreciated.. (Yes, I've attempted to find a DB2 BNF grammar, but without luck, and this has taken a life of its own anyway.. )

C. and theorbtwo

Replies are listed 'Best First'.
Re: Parsing BNF syntax diagrams..
by kvale (Monsignor) on Aug 20, 2004 at 15:22 UTC
    Programs in most computer languages, like Perl, are one-dimensional linear strings. The 'language' above (these used to be called railroad diagrams) is two-dimensional and parsing nonlinear languages is known to be difficult in general.

    Another example of a two-dimensional language is Befunge, a language whose sole purpose is to be hard to parse :-) You might try looking at Befunge parsers to get an idea of how to go about it.

    I have not tried to understand your compiler, but if I was to go about this, I would split each diagram into a two-dimensional array of single characters. Then I would 'walk' the diagram, building up a topological graph. Finally, I would convert that graph/NFA into a grammar.

    But the fastest method might just be to convert the diagrams manually yourself.

    -Mark

      You can find a dotted-decimal representation of the same syntax diagrams (meant for screen readers) in the DB2 for zSeries online documentation which should be much easier to parse. Unfortunately, getting to the dotted-decimal representation isn't all that easy... here's a quick summary.
      1. Open the DB2 for zSeries information center in your Web browser.
      2. Find the statement or command you're interested in.
      3. View the source of the frame to find <img src="c.gif" alt="Read syntax diagram" longdesc="syntax.htm" border="0" />. syntax.htm will be some long HTML filename that contains the dotted-decimal version of the syntax.
      4. View the dotted-decimal syntax by copying the frame URL and changing the filename to point to the dotted-decimal filename. The DB2 for zSeries information center uses frames heavily, so the base URL doesn't change to reflect the content of the page you're looking at.

      Here's an example of two "normal" syntax diagrams and their dotted decimal equivalents (1 (subselect) and 2 (select-clause)).

      These should be much easier to parse if you can get LWP to grab the dotted-decimal syntax files.

      Update: Duh, here's a page that describes the dotted-decimal syntax format.

      Update 2: If you need the DB2 for Linux, UNIX, Windows version of the syntax, you can install the DB2 8.2 Information Center locally, unzip the doc.zip files in eclipse/plugins/com.ibm.db2.udb.doc/doc.zip, and work with the HTML files directly rather than going through LWP and the Web. Nice way of avoiding the relative URI problems with the framesets, too. Unfortunately, DB2 8.2 for Linux, UNIX, and Windows won't be released for a little while -- in the meantime, if you can still get your hands on the DB2 Stinger beta information center that will also do the job. Versions of DB2 prior to 8.2 don't include the dotted-decimal syntax diagrams.

        TIMTOWTSAC++ (There's more than one way to skin a cat.)


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Parsing BNF syntax diagrams..
by Solo (Deacon) on Aug 20, 2004 at 15:23 UTC
    DB2 claims to be SQL-92 compliant. There are several ANSI SQL-92 BNF definitions out there. It might be reasonable to start with one of those and add the additional features of the DB2 implementation as you bump into them.

    --Solo

    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
      Of the top 100 RDBMS's by usage, I have heard of a single one that is truly SQL-92-compliant (or SQL-99-compliant, for that matter). Some are more compliant than others, but that's almost a matter of taste vs. compliance.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        Well, I said 'claims to be compliant' for that reason.

        At any rate, it seems no more difficult to me to adjust the BNF for SQL-92 than to create the BNF from the diagrams IBM provides. In one case you'd need to test thoroughly to find all the DB2 statements not in the standard or redefined for DB2, on the other you'd need to make sure you'd parsed the diagrams into BNF correctly.

        --Solo

        --
        You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
Re: Parsing BNF syntax diagrams..
by dragonchild (Archbishop) on Aug 20, 2004 at 14:17 UTC
    If I remember correctly, BNF is a standard way of defining commands'n'options. So, there should be a BNF parser somewhere which would parse to a standard data structure. You should be able to take the data structure and convert it into something the DB2 would understand.

    Sorry if I'm not more help, but I'm still not sure what you're trying to transform the BNF into ...

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Sorry if that wasn't clear. I'd like to have a BNF (preferably something Parse::RecDescent likes), out of these silly diagrams.. With that I would then parse actual DB2 SQL.

      C.

        I have a vague recollection that many moons ago the early DB2 docs contained real EBNF diagrams. I would not be at all surprised if somewhere in IBM there is tool (that the author is no doubt very proud of :), that takes EBNF syntax as input and produces these diagrams as output.

        It just might be worth contacting your local friendly IBM support person and asking if there is any chance of tracking the EBNF down.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Parsing BNF syntax diagrams..
by eric256 (Parson) on Aug 21, 2004 at 00:03 UTC

    While I haven't gotten to read the code (or couldn't :) ) but it seems like it would be best to map these two a 2 diminsional array. Move the continuations up and attatch them. Then start at the begining and travel along. Maybe even build an OO representation of choices etc that then condenses down into your final BNF. Looks kinda like fun once you start thinking about it much. :)


    ___________
    Eric Hodges
Re: Parsing BNF syntax diagrams..
by ikegami (Patriarch) on Aug 21, 2004 at 04:13 UTC
    I'm using Parse::RecDescent to parse that "ASCII-art". By "using", I mean "coercing", but I've made some progress. It's gonna be interesting :)

      Whoohoo! I finally have parsers for the above. See...
      387769: This version uses Parse::RecDescent.
      387772: This version doesn't uses Parse::RecDescent.