Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I'm not sure if anyone has attempted this, or if we're just crazy, but.. We've been attempting to parse ASCII-art diagrams whichare used to define BNF. Why? Well DB2 doesn't see to fit to actually provide some sort of useful BNF for its SQL syntax, so we're trying to cheat. Without further ado..

Diagram examples:

>>-ACTIVATE--+-DATABASE-+--database-alias-----------------------> '-DB-------' >--+-------------------------------------+--------------------->< '-USER--username--+-----------------+-' '-USING--password-'
>>-select-clause--from-clause--+--------------+-----------------> '-where-clause-' >--+-----------------+--+---------------+---------------------->< '-group-by-clause-' '-having-clause-'
.-ALL------. >>-SELECT--+----------+-----------------------------------------> '-DISTINCT-' >--+-*-----------------------------------------------+--------->< | .-,-------------------------------------------. | | V | | '---+-expression--+-------------------------+-+-+-' | | .-AS-. | | | '-+----+--new-column-name-' | '-exposed-name.*--------------------------'

Mini-explanation of syntax/symbols:
1) Lower case strings containing dashes and colons - these are variables or sub rules.
2) Upper case strings, asterisks - Literal text.
3) '>>' - the start of a rule definition.
4) '->' - definition continues on next line.
5) '>-' - continuation of definition.
6) '><' - end of definition.

The main line of the definition runs along the line indicated by the '>>' symbol, anything on that line is compulsory. Anything above/below the line indicates an option that can appear at that point. The construct that looks like an arrow indicates repeated sections. A more comprehensive explanation can be found in the db2 docs.

The aim of the exercise is to produce a BNF grammar, eg:

rule1: 'ACTIVATE' ('DATABASE' | 'DB') database-alias ('' | 'USER' user +name ( '' | 'USING' password ) ) rule2: select-clause from-clause ('' | where-clause) ('' | group-by-cl +ause) ('' | having-clause) rule3: 'SELECT' ('ALL' | '' | 'DISTINCT') ( '*' | ( expression ( '' | +( ('' | 'AS' ) new-column-name ) ) | exposed-name '.*' )(/,/ *) )
or even:
rule2: select-clause from-clause where-clause(?) group-by-clause(?) ha +ving-clause(?) ..
.. and so on and so forth..

Got any ideas yet? We've taken several shots at it.. (that is, theorbtwo has, I've mostly boggled..) This works in some variations, just not in all..

#!/usr/bin/perl use warnings; use strict; use Data::Dump::Streamer 'Dumper', 'Dump'; use List::Util 'max'; $|++; my $DEBUG = 1; # $x -- virtual line number. # $y -- position along the virtual line of the leftmost point in the p +hysical line. my ($x, $y)=(0, 0); # The position along the virtual line of the last EOL-continuation. my $y_saved; # The line that has the start/end markers on it. my $mainline; # NOTE: (y, x) order, because it makes it easier to iterate over. my @info; while (<>) { chomp; s/\cM//; # Deal with DOSish newlines. print "Init : $_\n"; next if /^#/; if ($_ =~ /^\s*$/) { print "Empty line\n"; $x=0; if (not defined $y_saved) { # warn "Blank line with no continuation"; } $y=$y_saved; $y_saved=undef; next; } if (not defined $y) { warn "Nonempty line at undefined location?"; warn "$_\n"; last; } s(^>>){ my $y=$y+pos($_); $mainline=$x; $info[$x][$y] = ['START', $x, $y]; '--'; }gex; s(^>){ if ($x != $mainline) { print "???: $_\n" if $DEBUG; die "Line mismatch at >--. x=$x, mainline=$mainline"; } # if (not defined $y_saved) { # die ">- with no matching ->"; # } # $y=$y_saved; '-'; }gex; print "($x, $y): $_\n" if $DEBUG; s(>$){ $y_saved=$y+pos($_); '-'; }gex; s(><$){ my $y=$y+pos($_); $info[$x][$y] = ['END', $x, $y]; '--'; }gex; # Quoted literals. Parse these early, but not before we've # determined our position. s( \xA0(.*?)\xA0 ){ my $y=$y+pos($_); # print "Normal literal: $1\n"; $info[$x][$y] = ['LITERAL', $x, $y, $1]; '-' x (length($1)+2); }gex; # s(V([ ]*)\|){ # my $y=$y+pos($_); # $info[$x][$y] = ['REPEAT', $x, $y, length($1)]; # }gex; # # Ugly as fuck, but neccessary -- check if there's a repeat on the # # previous line at this position, and take care of the + at it's # # right-hand-side, to avoid the choice processor becomming confused +. # if ($info[$x-1][$y][0] eq 'REPEAT') { # substr($_, # } # 1 while because it's vital that we get all of these before # continuing, and they can nest. # Collect all possible combinations (eg 1st '+' to 2nd '+' AND 1st ' ++' # to 3rd '+' AND 2nd '+' to 3rd etc, where contents are not \s # and throw away the ones that dont make sense later? 1 while (s( ([+\'.]) ([^\s]*) \1 ) { my $inside = $2; my $y=$y+pos($_); $info[$x][$y] = ['CHOICE', $x, $y, length($inside)]; # print "Choice, inside=$inside\n"; if ($inside =~ /^-+$/) { # print "Emptystring choice\n"; $info[$x][$y+1] = ['LITERAL', $x, $y+1, '']; } '-' . $inside . '-'; }gex); # Quoted variables. These are actually "blocks" that are supposed t +o be # defined later in the same file, but we treat them the same. s(\| ([^ |]*) \|){ my $y=$y+pos($_); $info[$x][$y] = ['VARIABLE', $x, $y, $1]; '--'.('-' x length($1)).'--'; }ge; # Footnotes. I'm really not sure how to deal with them, but parse t +hem # here to prevent them showing up later as literals and parse failur +es. # Note that we can't use () delimiters here, or \) is just an end-pa +ren, # not a literal end-paren. s{\((\d+)\)}{ my $y=$y+pos($_); $info[$x][$y] = ['FOOTNOTE', $x, $y, $1]; '-' x (length($1)+2); }gex; # Thanks, [tye], for the regex. # We include . and * as valid chars for the inside of variables, bec +ause of # "exposed-name.*", in select-clause. Uck. # : is used as part of a variable in ch2host, but only at the beginn +ing, so # religate it to that position until we see more examples. # More examples: psmbod s( ([a-z:0-9\[\]]+(?:-[:a-z.*0-9\[\]]+)*) ){ my $y=$y+pos($_); $info[$x][$y] = ['VARIABLE', $x, $y, $1]; '-' x length ($1) }gex; # Variable then literal, since both can include *s. # = -- psmasn s( ([A-Z*,()_=]+) ){ my $y=$y+pos($_); # print "Normal literal: $1\n"; $info[$x][$y] = ['LITERAL', $x, $y, $1]; '-' x length($1); }gex; die "Unparsed bits? Final: $_" if (m/[^- |]/); # print "Final: $_\n"; $x++; } exit if !@info; Dump \@info if $DEBUG>=2; sub item2atom { my $item = shift; my $linenumber = shift; my $world = shift; return '' if !$item; my $type = $item->[0]; die "Line number passed wrong $linenumber / $item->[1]" if $linenumber != $item->[1]; # print "item2atom(".Dumper($item).")"; if ($type eq 'START' or $type eq 'END') { return ''; } elsif ($type eq 'LITERAL') { return '"' . quotemeta($item->[3]) . '"'; } elsif ($type eq 'CHOICE') { my @subrules; # Fixme: Should look up and down from the source line, until it fi +nds a position # which is not part of the choice. # But watch out for: # ---+--A--+--- # | | # '--B--' my $choicesfound=0; foreach my $linenumber (0..$#{$world}) { my $line = $world->[$linenumber]; print "Choice, checking $linenumber\n" if $DEBUG>=2; # Keep going unless this line also begins a choice here. next if not defined $line->[$item->[2]]; next unless $line->[$item->[2]][0] eq 'CHOICE'; # Make sure it has the same length. next unless $line->[$item->[2]][3] == $item->[3]; print "$linenumber is choice, following...\n" if $DEBUG>=2; $choicesfound++; push @subrules, line2rule($linenumber, $item->[2]+1, $item->[3]- +1, $world); # Replace bit we just iterated over with an equal number of # undefs, so we don't see it twice. splice(@$line, $item->[2]+1, $item->[3]-1, (undef)x($item->[3]-1 +)); } if ($choicesfound < 2) { warn "Choice with one ($choicesfound) option at ($item->[1], $ +item->[2])"; $world->[$linenumber]->[$item->[2]] = undef; } return "(".join(' | ', @subrules).")"; } elsif ($type eq 'VARIABLE') { return $item->[3]; } else { die "Unknown item type: $type" . Dumper($item); } } sub line2rule { my ($linenumber, $colnumber, $length, $world) = @_; my @result; for my $i ( $colnumber .. $colnumber+$length-1 ) { # print "line2rule: ", Dumper $world->[$linenumber][$i]; my $result = item2atom($world->[$linenumber][$i], $linenumber, $wo +rld); next unless $result; push @result, $result; } return join ' ', @result; } # print "Mainline: $mainline\n"; print line2rule($mainline, 0, 0+@{$info[$mainline]}, \@info), "\n";
The code above has problems with the 'option list' parts, which we are calling choices. In some instances, a choice begins at a + and ends at the next +, and in some instances it begins at the first + and ends at the last one, with sub choices in between. My idea to fix that was to collect all possible combinations of start/end, and throw away any that turn out not to have any actual values attached.. But surely theres a better way.. As for the repeat bits, well, we havent gotten that far yet.

Any help is appreciated.. (Yes, I've attempted to find a DB2 BNF grammar, but without luck, and this has taken a life of its own anyway.. )

C. and theorbtwo

In reply to Parsing BNF syntax diagrams.. by castaway

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others drinking their drinks and smoking their pipes about the Monastery: (4)
    As of 2019-08-18 19:08 GMT
    Find Nodes?
      Voting Booth?
      If you were the first to set foot on the Moon, what would be your epigram?

      Results (135 votes). Check out past polls.