Parsing EBNF with Perl 5 to eventually parse XML with Perl 6

Updates - several code changes. There are probably more bugs to be fixed... (Geez, I'm fixing bugs in code that can't even be executed? :)

After reading Parsing with Perl 6, I wanted to write Perl 6 regexes too. Just for fun, and to get used to the new powerful regex language.

I once wrote a very simple xml-ish well formedness checking regex, but it uses PCRE with a patch, so it isn't really useful. Let's turn it into something even less useful, I thought. So I translated it to Perl 6:

rule xml :i {
    ^
    \s*
    [
        # Single tags like <foo/>
        \<
        \s*
        <[a-z:]>+
        (?:
            \s*<[a-z:]>*
            \s* = \s*
            (?:' <[^']>* ' | " <[^"]>* ")
        )*
        \s*
        /\s*
        \>
    |
        # Tags in pairs like <foo>content</foo>
        \<
        \s*
        $1 := (<[a-z:]>+)
        [
            \s*<[a-z:]>*
            \s* = \s*
            [ ' <[^']>* ' | " <[^"]>* " ]
        ]*
        \s*
        \>

        [ <[^<>]>* | <xml> ]*

        \< \s* / \s* $1 \s* \>
    ]
    \s*
}
[download]

Now that was child's play. It's only a few bytes different from the piece of code I wrote before.

Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at http://www.w3.org/TR/2000/REC-xml-20001006, and it has grammar in Extended Backus-Naur Form (name ::= definition). I started to translate:

document ::=    prolog element Misc*
Char     ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#
+x10000-#x10FFFF]
S        ::=    (#x20 | #x9 | #xD | #xA)+
etcetera, etcetera
[download]

into:

rule document { <prolog> <element> <Misc>* }
rule Char     { <[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]> }
rule S        { <[\x20\x9\xD\xA]>+ }
rule NameChar { <Letter> | <Digit> | <[._:-]> | <CombiningChar> | <Ext
+ender> }
rule Name     { [ <Letter | <[_:]> ] <NameChar>* }
rule Names    { <Name> [<S> <Name>]* }

rule Nmtoken  { <NameChar>+ }
rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* }
etcetera, etcetera
[download]

Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:

#!/usr/bin/perl -w

use strict;
use XML::TreeBuilder;

my $tree = XML::TreeBuilder->new;

$tree->parse_file('rec');
# http://www.w3.org/TR/2000/REC-xml-20001006.xml

for ($tree->look_down(_tag => 'prod')) {
    print 'rule ';
    print $_->look_down(_tag => 'lhs')->content_list;
    print " {\n    ";

    my $not = 0;
    my $p = 1;

    my @foo = $_->look_down(_tag => 'rhs');
    for (map { $_->content_list } @foo) {
         if (ref) {
             if (my $t = $_->look_down(_tag => 'nt')) {
                  print '<', $t->content_list, '>';
             }

             $not = 0, print '>' if $not and $p == $not;
             next;
         }
         {
             s/\xC2\xA0/ /g; # Non breaking space
             s/\s+/ /g;
             s/^S//g and do {
                  print '<S>';
                  $not = 0, print '>' if $not and $p == $not;
                  next;
             };
             s/^'([^']*)'// || s/^"([^"]*)"// and do {
                  my $m = $1;
                  if ($m =~ /[\@\$%<>:\\.]/) {
                      $m =~ s/\\/\\\\/g;
                      $m =~ s/'/\\'/g;
                      print "<'$m'>";
                  } else {
                      print $m;
                  }
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^\[([^]]*)\]// and do {
                  my $m = $1;
                  $m =~ s/#x/\\x/g;
          my $neg = (($m =~ s/^\^//) ? '-' : '');
                  print "<$neg\[$m]>";
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^#x([0-9A-Fa-f]+)// and do {
                  print "\\x$1";
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^-// and do {
                  $not = $p;
                  print '<!after';
                  redo;
             };
             s/^\(// and do {
                  print '[';
                  ++$p;
                  redo;
             };
             s/^\)// and do {
                  print ']';
                  --$p;
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^[|*+? ]+// and do {
                  (my $x = $&) =~ tr/()/[]/;
                  print $x;
                  redo;
             };
             next if not length;
             print "ERROR: $_\n";
         }
         
    }

    print "\n}\n\n";
}
[download]

Okay, let's see: there's a tree that is walked, printing the new version on the fly. This is done in an ugly way, but I think that's allowed for quick hacks.

And here is what it magically spit out:

rule document {
    <prolog> <element> <Misc>*
}

rule Char {
    \x9 | \xA | \xD | <[\x20-\xD7FF]> | <[\xE000-\xFFFD]> | <[\x10000-
+\x10FFFF]>
}

rule S {
    [\x20 | \x9 | \xD | \xA]+
}

rule NameChar {
    <Letter> | <Digit> | <'.'> | - | _ | <':'> | <CombiningChar> | <Ex
+tender>
}

rule Name {
    [<Letter> | _ | <':'>] [<NameChar>]*
}

rule Names {
    <Name> [<S> <Name>]*
}

rule Nmtoken {
    [<NameChar>]+
}

rule Nmtokens {
    <Nmtoken> [<S> <Nmtoken>]*
}

rule EntityValue {
    " [<-[%&"]> | <PEReference> | <Reference>]* " | ' [<-[%&']> | <PER
+eference> | <Reference>]* '
}

rule AttValue {
    " [<-[<&"]> | <Reference>]* " | ' [<-[<&']> | <Reference>]* '
}

rule SystemLiteral {
    [" <-["]>* "] | [' <-[']>* '] 
}

rule PubidLiteral {
    " <PubidChar>* " | ' [<PubidChar> <!after '>]* '
}

rule PubidChar {
    \x20 | \xD | \xA | <[a-zA-Z0-9]> | <[-'()+,./:=?;!*#@$_%]>
}

rule CharData {
    <-[<&]>* <!after [<-[<&]>* <']]>'> <-[<&]>*]>
}

rule Comment {
    <'<!--'> [[<Char> <!after ->] | [- [<Char> <!after ->]]]* <'-->'>
}

rule PI {
    <'<?'> <PITarget> [<S> [<Char>* <!after [<Char>* <'?>'> <Char>*]>]
+]? <'?>'>
}

rule PITarget {
    <Name> <!after [[X | x] [M | m] [L | l]]>
}

rule CDSect {
    <CDStart> <CData> <CDEnd>
}

rule CDStart {
    <'<![CDATA['>
}

rule CData {
    [<Char>* <!after [<Char>* <']]>'> <Char>*]>] 
}

rule CDEnd {
    <']]>'>
}

rule prolog {
    <XMLDecl>? <Misc>* [<doctypedecl> <Misc>*]?
}

rule XMLDecl {
    <'<?xml'> <VersionInfo> <EncodingDecl>? <SDDecl>? <S>? <'?>'>
}

rule VersionInfo {
    <S> version <Eq> [' <VersionNum> ' | " <VersionNum> "]
}

rule Eq {
    <S>? = <S>?
}

rule VersionNum {
    [<[a-zA-Z0-9_.:]> | -]+
}

rule Misc {
    <Comment> | <PI> | <S>
}

rule doctypedecl {
    <'<!DOCTYPE'> <S> <Name> [<S> <ExternalID>]? <S>? [[ [<markupdecl>
+ | <DeclSep>]* ] <S>?]? <'>'>
}

rule DeclSep {
    <PEReference> | <S>
}

rule markupdecl {
    <elementdecl> | <AttlistDecl> | <EntityDecl> | <NotationDecl> | <P
+I> | <Comment> 
}

rule extSubset {
    <TextDecl>? <extSubsetDecl>
}

rule extSubsetDecl {
    [ <markupdecl> | <conditionalSect> | <DeclSep>]*
}

rule SDDecl {
     <S> standalone <Eq> [[' [yes | no] '] | [" [yes | no] "]] 
}

rule LanguageID {
    <Langcode> [- <Subcode>]*
}

rule Langcode {
    <ISO639Code> | <IanaCode> | <UserCode>
}

rule ISO639Code {
    [<[a-z]> | <[A-Z]>] [<[a-z]> | <[A-Z]>]
}

rule IanaCode {
    [i | I] - [<[a-z]> | <[A-Z]>]+
}

rule UserCode {
    [x | X] - [<[a-z]> | <[A-Z]>]+
}

rule Subcode {
    [<[a-z]> | <[A-Z]>]+
}

rule element {
    <EmptyElemTag>| <STag> <content> <ETag>
}

rule STag {
    <'<'> <Name> [<S> <Attribute>]* <S>? <'>'>
}

rule Attribute {
    <Name> <Eq> <AttValue>
}

rule ETag {
    <'</'> <Name> <S>? <'>'>
}

rule content {
    <CharData>? [[<element> | <Reference> | <CDSect> | <PI> | <Comment
+>] <CharData>?]*
}

rule EmptyElemTag {
    <'<'> <Name> [<S> <Attribute>]* <S>? <'/>'>
}

rule elementdecl {
    <'<!ELEMENT'> <S> <Name> <S> <contentspec> <S>? <'>'>
}

rule contentspec {
    EMPTY | ANY | <Mixed> | <children> 
}

rule children {
    [<choice> | <seq>] [? | * | +]?
}

rule cp {
    [<Name> | <choice> | <seq>] [? | * | +]?
}

rule choice {
    ( <S>? <cp> [ <S>? | <S>? <cp> ]+ <S>? )
}

rule seq {
    ( <S>? <cp> [ <S>? , <S>? <cp> ]* <S>? )
}

rule Mixed {
    ( <S>? #PCDATA [<S>? | <S>? <Name>]* <S>? )* | ( <S>? #PCDATA <S>?
+ ) 
}

rule AttlistDecl {
    <'<!ATTLIST'> <S> <Name> <AttDef>* <S>? <'>'>
}

rule AttDef {
    <S> <Name> <S> <AttType> <S> <DefaultDecl>
}

rule AttType {
    <StringType> | <TokenizedType> | <EnumeratedType> 
}

rule StringType {
    CDATA
}

rule TokenizedType {
    ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS
}

rule EnumeratedType {
    <NotationType> | <Enumeration> 
}

rule NotationType {
    NOTATION <S> ( <S>? <Name> [<S>? | <S>? <Name>]* <S>? ) 
}

rule Enumeration {
    ( <S>? <Nmtoken> [<S>? | <S>? <Nmtoken>]* <S>? )
}

rule DefaultDecl {
    #REQUIRED | #IMPLIED | [[#FIXED <S><AttValue>]
}

rule conditionalSect {
    <includeSect> | <ignoreSect> 
}

rule includeSect {
    <'<!['> <S><extSubsetDecl> <']]>'> 
}

rule ignoreSect {
    <'<!['> <S><ignoreSectContents>* <']]>'>
}

rule ignoreSectContents {
    <Ignore> [<'<!['> <ignoreSectContents> <']]>'> <Ignore>]*
}

rule Ignore {
    <Char>* <!after [<Char>* [<'<!['> | <']]>'>] <Char>*]> 
}

rule CharRef {
    &# <[0-9]>+ ; | &#x <[0-9a-fA-F]>+ ;
}

rule Reference {
    <EntityRef> | <CharRef>
}

rule EntityRef {
    & <Name> ;
}

rule PEReference {
    <'%'> <Name> ;
}

rule EntityDecl {
    <GEDecl> | <PEDecl>
}

rule GEDecl {
    <'<!ENTITY'> <S> <Name> <S> <EntityDef> <S>? <'>'>
}

rule PEDecl {
    <'<!ENTITY'> <S> <'%'> <S> <Name> <S> <PEDef> <S>? <'>'>
}

rule EntityDef {
    <EntityValue> | [<ExternalID> <NDataDecl>?]
}

rule PEDef {
    <EntityValue> | <ExternalID>
}

rule ExternalID {
    SYSTEM <S> <SystemLiteral>| PUBLIC <S> <PubidLiteral> <S> <SystemL
+iteral> 
}

rule NDataDecl {
    <S> NDATA <S> <Name>
}

rule TextDecl {
    <'<?xml'> <VersionInfo>? <EncodingDecl> <S>? <'?>'>
}

rule extParsedEnt {
    <TextDecl>? <content>
}

rule extPE {
    <TextDecl>? <extSubsetDecl>
}

rule EncodingDecl {
    <S> encoding <Eq> [" <EncName> " | ' <EncName> ' ] 
}

rule EncName {
    <[A-Za-z]> [<[A-Za-z0-9._]> | -]*
}

rule NotationDecl {
    <'<!NOTATION'> <S> <Name> <S> [<ExternalID> | <PublicID>] <S>? <'>
+'>
}

rule PublicID {
    PUBLIC <S> <PubidLiteral> 
}

rule Letter {
    <BaseChar> | <Ideographic>
}

rule BaseChar {
    # Large block manually removed.
}

rule Ideographic {
    <[\x4E00-\x9FA5]> | \x3007 | <[\x3021-\x3029]> 
}

rule CombiningChar {
    # Large block manually removed.
}

rule Digit {
    # Large block manually removed.
}

rule Extender {
    # Large block manually removed.
}
[download]

I'm not sure about the <!after> parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.

Oh, it's not finished. It needs grammar XML { ... } wrapped around it. I leave that as an excercise for the reader ;)

I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...

Back to Meditations