<?xml version="1.0" encoding="windows-1252"?>
<node id="179755" title="Parsing EBNF with Perl 5 to eventually parse XML with Perl 6" created="2002-07-05 18:20:53" updated="2005-08-02 11:30:07">
<type id="120">
perlmeditation</type>
<author id="132236">
Juerd</author>
<data>
<field name="doctext">
&lt;p&gt;
&lt;b&gt;Updates&lt;/b&gt; - several code changes. There are probably more bugs to be fixed... (Geez, I'm fixing bugs in code that can't even be executed? :)
&lt;/p&gt;

&lt;p&gt;
After reading [id://179555], I wanted to write Perl 6 regexes too. Just for fun, and to get used to the new powerful regex language.
&lt;/p&gt;
&lt;p&gt;
I once [id://155967|wrote] a very simple xml-ish well formedness checking regex, but it uses PCRE with a patch, so it isn't really useful. Let's turn it into something even less useful, I thought. So I translated it to Perl 6:
&lt;code&gt;
rule xml :i {
    ^
    \s*
    [
        # Single tags like &lt;foo/&gt;
        \&lt;
        \s*
        &lt;[a-z:]&gt;+
        (?:
            \s*&lt;[a-z:]&gt;*
            \s* = \s*
            (?:' &lt;[^']&gt;* ' | " &lt;[^"]&gt;* ")
        )*
        \s*
        /\s*
        \&gt;
    |
        # Tags in pairs like &lt;foo&gt;content&lt;/foo&gt;
        \&lt;
        \s*
        $1 := (&lt;[a-z:]&gt;+)
        [
            \s*&lt;[a-z:]&gt;*
            \s* = \s*
            [ ' &lt;[^']&gt;* ' | " &lt;[^"]&gt;* " ]
        ]*
        \s*
        \&gt;

        [ &lt;[^&lt;&gt;]&gt;* | &lt;xml&gt; ]*

        \&lt; \s* / \s* $1 \s* \&gt;
    ]
    \s*
}
&lt;/code&gt;
Now that was child's play. It's only a few bytes different from the piece of code I wrote before.
&lt;/p&gt;
&lt;p&gt;
Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at [http://www.w3.org/TR/2000/REC-xml-20001006], and it has grammar in Extended Backus-Naur Form (&lt;code&gt;name ::= definition&lt;/code&gt;). I started to translate:
&lt;readmore&gt;
&lt;code&gt;
document ::=    prolog element Misc*
Char     ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
S        ::=    (#x20 | #x9 | #xD | #xA)+
etcetera, etcetera
&lt;/code&gt;
into:
&lt;code&gt;
rule document { &lt;prolog&gt; &lt;element&gt; &lt;Misc&gt;* }
rule Char     { &lt;[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]&gt; }
rule S        { &lt;[\x20\x9\xD\xA]&gt;+ }
rule NameChar { &lt;Letter&gt; | &lt;Digit&gt; | &lt;[._:-]&gt; | &lt;CombiningChar&gt; | &lt;Extender&gt; }
rule Name     { [ &lt;Letter | &lt;[_:]&gt; ] &lt;NameChar&gt;* }
rule Names    { &lt;Name&gt; [&lt;S&gt; &lt;Name&gt;]* }

rule Nmtoken  { &lt;NameChar&gt;+ }
rule Nmtokens { &lt;Nmtoken&gt; [&lt;S&gt; &lt;Nmtoken&gt;]* }
etcetera, etcetera
&lt;/code&gt;
&lt;/p&gt;
&lt;p&gt;
Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:
&lt;code&gt;
#!/usr/bin/perl -w

use strict;
use XML::TreeBuilder;

my $tree = XML::TreeBuilder-&gt;new;

$tree-&gt;parse_file('rec');
# http://www.w3.org/TR/2000/REC-xml-20001006.xml

for ($tree-&gt;look_down(_tag =&gt; 'prod')) {
    print 'rule ';
    print $_-&gt;look_down(_tag =&gt; 'lhs')-&gt;content_list;
    print " {\n    ";

    my $not = 0;
    my $p = 1;

    my @foo = $_-&gt;look_down(_tag =&gt; 'rhs');
    for (map { $_-&gt;content_list } @foo) {
         if (ref) {
             if (my $t = $_-&gt;look_down(_tag =&gt; 'nt')) {
                  print '&lt;', $t-&gt;content_list, '&gt;';
             }

             $not = 0, print '&gt;' if $not and $p == $not;
             next;
         }
         {
             s/\xC2\xA0/ /g; # Non breaking space
             s/\s+/ /g;
             s/^S//g and do {
                  print '&lt;S&gt;';
                  $not = 0, print '&gt;' if $not and $p == $not;
                  next;
             };
             s/^'([^']*)'// || s/^"([^"]*)"// and do {
                  my $m = $1;
                  if ($m =~ /[\@\$%&lt;&gt;:\\.]/) {
                      $m =~ s/\\/\\\\/g;
                      $m =~ s/'/\\'/g;
                      print "&lt;'$m'&gt;";
                  } else {
                      print $m;
                  }
                  $not = 0, print '&gt;' if $not and $p == $not;
                  redo;
             };
             s/^\[([^]]*)\]// and do {
                  my $m = $1;
                  $m =~ s/#x/\\x/g;
		  my $neg = (($m =~ s/^\^//) ? '-' : '');
                  print "&lt;$neg\[$m]&gt;";
                  $not = 0, print '&gt;' if $not and $p == $not;
                  redo;
             };
             s/^#x([0-9A-Fa-f]+)// and do {
                  print "\\x$1";
                  $not = 0, print '&gt;' if $not and $p == $not;
                  redo;
             };
             s/^-// and do {
                  $not = $p;
                  print '&lt;!after';
                  redo;
             };
             s/^\(// and do {
                  print '[';
                  ++$p;
                  redo;
             };
             s/^\)// and do {
                  print ']';
                  --$p;
                  $not = 0, print '&gt;' if $not and $p == $not;
                  redo;
             };
             s/^[|*+? ]+// and do {
                  (my $x = $&amp;) =~ tr/()/[]/;
                  print $x;
                  redo;
             };
             next if not length;
             print "ERROR: $_\n";
         }
         
    }

    print "\n}\n\n";
}
&lt;/code&gt;
Okay, let's see: there's a tree that is walked, printing the new version on the fly.  This is done in an ugly way, but I think that's allowed for quick hacks.
&lt;/p&gt;
&lt;p&gt;
And here is what it magically spit out:
&lt;code&gt;
rule document {
    &lt;prolog&gt; &lt;element&gt; &lt;Misc&gt;*
}

rule Char {
    \x9 | \xA | \xD | &lt;[\x20-\xD7FF]&gt; | &lt;[\xE000-\xFFFD]&gt; | &lt;[\x10000-\x10FFFF]&gt;
}

rule S {
    [\x20 | \x9 | \xD | \xA]+
}

rule NameChar {
    &lt;Letter&gt; | &lt;Digit&gt; | &lt;'.'&gt; | - | _ | &lt;':'&gt; | &lt;CombiningChar&gt; | &lt;Extender&gt;
}

rule Name {
    [&lt;Letter&gt; | _ | &lt;':'&gt;] [&lt;NameChar&gt;]*
}

rule Names {
    &lt;Name&gt; [&lt;S&gt; &lt;Name&gt;]*
}

rule Nmtoken {
    [&lt;NameChar&gt;]+
}

rule Nmtokens {
    &lt;Nmtoken&gt; [&lt;S&gt; &lt;Nmtoken&gt;]*
}

rule EntityValue {
    " [&lt;-[%&amp;"]&gt; | &lt;PEReference&gt; | &lt;Reference&gt;]* " | ' [&lt;-[%&amp;']&gt; | &lt;PEReference&gt; | &lt;Reference&gt;]* '
}

rule AttValue {
    " [&lt;-[&lt;&amp;"]&gt; | &lt;Reference&gt;]* " | ' [&lt;-[&lt;&amp;']&gt; | &lt;Reference&gt;]* '
}

rule SystemLiteral {
    [" &lt;-["]&gt;* "] | [' &lt;-[']&gt;* '] 
}

rule PubidLiteral {
    " &lt;PubidChar&gt;* " | ' [&lt;PubidChar&gt; &lt;!after '&gt;]* '
}

rule PubidChar {
    \x20 | \xD | \xA | &lt;[a-zA-Z0-9]&gt; | &lt;[-'()+,./:=?;!*#@$_%]&gt;
}

rule CharData {
    &lt;-[&lt;&amp;]&gt;* &lt;!after [&lt;-[&lt;&amp;]&gt;* &lt;']]&gt;'&gt; &lt;-[&lt;&amp;]&gt;*]&gt;
}

rule Comment {
    &lt;'&lt;!--'&gt; [[&lt;Char&gt; &lt;!after -&gt;] | [- [&lt;Char&gt; &lt;!after -&gt;]]]* &lt;'--&gt;'&gt;
}

rule PI {
    &lt;'&lt;?'&gt; &lt;PITarget&gt; [&lt;S&gt; [&lt;Char&gt;* &lt;!after [&lt;Char&gt;* &lt;'?&gt;'&gt; &lt;Char&gt;*]&gt;]]? &lt;'?&gt;'&gt;
}

rule PITarget {
    &lt;Name&gt; &lt;!after [[X | x] [M | m] [L | l]]&gt;
}

rule CDSect {
    &lt;CDStart&gt; &lt;CData&gt; &lt;CDEnd&gt;
}

rule CDStart {
    &lt;'&lt;![CDATA['&gt;
}

rule CData {
    [&lt;Char&gt;* &lt;!after [&lt;Char&gt;* &lt;']]&gt;'&gt; &lt;Char&gt;*]&gt;] 
}

rule CDEnd {
    &lt;']]&gt;'&gt;
}

rule prolog {
    &lt;XMLDecl&gt;? &lt;Misc&gt;* [&lt;doctypedecl&gt; &lt;Misc&gt;*]?
}

rule XMLDecl {
    &lt;'&lt;?xml'&gt; &lt;VersionInfo&gt; &lt;EncodingDecl&gt;? &lt;SDDecl&gt;? &lt;S&gt;? &lt;'?&gt;'&gt;
}

rule VersionInfo {
    &lt;S&gt; version &lt;Eq&gt; [' &lt;VersionNum&gt; ' | " &lt;VersionNum&gt; "]
}

rule Eq {
    &lt;S&gt;? = &lt;S&gt;?
}

rule VersionNum {
    [&lt;[a-zA-Z0-9_.:]&gt; | -]+
}

rule Misc {
    &lt;Comment&gt; | &lt;PI&gt; | &lt;S&gt;
}

rule doctypedecl {
    &lt;'&lt;!DOCTYPE'&gt; &lt;S&gt; &lt;Name&gt; [&lt;S&gt; &lt;ExternalID&gt;]? &lt;S&gt;? [[ [&lt;markupdecl&gt; | &lt;DeclSep&gt;]* ] &lt;S&gt;?]? &lt;'&gt;'&gt;
}

rule DeclSep {
    &lt;PEReference&gt; | &lt;S&gt;
}

rule markupdecl {
    &lt;elementdecl&gt; | &lt;AttlistDecl&gt; | &lt;EntityDecl&gt; | &lt;NotationDecl&gt; | &lt;PI&gt; | &lt;Comment&gt; 
}

rule extSubset {
    &lt;TextDecl&gt;? &lt;extSubsetDecl&gt;
}

rule extSubsetDecl {
    [ &lt;markupdecl&gt; | &lt;conditionalSect&gt; | &lt;DeclSep&gt;]*
}

rule SDDecl {
     &lt;S&gt; standalone &lt;Eq&gt; [[' [yes | no] '] | [" [yes | no] "]] 
}

rule LanguageID {
    &lt;Langcode&gt; [- &lt;Subcode&gt;]*
}

rule Langcode {
    &lt;ISO639Code&gt; | &lt;IanaCode&gt; | &lt;UserCode&gt;
}

rule ISO639Code {
    [&lt;[a-z]&gt; | &lt;[A-Z]&gt;] [&lt;[a-z]&gt; | &lt;[A-Z]&gt;]
}

rule IanaCode {
    [i | I] - [&lt;[a-z]&gt; | &lt;[A-Z]&gt;]+
}

rule UserCode {
    [x | X] - [&lt;[a-z]&gt; | &lt;[A-Z]&gt;]+
}

rule Subcode {
    [&lt;[a-z]&gt; | &lt;[A-Z]&gt;]+
}

rule element {
    &lt;EmptyElemTag&gt;| &lt;STag&gt; &lt;content&gt; &lt;ETag&gt;
}

rule STag {
    &lt;'&lt;'&gt; &lt;Name&gt; [&lt;S&gt; &lt;Attribute&gt;]* &lt;S&gt;? &lt;'&gt;'&gt;
}

rule Attribute {
    &lt;Name&gt; &lt;Eq&gt; &lt;AttValue&gt;
}

rule ETag {
    &lt;'&lt;/'&gt; &lt;Name&gt; &lt;S&gt;? &lt;'&gt;'&gt;
}

rule content {
    &lt;CharData&gt;? [[&lt;element&gt; | &lt;Reference&gt; | &lt;CDSect&gt; | &lt;PI&gt; | &lt;Comment&gt;] &lt;CharData&gt;?]*
}

rule EmptyElemTag {
    &lt;'&lt;'&gt; &lt;Name&gt; [&lt;S&gt; &lt;Attribute&gt;]* &lt;S&gt;? &lt;'/&gt;'&gt;
}

rule elementdecl {
    &lt;'&lt;!ELEMENT'&gt; &lt;S&gt; &lt;Name&gt; &lt;S&gt; &lt;contentspec&gt; &lt;S&gt;? &lt;'&gt;'&gt;
}

rule contentspec {
    EMPTY | ANY | &lt;Mixed&gt; | &lt;children&gt; 
}

rule children {
    [&lt;choice&gt; | &lt;seq&gt;] [? | * | +]?
}

rule cp {
    [&lt;Name&gt; | &lt;choice&gt; | &lt;seq&gt;] [? | * | +]?
}

rule choice {
    ( &lt;S&gt;? &lt;cp&gt; [ &lt;S&gt;? | &lt;S&gt;? &lt;cp&gt; ]+ &lt;S&gt;? )
}

rule seq {
    ( &lt;S&gt;? &lt;cp&gt; [ &lt;S&gt;? , &lt;S&gt;? &lt;cp&gt; ]* &lt;S&gt;? )
}

rule Mixed {
    ( &lt;S&gt;? #PCDATA [&lt;S&gt;? | &lt;S&gt;? &lt;Name&gt;]* &lt;S&gt;? )* | ( &lt;S&gt;? #PCDATA &lt;S&gt;? ) 
}

rule AttlistDecl {
    &lt;'&lt;!ATTLIST'&gt; &lt;S&gt; &lt;Name&gt; &lt;AttDef&gt;* &lt;S&gt;? &lt;'&gt;'&gt;
}

rule AttDef {
    &lt;S&gt; &lt;Name&gt; &lt;S&gt; &lt;AttType&gt; &lt;S&gt; &lt;DefaultDecl&gt;
}

rule AttType {
    &lt;StringType&gt; | &lt;TokenizedType&gt; | &lt;EnumeratedType&gt; 
}

rule StringType {
    CDATA
}

rule TokenizedType {
    ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS
}

rule EnumeratedType {
    &lt;NotationType&gt; | &lt;Enumeration&gt; 
}

rule NotationType {
    NOTATION &lt;S&gt; ( &lt;S&gt;? &lt;Name&gt; [&lt;S&gt;? | &lt;S&gt;? &lt;Name&gt;]* &lt;S&gt;? ) 
}

rule Enumeration {
    ( &lt;S&gt;? &lt;Nmtoken&gt; [&lt;S&gt;? | &lt;S&gt;? &lt;Nmtoken&gt;]* &lt;S&gt;? )
}

rule DefaultDecl {
    #REQUIRED | #IMPLIED | [[#FIXED &lt;S&gt;&lt;AttValue&gt;]
}

rule conditionalSect {
    &lt;includeSect&gt; | &lt;ignoreSect&gt; 
}

rule includeSect {
    &lt;'&lt;!['&gt; &lt;S&gt;&lt;extSubsetDecl&gt; &lt;']]&gt;'&gt; 
}

rule ignoreSect {
    &lt;'&lt;!['&gt; &lt;S&gt;&lt;ignoreSectContents&gt;* &lt;']]&gt;'&gt;
}

rule ignoreSectContents {
    &lt;Ignore&gt; [&lt;'&lt;!['&gt; &lt;ignoreSectContents&gt; &lt;']]&gt;'&gt; &lt;Ignore&gt;]*
}

rule Ignore {
    &lt;Char&gt;* &lt;!after [&lt;Char&gt;* [&lt;'&lt;!['&gt; | &lt;']]&gt;'&gt;] &lt;Char&gt;*]&gt; 
}

rule CharRef {
    &amp;# &lt;[0-9]&gt;+ ; | &amp;#x &lt;[0-9a-fA-F]&gt;+ ;
}

rule Reference {
    &lt;EntityRef&gt; | &lt;CharRef&gt;
}

rule EntityRef {
    &amp; &lt;Name&gt; ;
}

rule PEReference {
    &lt;'%'&gt; &lt;Name&gt; ;
}

rule EntityDecl {
    &lt;GEDecl&gt; | &lt;PEDecl&gt;
}

rule GEDecl {
    &lt;'&lt;!ENTITY'&gt; &lt;S&gt; &lt;Name&gt; &lt;S&gt; &lt;EntityDef&gt; &lt;S&gt;? &lt;'&gt;'&gt;
}

rule PEDecl {
    &lt;'&lt;!ENTITY'&gt; &lt;S&gt; &lt;'%'&gt; &lt;S&gt; &lt;Name&gt; &lt;S&gt; &lt;PEDef&gt; &lt;S&gt;? &lt;'&gt;'&gt;
}

rule EntityDef {
    &lt;EntityValue&gt; | [&lt;ExternalID&gt; &lt;NDataDecl&gt;?]
}

rule PEDef {
    &lt;EntityValue&gt; | &lt;ExternalID&gt;
}

rule ExternalID {
    SYSTEM &lt;S&gt; &lt;SystemLiteral&gt;| PUBLIC &lt;S&gt; &lt;PubidLiteral&gt; &lt;S&gt; &lt;SystemLiteral&gt; 
}

rule NDataDecl {
    &lt;S&gt; NDATA &lt;S&gt; &lt;Name&gt;
}

rule TextDecl {
    &lt;'&lt;?xml'&gt; &lt;VersionInfo&gt;? &lt;EncodingDecl&gt; &lt;S&gt;? &lt;'?&gt;'&gt;
}

rule extParsedEnt {
    &lt;TextDecl&gt;? &lt;content&gt;
}

rule extPE {
    &lt;TextDecl&gt;? &lt;extSubsetDecl&gt;
}

rule EncodingDecl {
    &lt;S&gt; encoding &lt;Eq&gt; [" &lt;EncName&gt; " | ' &lt;EncName&gt; ' ] 
}

rule EncName {
    &lt;[A-Za-z]&gt; [&lt;[A-Za-z0-9._]&gt; | -]*
}

rule NotationDecl {
    &lt;'&lt;!NOTATION'&gt; &lt;S&gt; &lt;Name&gt; &lt;S&gt; [&lt;ExternalID&gt; | &lt;PublicID&gt;] &lt;S&gt;? &lt;'&gt;'&gt;
}

rule PublicID {
    PUBLIC &lt;S&gt; &lt;PubidLiteral&gt; 
}

rule Letter {
    &lt;BaseChar&gt; | &lt;Ideographic&gt;
}

rule BaseChar {
    # Large block manually removed.
}

rule Ideographic {
    &lt;[\x4E00-\x9FA5]&gt; | \x3007 | &lt;[\x3021-\x3029]&gt; 
}

rule CombiningChar {
    # Large block manually removed.
}

rule Digit {
    # Large block manually removed.
}

rule Extender {
    # Large block manually removed.
}

&lt;/code&gt;

I'm not sure about the &lt;code&gt;&lt;!after&gt;&lt;/code&gt; parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.
&lt;/p&gt;
&lt;p&gt;
Oh, it's not finished. It needs &lt;code&gt;grammar XML { ... }&lt;/code&gt; wrapped around it. I leave that as an excercise for the reader ;)
&lt;/p&gt;
&lt;p&gt;
I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...
&lt;/p&gt;</field>
</data>
</node>
