comment on

Updates - several code changes. There are probably more bugs to be fixed... (Geez, I'm fixing bugs in code that can't even be executed? :)

After reading Parsing with Perl 6, I wanted to write Perl 6 regexes too. Just for fun, and to get used to the new powerful regex language.

I once wrote a very simple xml-ish well formedness checking regex, but it uses PCRE with a patch, so it isn't really useful. Let's turn it into something even less useful, I thought. So I translated it to Perl 6:

rule xml :i {
    ^
    \s*
    [
        # Single tags like <foo/>
        \<
        \s*
        <[a-z:]>+
        (?:
            \s*<[a-z:]>*
            \s* = \s*
            (?:' <[^']>* ' | " <[^"]>* ")
        )*
        \s*
        /\s*
        \>
    |
        # Tags in pairs like <foo>content</foo>
        \<
        \s*
        $1 := (<[a-z:]>+)
        [
            \s*<[a-z:]>*
            \s* = \s*
            [ ' <[^']>* ' | " <[^"]>* " ]
        ]*
        \s*
        \>

        [ <[^<>]>* | <xml> ]*

        \< \s* / \s* $1 \s* \>
    ]
    \s*
}
[download]

Now that was child's play. It's only a few bytes different from the piece of code I wrote before.

Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at http://www.w3.org/TR/2000/REC-xml-20001006, and it has grammar in Extended Backus-Naur Form (name ::= definition). I started to translate:

document ::=    prolog element Misc*
Char     ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#
+x10000-#x10FFFF]
S        ::=    (#x20 | #x9 | #xD | #xA)+
etcetera, etcetera
[download]

into:

rule document { <prolog> <element> <Misc>* }
rule Char     { <[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]> }
rule S        { <[\x20\x9\xD\xA]>+ }
rule NameChar { <Letter> | <Digit> | <[._:-]> | <CombiningChar> | <Ext
+ender> }
rule Name     { [ <Letter | <[_:]> ] <NameChar>* }
rule Names    { <Name> [<S> <Name>]* }

rule Nmtoken  { <NameChar>+ }
rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* }
etcetera, etcetera
[download]

Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:

#!/usr/bin/perl -w

use strict;
use XML::TreeBuilder;

my $tree = XML::TreeBuilder->new;

$tree->parse_file('rec');
# http://www.w3.org/TR/2000/REC-xml-20001006.xml

for ($tree->look_down(_tag => 'prod')) {
    print 'rule ';
    print $_->look_down(_tag => 'lhs')->content_list;
    print " {\n    ";

    my $not = 0;
    my $p = 1;

    my @foo = $_->look_down(_tag => 'rhs');
    for (map { $_->content_list } @foo) {
         if (ref) {
             if (my $t = $_->look_down(_tag => 'nt')) {
                  print '<', $t->content_list, '>';
             }

             $not = 0, print '>' if $not and $p == $not;
             next;
         }
         {
             s/\xC2\xA0/ /g; # Non breaking space
             s/\s+/ /g;
             s/^S//g and do {
                  print '<S>';
                  $not = 0, print '>' if $not and $p == $not;
                  next;
             };
             s/^'([^']*)'// || s/^"([^"]*)"// and do {
                  my $m = $1;
                  if ($m =~ /[\@\$%<>:\\.]/) {
                      $m =~ s/\\/\\\\/g;
                      $m =~ s/'/\\'/g;
                      print "<'$m'>";
                  } else {
                      print $m;
                  }
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^\[([^]]*)\]// and do {
                  my $m = $1;
                  $m =~ s/#x/\\x/g;
          my $neg = (($m =~ s/^\^//) ? '-' : '');
                  print "<$neg\[$m]>";
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^#x([0-9A-Fa-f]+)// and do {
                  print "\\x$1";
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^-// and do {
                  $not = $p;
                  print '<!after';
                  redo;
             };
             s/^\(// and do {
                  print '[';
                  ++$p;
                  redo;
             };
             s/^\)// and do {
                  print ']';
                  --$p;
                  $not = 0, print '>' if $not and $p == $not;
                  redo;
             };
             s/^[|*+? ]+// and do {
                  (my $x = $&) =~ tr/()/[]/;
                  print $x;
                  redo;
             };
             next if not length;
             print "ERROR: $_\n";
         }
         
    }

    print "\n}\n\n";
}
[download]

Okay, let's see: there's a tree that is walked, printing the new version on the fly. This is done in an ugly way, but I think that's allowed for quick hacks.

And here is what it magically spit out:

rule document {
    <prolog> <element> <Misc>*
}

rule Char {
    \x9 | \xA | \xD | <[\x20-\xD7FF]> | <[\xE000-\xFFFD]> | <[\x10000-
+\x10FFFF]>
}

rule S {
    [\x20 | \x9 | \xD | \xA]+
}

rule NameChar {
    <Letter> | <Digit> | <'.'> | - | _ | <':'> | <CombiningChar> | <Ex
+tender>
}

rule Name {
    [<Letter> | _ | <':'>] [<NameChar>]*
}

rule Names {
    <Name> [<S> <Name>]*
}

rule Nmtoken {
    [<NameChar>]+
}

rule Nmtokens {
    <Nmtoken> [<S> <Nmtoken>]*
}

rule EntityValue {
    " [<-[%&"]> | <PEReference> | <Reference>]* " | ' [<-[%&']> | <PER
+eference> | <Reference>]* '
}

rule AttValue {
    " [<-[<&"]> | <Reference>]* " | ' [<-[<&']> | <Reference>]* '
}

rule SystemLiteral {
    [" <-["]>* "] | [' <-[']>* '] 
}

rule PubidLiteral {
    " <PubidChar>* " | ' [<PubidChar> <!after '>]* '
}

rule PubidChar {
    \x20 | \xD | \xA | <[a-zA-Z0-9]> | <[-'()+,./:=?;!*#@$_%]>
}

rule CharData {
    <-[<&]>* <!after [<-[<&]>* <']]>'> <-[<&]>*]>
}

rule Comment {
    <'<!--'> [[<Char> <!after ->] | [- [<Char> <!after ->]]]* <'-->'>
}

rule PI {
    <'<?'> <PITarget> [<S> [<Char>* <!after [<Char>* <'?>'> <Char>*]>]
+]? <'?>'>
}

rule PITarget {
    <Name> <!after [[X | x] [M | m] [L | l]]>
}

rule CDSect {
    <CDStart> <CData> <CDEnd>
}

rule CDStart {
    <'<![CDATA['>
}

rule CData {
    [<Char>* <!after [<Char>* <']]>'> <Char>*]>] 
}

rule CDEnd {
    <']]>'>
}

rule prolog {
    <XMLDecl>? <Misc>* [<doctypedecl> <Misc>*]?
}

rule XMLDecl {
    <'<?xml'> <VersionInfo> <EncodingDecl>? <SDDecl>? <S>? <'?>'>
}

rule VersionInfo {
    <S> version <Eq> [' <VersionNum> ' | " <VersionNum> "]
}

rule Eq {
    <S>? = <S>?
}

rule VersionNum {
    [<[a-zA-Z0-9_.:]> | -]+
}

rule Misc {
    <Comment> | <PI> | <S>
}

rule doctypedecl {
    <'<!DOCTYPE'> <S> <Name> [<S> <ExternalID>]? <S>? [[ [<markupdecl>
+ | <DeclSep>]* ] <S>?]? <'>'>
}

rule DeclSep {
    <PEReference> | <S>
}

rule markupdecl {
    <elementdecl> | <AttlistDecl> | <EntityDecl> | <NotationDecl> | <P
+I> | <Comment> 
}

rule extSubset {
    <TextDecl>? <extSubsetDecl>
}

rule extSubsetDecl {
    [ <markupdecl> | <conditionalSect> | <DeclSep>]*
}

rule SDDecl {
     <S> standalone <Eq> [[' [yes | no] '] | [" [yes | no] "]] 
}

rule LanguageID {
    <Langcode> [- <Subcode>]*
}

rule Langcode {
    <ISO639Code> | <IanaCode> | <UserCode>
}

rule ISO639Code {
    [<[a-z]> | <[A-Z]>] [<[a-z]> | <[A-Z]>]
}

rule IanaCode {
    [i | I] - [<[a-z]> | <[A-Z]>]+
}

rule UserCode {
    [x | X] - [<[a-z]> | <[A-Z]>]+
}

rule Subcode {
    [<[a-z]> | <[A-Z]>]+
}

rule element {
    <EmptyElemTag>| <STag> <content> <ETag>
}

rule STag {
    <'<'> <Name> [<S> <Attribute>]* <S>? <'>'>
}

rule Attribute {
    <Name> <Eq> <AttValue>
}

rule ETag {
    <'</'> <Name> <S>? <'>'>
}

rule content {
    <CharData>? [[<element> | <Reference> | <CDSect> | <PI> | <Comment
+>] <CharData>?]*
}

rule EmptyElemTag {
    <'<'> <Name> [<S> <Attribute>]* <S>? <'/>'>
}

rule elementdecl {
    <'<!ELEMENT'> <S> <Name> <S> <contentspec> <S>? <'>'>
}

rule contentspec {
    EMPTY | ANY | <Mixed> | <children> 
}

rule children {
    [<choice> | <seq>] [? | * | +]?
}

rule cp {
    [<Name> | <choice> | <seq>] [? | * | +]?
}

rule choice {
    ( <S>? <cp> [ <S>? | <S>? <cp> ]+ <S>? )
}

rule seq {
    ( <S>? <cp> [ <S>? , <S>? <cp> ]* <S>? )
}

rule Mixed {
    ( <S>? #PCDATA [<S>? | <S>? <Name>]* <S>? )* | ( <S>? #PCDATA <S>?
+ ) 
}

rule AttlistDecl {
    <'<!ATTLIST'> <S> <Name> <AttDef>* <S>? <'>'>
}

rule AttDef {
    <S> <Name> <S> <AttType> <S> <DefaultDecl>
}

rule AttType {
    <StringType> | <TokenizedType> | <EnumeratedType> 
}

rule StringType {
    CDATA
}

rule TokenizedType {
    ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS
}

rule EnumeratedType {
    <NotationType> | <Enumeration> 
}

rule NotationType {
    NOTATION <S> ( <S>? <Name> [<S>? | <S>? <Name>]* <S>? ) 
}

rule Enumeration {
    ( <S>? <Nmtoken> [<S>? | <S>? <Nmtoken>]* <S>? )
}

rule DefaultDecl {
    #REQUIRED | #IMPLIED | [[#FIXED <S><AttValue>]
}

rule conditionalSect {
    <includeSect> | <ignoreSect> 
}

rule includeSect {
    <'<!['> <S><extSubsetDecl> <']]>'> 
}

rule ignoreSect {
    <'<!['> <S><ignoreSectContents>* <']]>'>
}

rule ignoreSectContents {
    <Ignore> [<'<!['> <ignoreSectContents> <']]>'> <Ignore>]*
}

rule Ignore {
    <Char>* <!after [<Char>* [<'<!['> | <']]>'>] <Char>*]> 
}

rule CharRef {
    &# <[0-9]>+ ; | &#x <[0-9a-fA-F]>+ ;
}

rule Reference {
    <EntityRef> | <CharRef>
}

rule EntityRef {
    & <Name> ;
}

rule PEReference {
    <'%'> <Name> ;
}

rule EntityDecl {
    <GEDecl> | <PEDecl>
}

rule GEDecl {
    <'<!ENTITY'> <S> <Name> <S> <EntityDef> <S>? <'>'>
}

rule PEDecl {
    <'<!ENTITY'> <S> <'%'> <S> <Name> <S> <PEDef> <S>? <'>'>
}

rule EntityDef {
    <EntityValue> | [<ExternalID> <NDataDecl>?]
}

rule PEDef {
    <EntityValue> | <ExternalID>
}

rule ExternalID {
    SYSTEM <S> <SystemLiteral>| PUBLIC <S> <PubidLiteral> <S> <SystemL
+iteral> 
}

rule NDataDecl {
    <S> NDATA <S> <Name>
}

rule TextDecl {
    <'<?xml'> <VersionInfo>? <EncodingDecl> <S>? <'?>'>
}

rule extParsedEnt {
    <TextDecl>? <content>
}

rule extPE {
    <TextDecl>? <extSubsetDecl>
}

rule EncodingDecl {
    <S> encoding <Eq> [" <EncName> " | ' <EncName> ' ] 
}

rule EncName {
    <[A-Za-z]> [<[A-Za-z0-9._]> | -]*
}

rule NotationDecl {
    <'<!NOTATION'> <S> <Name> <S> [<ExternalID> | <PublicID>] <S>? <'>
+'>
}

rule PublicID {
    PUBLIC <S> <PubidLiteral> 
}

rule Letter {
    <BaseChar> | <Ideographic>
}

rule BaseChar {
    # Large block manually removed.
}

rule Ideographic {
    <[\x4E00-\x9FA5]> | \x3007 | <[\x3021-\x3029]> 
}

rule CombiningChar {
    # Large block manually removed.
}

rule Digit {
    # Large block manually removed.
}

rule Extender {
    # Large block manually removed.
}
[download]

I'm not sure about the <!after> parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.

Oh, it's not finished. It needs grammar XML { ... } wrapped around it. I leave that as an excercise for the reader ;)

I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...

In reply to Parsing EBNF with Perl 5 to eventually parse XML with Perl 6 by Juerd

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Problems? Is your data what you think it is?
	PerlMonks