http://www.perlmonks.org?node_id=179755

Updates - several code changes. There are probably more bugs to be fixed... (Geez, I'm fixing bugs in code that can't even be executed? :)

After reading Parsing with Perl 6, I wanted to write Perl 6 regexes too. Just for fun, and to get used to the new powerful regex language.

I once wrote a very simple xml-ish well formedness checking regex, but it uses PCRE with a patch, so it isn't really useful. Let's turn it into something even less useful, I thought. So I translated it to Perl 6:

rule xml :i { ^ \s* [ # Single tags like <foo/> \< \s* <[a-z:]>+ (?: \s*<[a-z:]>* \s* = \s* (?:' <[^']>* ' | " <[^"]>* ") )* \s* /\s* \> | # Tags in pairs like <foo>content</foo> \< \s* $1 := (<[a-z:]>+) [ \s*<[a-z:]>* \s* = \s* [ ' <[^']>* ' | " <[^"]>* " ] ]* \s* \> [ <[^<>]>* | <xml> ]* \< \s* / \s* $1 \s* \> ] \s* }
Now that was child's play. It's only a few bytes different from the piece of code I wrote before.

Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at http://www.w3.org/TR/2000/REC-xml-20001006, and it has grammar in Extended Backus-Naur Form (name ::= definition). I started to translate:

document ::= prolog element Misc* Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [# +x10000-#x10FFFF] S ::= (#x20 | #x9 | #xD | #xA)+ etcetera, etcetera
into:
rule document { <prolog> <element> <Misc>* } rule Char { <[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]> } rule S { <[\x20\x9\xD\xA]>+ } rule NameChar { <Letter> | <Digit> | <[._:-]> | <CombiningChar> | <Ext +ender> } rule Name { [ <Letter | <[_:]> ] <NameChar>* } rule Names { <Name> [<S> <Name>]* } rule Nmtoken { <NameChar>+ } rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* } etcetera, etcetera

Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:

#!/usr/bin/perl -w use strict; use XML::TreeBuilder; my $tree = XML::TreeBuilder->new; $tree->parse_file('rec'); # http://www.w3.org/TR/2000/REC-xml-20001006.xml for ($tree->look_down(_tag => 'prod')) { print 'rule '; print $_->look_down(_tag => 'lhs')->content_list; print " {\n "; my $not = 0; my $p = 1; my @foo = $_->look_down(_tag => 'rhs'); for (map { $_->content_list } @foo) { if (ref) { if (my $t = $_->look_down(_tag => 'nt')) { print '<', $t->content_list, '>'; } $not = 0, print '>' if $not and $p == $not; next; } { s/\xC2\xA0/ /g; # Non breaking space s/\s+/ /g; s/^S//g and do { print '<S>'; $not = 0, print '>' if $not and $p == $not; next; }; s/^'([^']*)'// || s/^"([^"]*)"// and do { my $m = $1; if ($m =~ /[\@\$%<>:\\.]/) { $m =~ s/\\/\\\\/g; $m =~ s/'/\\'/g; print "<'$m'>"; } else { print $m; } $not = 0, print '>' if $not and $p == $not; redo; }; s/^\[([^]]*)\]// and do { my $m = $1; $m =~ s/#x/\\x/g; my $neg = (($m =~ s/^\^//) ? '-' : ''); print "<$neg\[$m]>"; $not = 0, print '>' if $not and $p == $not; redo; }; s/^#x([0-9A-Fa-f]+)// and do { print "\\x$1"; $not = 0, print '>' if $not and $p == $not; redo; }; s/^-// and do { $not = $p; print '<!after'; redo; }; s/^\(// and do { print '['; ++$p; redo; }; s/^\)// and do { print ']'; --$p; $not = 0, print '>' if $not and $p == $not; redo; }; s/^[|*+? ]+// and do { (my $x = $&) =~ tr/()/[]/; print $x; redo; }; next if not length; print "ERROR: $_\n"; } } print "\n}\n\n"; }
Okay, let's see: there's a tree that is walked, printing the new version on the fly. This is done in an ugly way, but I think that's allowed for quick hacks.

And here is what it magically spit out:

rule document { <prolog> <element> <Misc>* } rule Char { \x9 | \xA | \xD | <[\x20-\xD7FF]> | <[\xE000-\xFFFD]> | <[\x10000- +\x10FFFF]> } rule S { [\x20 | \x9 | \xD | \xA]+ } rule NameChar { <Letter> | <Digit> | <'.'> | - | _ | <':'> | <CombiningChar> | <Ex +tender> } rule Name { [<Letter> | _ | <':'>] [<NameChar>]* } rule Names { <Name> [<S> <Name>]* } rule Nmtoken { [<NameChar>]+ } rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* } rule EntityValue { " [<-[%&"]> | <PEReference> | <Reference>]* " | ' [<-[%&']> | <PER +eference> | <Reference>]* ' } rule AttValue { " [<-[<&"]> | <Reference>]* " | ' [<-[<&']> | <Reference>]* ' } rule SystemLiteral { [" <-["]>* "] | [' <-[']>* '] } rule PubidLiteral { " <PubidChar>* " | ' [<PubidChar> <!after '>]* ' } rule PubidChar { \x20 | \xD | \xA | <[a-zA-Z0-9]> | <[-'()+,./:=?;!*#@$_%]> } rule CharData { <-[<&]>* <!after [<-[<&]>* <']]>'> <-[<&]>*]> } rule Comment { <'<!--'> [[<Char> <!after ->] | [- [<Char> <!after ->]]]* <'-->'> } rule PI { <'<?'> <PITarget> [<S> [<Char>* <!after [<Char>* <'?>'> <Char>*]>] +]? <'?>'> } rule PITarget { <Name> <!after [[X | x] [M | m] [L | l]]> } rule CDSect { <CDStart> <CData> <CDEnd> } rule CDStart { <'<![CDATA['> } rule CData { [<Char>* <!after [<Char>* <']]>'> <Char>*]>] } rule CDEnd { <']]>'> } rule prolog { <XMLDecl>? <Misc>* [<doctypedecl> <Misc>*]? } rule XMLDecl { <'<?xml'> <VersionInfo> <EncodingDecl>? <SDDecl>? <S>? <'?>'> } rule VersionInfo { <S> version <Eq> [' <VersionNum> ' | " <VersionNum> "] } rule Eq { <S>? = <S>? } rule VersionNum { [<[a-zA-Z0-9_.:]> | -]+ } rule Misc { <Comment> | <PI> | <S> } rule doctypedecl { <'<!DOCTYPE'> <S> <Name> [<S> <ExternalID>]? <S>? [[ [<markupdecl> + | <DeclSep>]* ] <S>?]? <'>'> } rule DeclSep { <PEReference> | <S> } rule markupdecl { <elementdecl> | <AttlistDecl> | <EntityDecl> | <NotationDecl> | <P +I> | <Comment> } rule extSubset { <TextDecl>? <extSubsetDecl> } rule extSubsetDecl { [ <markupdecl> | <conditionalSect> | <DeclSep>]* } rule SDDecl { <S> standalone <Eq> [[' [yes | no] '] | [" [yes | no] "]] } rule LanguageID { <Langcode> [- <Subcode>]* } rule Langcode { <ISO639Code> | <IanaCode> | <UserCode> } rule ISO639Code { [<[a-z]> | <[A-Z]>] [<[a-z]> | <[A-Z]>] } rule IanaCode { [i | I] - [<[a-z]> | <[A-Z]>]+ } rule UserCode { [x | X] - [<[a-z]> | <[A-Z]>]+ } rule Subcode { [<[a-z]> | <[A-Z]>]+ } rule element { <EmptyElemTag>| <STag> <content> <ETag> } rule STag { <'<'> <Name> [<S> <Attribute>]* <S>? <'>'> } rule Attribute { <Name> <Eq> <AttValue> } rule ETag { <'</'> <Name> <S>? <'>'> } rule content { <CharData>? [[<element> | <Reference> | <CDSect> | <PI> | <Comment +>] <CharData>?]* } rule EmptyElemTag { <'<'> <Name> [<S> <Attribute>]* <S>? <'/>'> } rule elementdecl { <'<!ELEMENT'> <S> <Name> <S> <contentspec> <S>? <'>'> } rule contentspec { EMPTY | ANY | <Mixed> | <children> } rule children { [<choice> | <seq>] [? | * | +]? } rule cp { [<Name> | <choice> | <seq>] [? | * | +]? } rule choice { ( <S>? <cp> [ <S>? | <S>? <cp> ]+ <S>? ) } rule seq { ( <S>? <cp> [ <S>? , <S>? <cp> ]* <S>? ) } rule Mixed { ( <S>? #PCDATA [<S>? | <S>? <Name>]* <S>? )* | ( <S>? #PCDATA <S>? + ) } rule AttlistDecl { <'<!ATTLIST'> <S> <Name> <AttDef>* <S>? <'>'> } rule AttDef { <S> <Name> <S> <AttType> <S> <DefaultDecl> } rule AttType { <StringType> | <TokenizedType> | <EnumeratedType> } rule StringType { CDATA } rule TokenizedType { ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS } rule EnumeratedType { <NotationType> | <Enumeration> } rule NotationType { NOTATION <S> ( <S>? <Name> [<S>? | <S>? <Name>]* <S>? ) } rule Enumeration { ( <S>? <Nmtoken> [<S>? | <S>? <Nmtoken>]* <S>? ) } rule DefaultDecl { #REQUIRED | #IMPLIED | [[#FIXED <S><AttValue>] } rule conditionalSect { <includeSect> | <ignoreSect> } rule includeSect { <'<!['> <S><extSubsetDecl> <']]>'> } rule ignoreSect { <'<!['> <S><ignoreSectContents>* <']]>'> } rule ignoreSectContents { <Ignore> [<'<!['> <ignoreSectContents> <']]>'> <Ignore>]* } rule Ignore { <Char>* <!after [<Char>* [<'<!['> | <']]>'>] <Char>*]> } rule CharRef { &# <[0-9]>+ ; | &#x <[0-9a-fA-F]>+ ; } rule Reference { <EntityRef> | <CharRef> } rule EntityRef { & <Name> ; } rule PEReference { <'%'> <Name> ; } rule EntityDecl { <GEDecl> | <PEDecl> } rule GEDecl { <'<!ENTITY'> <S> <Name> <S> <EntityDef> <S>? <'>'> } rule PEDecl { <'<!ENTITY'> <S> <'%'> <S> <Name> <S> <PEDef> <S>? <'>'> } rule EntityDef { <EntityValue> | [<ExternalID> <NDataDecl>?] } rule PEDef { <EntityValue> | <ExternalID> } rule ExternalID { SYSTEM <S> <SystemLiteral>| PUBLIC <S> <PubidLiteral> <S> <SystemL +iteral> } rule NDataDecl { <S> NDATA <S> <Name> } rule TextDecl { <'<?xml'> <VersionInfo>? <EncodingDecl> <S>? <'?>'> } rule extParsedEnt { <TextDecl>? <content> } rule extPE { <TextDecl>? <extSubsetDecl> } rule EncodingDecl { <S> encoding <Eq> [" <EncName> " | ' <EncName> ' ] } rule EncName { <[A-Za-z]> [<[A-Za-z0-9._]> | -]* } rule NotationDecl { <'<!NOTATION'> <S> <Name> <S> [<ExternalID> | <PublicID>] <S>? <'> +'> } rule PublicID { PUBLIC <S> <PubidLiteral> } rule Letter { <BaseChar> | <Ideographic> } rule BaseChar { # Large block manually removed. } rule Ideographic { <[\x4E00-\x9FA5]> | \x3007 | <[\x3021-\x3029]> } rule CombiningChar { # Large block manually removed. } rule Digit { # Large block manually removed. } rule Extender { # Large block manually removed. }
I'm not sure about the <!after> parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.

Oh, it's not finished. It needs grammar XML { ... } wrapped around it. I leave that as an excercise for the reader ;)

I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...