Now that was child's play. It's only a few bytes different from the piece of code I wrote before.
Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at http://www.w3.org/TR/2000/REC-xml-20001006, and it has grammar in Extended Backus-Naur Form (name ::= definition). I started to translate:
document ::= prolog element Misc*
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#
+x10000-#x10FFFF]
S ::= (#x20 | #x9 | #xD | #xA)+
etcetera, etcetera
into:
rule document { <prolog> <element> <Misc>* }
rule Char { <[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]> }
rule S { <[\x20\x9\xD\xA]>+ }
rule NameChar { <Letter> | <Digit> | <[._:-]> | <CombiningChar> | <Ext
+ender> }
rule Name { [ <Letter | <[_:]> ] <NameChar>* }
rule Names { <Name> [<S> <Name>]* }
rule Nmtoken { <NameChar>+ }
rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* }
etcetera, etcetera
Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:
#!/usr/bin/perl -w
use strict;
use XML::TreeBuilder;
my $tree = XML::TreeBuilder->new;
$tree->parse_file('rec');
# http://www.w3.org/TR/2000/REC-xml-20001006.xml
for ($tree->look_down(_tag => 'prod')) {
print 'rule ';
print $_->look_down(_tag => 'lhs')->content_list;
print " {\n ";
my $not = 0;
my $p = 1;
my @foo = $_->look_down(_tag => 'rhs');
for (map { $_->content_list } @foo) {
if (ref) {
if (my $t = $_->look_down(_tag => 'nt')) {
print '<', $t->content_list, '>';
}
$not = 0, print '>' if $not and $p == $not;
next;
}
{
s/\xC2\xA0/ /g; # Non breaking space
s/\s+/ /g;
s/^S//g and do {
print '<S>';
$not = 0, print '>' if $not and $p == $not;
next;
};
s/^'([^']*)'// || s/^"([^"]*)"// and do {
my $m = $1;
if ($m =~ /[\@\$%<>:\\.]/) {
$m =~ s/\\/\\\\/g;
$m =~ s/'/\\'/g;
print "<'$m'>";
} else {
print $m;
}
$not = 0, print '>' if $not and $p == $not;
redo;
};
s/^\[([^]]*)\]// and do {
my $m = $1;
$m =~ s/#x/\\x/g;
my $neg = (($m =~ s/^\^//) ? '-' : '');
print "<$neg\[$m]>";
$not = 0, print '>' if $not and $p == $not;
redo;
};
s/^#x([0-9A-Fa-f]+)// and do {
print "\\x$1";
$not = 0, print '>' if $not and $p == $not;
redo;
};
s/^-// and do {
$not = $p;
print '<!after';
redo;
};
s/^\(// and do {
print '[';
++$p;
redo;
};
s/^\)// and do {
print ']';
--$p;
$not = 0, print '>' if $not and $p == $not;
redo;
};
s/^[|*+? ]+// and do {
(my $x = $&) =~ tr/()/[]/;
print $x;
redo;
};
next if not length;
print "ERROR: $_\n";
}
}
print "\n}\n\n";
}
Okay, let's see: there's a tree that is walked, printing the new version on the fly. This is done in an ugly way, but I think that's allowed for quick hacks.
And here is what it magically spit out:
rule document {
<prolog> <element> <Misc>*
}
rule Char {
\x9 | \xA | \xD | <[\x20-\xD7FF]> | <[\xE000-\xFFFD]> | <[\x10000-
+\x10FFFF]>
}
rule S {
[\x20 | \x9 | \xD | \xA]+
}
rule NameChar {
<Letter> | <Digit> | <'.'> | - | _ | <':'> | <CombiningChar> | <Ex
+tender>
}
rule Name {
[<Letter> | _ | <':'>] [<NameChar>]*
}
rule Names {
<Name> [<S> <Name>]*
}
rule Nmtoken {
[<NameChar>]+
}
rule Nmtokens {
<Nmtoken> [<S> <Nmtoken>]*
}
rule EntityValue {
" [<-[%&"]> | <PEReference> | <Reference>]* " | ' [<-[%&']> | <PER
+eference> | <Reference>]* '
}
rule AttValue {
" [<-[<&"]> | <Reference>]* " | ' [<-[<&']> | <Reference>]* '
}
rule SystemLiteral {
[" <-["]>* "] | [' <-[']>* ']
}
rule PubidLiteral {
" <PubidChar>* " | ' [<PubidChar> <!after '>]* '
}
rule PubidChar {
\x20 | \xD | \xA | <[a-zA-Z0-9]> | <[-'()+,./:=?;!*#@$_%]>
}
rule CharData {
<-[<&]>* <!after [<-[<&]>* <']]>'> <-[<&]>*]>
}
rule Comment {
<'<!--'> [[<Char> <!after ->] | [- [<Char> <!after ->]]]* <'-->'>
}
rule PI {
<'<?'> <PITarget> [<S> [<Char>* <!after [<Char>* <'?>'> <Char>*]>]
+]? <'?>'>
}
rule PITarget {
<Name> <!after [[X | x] [M | m] [L | l]]>
}
rule CDSect {
<CDStart> <CData> <CDEnd>
}
rule CDStart {
<'<![CDATA['>
}
rule CData {
[<Char>* <!after [<Char>* <']]>'> <Char>*]>]
}
rule CDEnd {
<']]>'>
}
rule prolog {
<XMLDecl>? <Misc>* [<doctypedecl> <Misc>*]?
}
rule XMLDecl {
<'<?xml'> <VersionInfo> <EncodingDecl>? <SDDecl>? <S>? <'?>'>
}
rule VersionInfo {
<S> version <Eq> [' <VersionNum> ' | " <VersionNum> "]
}
rule Eq {
<S>? = <S>?
}
rule VersionNum {
[<[a-zA-Z0-9_.:]> | -]+
}
rule Misc {
<Comment> | <PI> | <S>
}
rule doctypedecl {
<'<!DOCTYPE'> <S> <Name> [<S> <ExternalID>]? <S>? [[ [<markupdecl>
+ | <DeclSep>]* ] <S>?]? <'>'>
}
rule DeclSep {
<PEReference> | <S>
}
rule markupdecl {
<elementdecl> | <AttlistDecl> | <EntityDecl> | <NotationDecl> | <P
+I> | <Comment>
}
rule extSubset {
<TextDecl>? <extSubsetDecl>
}
rule extSubsetDecl {
[ <markupdecl> | <conditionalSect> | <DeclSep>]*
}
rule SDDecl {
<S> standalone <Eq> [[' [yes | no] '] | [" [yes | no] "]]
}
rule LanguageID {
<Langcode> [- <Subcode>]*
}
rule Langcode {
<ISO639Code> | <IanaCode> | <UserCode>
}
rule ISO639Code {
[<[a-z]> | <[A-Z]>] [<[a-z]> | <[A-Z]>]
}
rule IanaCode {
[i | I] - [<[a-z]> | <[A-Z]>]+
}
rule UserCode {
[x | X] - [<[a-z]> | <[A-Z]>]+
}
rule Subcode {
[<[a-z]> | <[A-Z]>]+
}
rule element {
<EmptyElemTag>| <STag> <content> <ETag>
}
rule STag {
<'<'> <Name> [<S> <Attribute>]* <S>? <'>'>
}
rule Attribute {
<Name> <Eq> <AttValue>
}
rule ETag {
<'</'> <Name> <S>? <'>'>
}
rule content {
<CharData>? [[<element> | <Reference> | <CDSect> | <PI> | <Comment
+>] <CharData>?]*
}
rule EmptyElemTag {
<'<'> <Name> [<S> <Attribute>]* <S>? <'/>'>
}
rule elementdecl {
<'<!ELEMENT'> <S> <Name> <S> <contentspec> <S>? <'>'>
}
rule contentspec {
EMPTY | ANY | <Mixed> | <children>
}
rule children {
[<choice> | <seq>] [? | * | +]?
}
rule cp {
[<Name> | <choice> | <seq>] [? | * | +]?
}
rule choice {
( <S>? <cp> [ <S>? | <S>? <cp> ]+ <S>? )
}
rule seq {
( <S>? <cp> [ <S>? , <S>? <cp> ]* <S>? )
}
rule Mixed {
( <S>? #PCDATA [<S>? | <S>? <Name>]* <S>? )* | ( <S>? #PCDATA <S>?
+ )
}
rule AttlistDecl {
<'<!ATTLIST'> <S> <Name> <AttDef>* <S>? <'>'>
}
rule AttDef {
<S> <Name> <S> <AttType> <S> <DefaultDecl>
}
rule AttType {
<StringType> | <TokenizedType> | <EnumeratedType>
}
rule StringType {
CDATA
}
rule TokenizedType {
ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS
}
rule EnumeratedType {
<NotationType> | <Enumeration>
}
rule NotationType {
NOTATION <S> ( <S>? <Name> [<S>? | <S>? <Name>]* <S>? )
}
rule Enumeration {
( <S>? <Nmtoken> [<S>? | <S>? <Nmtoken>]* <S>? )
}
rule DefaultDecl {
#REQUIRED | #IMPLIED | [[#FIXED <S><AttValue>]
}
rule conditionalSect {
<includeSect> | <ignoreSect>
}
rule includeSect {
<'<!['> <S><extSubsetDecl> <']]>'>
}
rule ignoreSect {
<'<!['> <S><ignoreSectContents>* <']]>'>
}
rule ignoreSectContents {
<Ignore> [<'<!['> <ignoreSectContents> <']]>'> <Ignore>]*
}
rule Ignore {
<Char>* <!after [<Char>* [<'<!['> | <']]>'>] <Char>*]>
}
rule CharRef {
&# <[0-9]>+ ; | &#x <[0-9a-fA-F]>+ ;
}
rule Reference {
<EntityRef> | <CharRef>
}
rule EntityRef {
& <Name> ;
}
rule PEReference {
<'%'> <Name> ;
}
rule EntityDecl {
<GEDecl> | <PEDecl>
}
rule GEDecl {
<'<!ENTITY'> <S> <Name> <S> <EntityDef> <S>? <'>'>
}
rule PEDecl {
<'<!ENTITY'> <S> <'%'> <S> <Name> <S> <PEDef> <S>? <'>'>
}
rule EntityDef {
<EntityValue> | [<ExternalID> <NDataDecl>?]
}
rule PEDef {
<EntityValue> | <ExternalID>
}
rule ExternalID {
SYSTEM <S> <SystemLiteral>| PUBLIC <S> <PubidLiteral> <S> <SystemL
+iteral>
}
rule NDataDecl {
<S> NDATA <S> <Name>
}
rule TextDecl {
<'<?xml'> <VersionInfo>? <EncodingDecl> <S>? <'?>'>
}
rule extParsedEnt {
<TextDecl>? <content>
}
rule extPE {
<TextDecl>? <extSubsetDecl>
}
rule EncodingDecl {
<S> encoding <Eq> [" <EncName> " | ' <EncName> ' ]
}
rule EncName {
<[A-Za-z]> [<[A-Za-z0-9._]> | -]*
}
rule NotationDecl {
<'<!NOTATION'> <S> <Name> <S> [<ExternalID> | <PublicID>] <S>? <'>
+'>
}
rule PublicID {
PUBLIC <S> <PubidLiteral>
}
rule Letter {
<BaseChar> | <Ideographic>
}
rule BaseChar {
# Large block manually removed.
}
rule Ideographic {
<[\x4E00-\x9FA5]> | \x3007 | <[\x3021-\x3029]>
}
rule CombiningChar {
# Large block manually removed.
}
rule Digit {
# Large block manually removed.
}
rule Extender {
# Large block manually removed.
}
I'm not sure about the
<!after> parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.
Oh, it's not finished. It needs grammar XML { ... } wrapped around it. I leave that as an excercise for the reader ;)
I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...