Beefy Boxes and Bandwidth Generously Provided by pair Networks Frank
Welcome to the Monastery
 
PerlMonks  

Parsing EBNF with Perl 5 to eventually parse XML with Perl 6

by Juerd (Abbot)
on Jul 05, 2002 at 22:20 UTC ( #179755=perlmeditation: print w/ replies, xml ) Need Help??

Updates - several code changes. There are probably more bugs to be fixed... (Geez, I'm fixing bugs in code that can't even be executed? :)

After reading Parsing with Perl 6, I wanted to write Perl 6 regexes too. Just for fun, and to get used to the new powerful regex language.

I once wrote a very simple xml-ish well formedness checking regex, but it uses PCRE with a patch, so it isn't really useful. Let's turn it into something even less useful, I thought. So I translated it to Perl 6:

rule xml :i { ^ \s* [ # Single tags like <foo/> \< \s* <[a-z:]>+ (?: \s*<[a-z:]>* \s* = \s* (?:' <[^']>* ' | " <[^"]>* ") )* \s* /\s* \> | # Tags in pairs like <foo>content</foo> \< \s* $1 := (<[a-z:]>+) [ \s*<[a-z:]>* \s* = \s* [ ' <[^']>* ' | " <[^"]>* " ] ]* \s* \> [ <[^<>]>* | <xml> ]* \< \s* / \s* $1 \s* \> ] \s* }
Now that was child's play. It's only a few bytes different from the piece of code I wrote before.

Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at http://www.w3.org/TR/2000/REC-xml-20001006, and it has grammar in Extended Backus-Naur Form (name ::= definition). I started to translate:

document ::= prolog element Misc* Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [# +x10000-#x10FFFF] S ::= (#x20 | #x9 | #xD | #xA)+ etcetera, etcetera
into:
rule document { <prolog> <element> <Misc>* } rule Char { <[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]> } rule S { <[\x20\x9\xD\xA]>+ } rule NameChar { <Letter> | <Digit> | <[._:-]> | <CombiningChar> | <Ext +ender> } rule Name { [ <Letter | <[_:]> ] <NameChar>* } rule Names { <Name> [<S> <Name>]* } rule Nmtoken { <NameChar>+ } rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* } etcetera, etcetera

Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:

#!/usr/bin/perl -w use strict; use XML::TreeBuilder; my $tree = XML::TreeBuilder->new; $tree->parse_file('rec'); # http://www.w3.org/TR/2000/REC-xml-20001006.xml for ($tree->look_down(_tag => 'prod')) { print 'rule '; print $_->look_down(_tag => 'lhs')->content_list; print " {\n "; my $not = 0; my $p = 1; my @foo = $_->look_down(_tag => 'rhs'); for (map { $_->content_list } @foo) { if (ref) { if (my $t = $_->look_down(_tag => 'nt')) { print '<', $t->content_list, '>'; } $not = 0, print '>' if $not and $p == $not; next; } { s/\xC2\xA0/ /g; # Non breaking space s/\s+/ /g; s/^S//g and do { print '<S>'; $not = 0, print '>' if $not and $p == $not; next; }; s/^'([^']*)'// || s/^"([^"]*)"// and do { my $m = $1; if ($m =~ /[\@\$%<>:\\.]/) { $m =~ s/\\/\\\\/g; $m =~ s/'/\\'/g; print "<'$m'>"; } else { print $m; } $not = 0, print '>' if $not and $p == $not; redo; }; s/^\[([^]]*)\]// and do { my $m = $1; $m =~ s/#x/\\x/g; my $neg = (($m =~ s/^\^//) ? '-' : ''); print "<$neg\[$m]>"; $not = 0, print '>' if $not and $p == $not; redo; }; s/^#x([0-9A-Fa-f]+)// and do { print "\\x$1"; $not = 0, print '>' if $not and $p == $not; redo; }; s/^-// and do { $not = $p; print '<!after'; redo; }; s/^\(// and do { print '['; ++$p; redo; }; s/^\)// and do { print ']'; --$p; $not = 0, print '>' if $not and $p == $not; redo; }; s/^[|*+? ]+// and do { (my $x = $&) =~ tr/()/[]/; print $x; redo; }; next if not length; print "ERROR: $_\n"; } } print "\n}\n\n"; }
Okay, let's see: there's a tree that is walked, printing the new version on the fly. This is done in an ugly way, but I think that's allowed for quick hacks.

And here is what it magically spit out:

rule document { <prolog> <element> <Misc>* } rule Char { \x9 | \xA | \xD | <[\x20-\xD7FF]> | <[\xE000-\xFFFD]> | <[\x10000- +\x10FFFF]> } rule S { [\x20 | \x9 | \xD | \xA]+ } rule NameChar { <Letter> | <Digit> | <'.'> | - | _ | <':'> | <CombiningChar> | <Ex +tender> } rule Name { [<Letter> | _ | <':'>] [<NameChar>]* } rule Names { <Name> [<S> <Name>]* } rule Nmtoken { [<NameChar>]+ } rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* } rule EntityValue { " [<-[%&"]> | <PEReference> | <Reference>]* " | ' [<-[%&']> | <PER +eference> | <Reference>]* ' } rule AttValue { " [<-[<&"]> | <Reference>]* " | ' [<-[<&']> | <Reference>]* ' } rule SystemLiteral { [" <-["]>* "] | [' <-[']>* '] } rule PubidLiteral { " <PubidChar>* " | ' [<PubidChar> <!after '>]* ' } rule PubidChar { \x20 | \xD | \xA | <[a-zA-Z0-9]> | <[-'()+,./:=?;!*#@$_%]> } rule CharData { <-[<&]>* <!after [<-[<&]>* <']]>'> <-[<&]>*]> } rule Comment { <'<!--'> [[<Char> <!after ->] | [- [<Char> <!after ->]]]* <'-->'> } rule PI { <'<?'> <PITarget> [<S> [<Char>* <!after [<Char>* <'?>'> <Char>*]>] +]? <'?>'> } rule PITarget { <Name> <!after [[X | x] [M | m] [L | l]]> } rule CDSect { <CDStart> <CData> <CDEnd> } rule CDStart { <'<![CDATA['> } rule CData { [<Char>* <!after [<Char>* <']]>'> <Char>*]>] } rule CDEnd { <']]>'> } rule prolog { <XMLDecl>? <Misc>* [<doctypedecl> <Misc>*]? } rule XMLDecl { <'<?xml'> <VersionInfo> <EncodingDecl>? <SDDecl>? <S>? <'?>'> } rule VersionInfo { <S> version <Eq> [' <VersionNum> ' | " <VersionNum> "] } rule Eq { <S>? = <S>? } rule VersionNum { [<[a-zA-Z0-9_.:]> | -]+ } rule Misc { <Comment> | <PI> | <S> } rule doctypedecl { <'<!DOCTYPE'> <S> <Name> [<S> <ExternalID>]? <S>? [[ [<markupdecl> + | <DeclSep>]* ] <S>?]? <'>'> } rule DeclSep { <PEReference> | <S> } rule markupdecl { <elementdecl> | <AttlistDecl> | <EntityDecl> | <NotationDecl> | <P +I> | <Comment> } rule extSubset { <TextDecl>? <extSubsetDecl> } rule extSubsetDecl { [ <markupdecl> | <conditionalSect> | <DeclSep>]* } rule SDDecl { <S> standalone <Eq> [[' [yes | no] '] | [" [yes | no] "]] } rule LanguageID { <Langcode> [- <Subcode>]* } rule Langcode { <ISO639Code> | <IanaCode> | <UserCode> } rule ISO639Code { [<[a-z]> | <[A-Z]>] [<[a-z]> | <[A-Z]>] } rule IanaCode { [i | I] - [<[a-z]> | <[A-Z]>]+ } rule UserCode { [x | X] - [<[a-z]> | <[A-Z]>]+ } rule Subcode { [<[a-z]> | <[A-Z]>]+ } rule element { <EmptyElemTag>| <STag> <content> <ETag> } rule STag { <'<'> <Name> [<S> <Attribute>]* <S>? <'>'> } rule Attribute { <Name> <Eq> <AttValue> } rule ETag { <'</'> <Name> <S>? <'>'> } rule content { <CharData>? [[<element> | <Reference> | <CDSect> | <PI> | <Comment +>] <CharData>?]* } rule EmptyElemTag { <'<'> <Name> [<S> <Attribute>]* <S>? <'/>'> } rule elementdecl { <'<!ELEMENT'> <S> <Name> <S> <contentspec> <S>? <'>'> } rule contentspec { EMPTY | ANY | <Mixed> | <children> } rule children { [<choice> | <seq>] [? | * | +]? } rule cp { [<Name> | <choice> | <seq>] [? | * | +]? } rule choice { ( <S>? <cp> [ <S>? | <S>? <cp> ]+ <S>? ) } rule seq { ( <S>? <cp> [ <S>? , <S>? <cp> ]* <S>? ) } rule Mixed { ( <S>? #PCDATA [<S>? | <S>? <Name>]* <S>? )* | ( <S>? #PCDATA <S>? + ) } rule AttlistDecl { <'<!ATTLIST'> <S> <Name> <AttDef>* <S>? <'>'> } rule AttDef { <S> <Name> <S> <AttType> <S> <DefaultDecl> } rule AttType { <StringType> | <TokenizedType> | <EnumeratedType> } rule StringType { CDATA } rule TokenizedType { ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS } rule EnumeratedType { <NotationType> | <Enumeration> } rule NotationType { NOTATION <S> ( <S>? <Name> [<S>? | <S>? <Name>]* <S>? ) } rule Enumeration { ( <S>? <Nmtoken> [<S>? | <S>? <Nmtoken>]* <S>? ) } rule DefaultDecl { #REQUIRED | #IMPLIED | [[#FIXED <S><AttValue>] } rule conditionalSect { <includeSect> | <ignoreSect> } rule includeSect { <'<!['> <S><extSubsetDecl> <']]>'> } rule ignoreSect { <'<!['> <S><ignoreSectContents>* <']]>'> } rule ignoreSectContents { <Ignore> [<'<!['> <ignoreSectContents> <']]>'> <Ignore>]* } rule Ignore { <Char>* <!after [<Char>* [<'<!['> | <']]>'>] <Char>*]> } rule CharRef { &# <[0-9]>+ ; | &#x <[0-9a-fA-F]>+ ; } rule Reference { <EntityRef> | <CharRef> } rule EntityRef { & <Name> ; } rule PEReference { <'%'> <Name> ; } rule EntityDecl { <GEDecl> | <PEDecl> } rule GEDecl { <'<!ENTITY'> <S> <Name> <S> <EntityDef> <S>? <'>'> } rule PEDecl { <'<!ENTITY'> <S> <'%'> <S> <Name> <S> <PEDef> <S>? <'>'> } rule EntityDef { <EntityValue> | [<ExternalID> <NDataDecl>?] } rule PEDef { <EntityValue> | <ExternalID> } rule ExternalID { SYSTEM <S> <SystemLiteral>| PUBLIC <S> <PubidLiteral> <S> <SystemL +iteral> } rule NDataDecl { <S> NDATA <S> <Name> } rule TextDecl { <'<?xml'> <VersionInfo>? <EncodingDecl> <S>? <'?>'> } rule extParsedEnt { <TextDecl>? <content> } rule extPE { <TextDecl>? <extSubsetDecl> } rule EncodingDecl { <S> encoding <Eq> [" <EncName> " | ' <EncName> ' ] } rule EncName { <[A-Za-z]> [<[A-Za-z0-9._]> | -]* } rule NotationDecl { <'<!NOTATION'> <S> <Name> <S> [<ExternalID> | <PublicID>] <S>? <'> +'> } rule PublicID { PUBLIC <S> <PubidLiteral> } rule Letter { <BaseChar> | <Ideographic> } rule BaseChar { # Large block manually removed. } rule Ideographic { <[\x4E00-\x9FA5]> | \x3007 | <[\x3021-\x3029]> } rule CombiningChar { # Large block manually removed. } rule Digit { # Large block manually removed. } rule Extender { # Large block manually removed. }
I'm not sure about the <!after> parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.

Oh, it's not finished. It needs grammar XML { ... } wrapped around it. I leave that as an excercise for the reader ;)

I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...

Comment on Parsing EBNF with Perl 5 to eventually parse XML with Perl 6
Select or Download Code
Re: Parsing EBNF to eventually parse XML with Perl 6
by vladb (Vicar) on Jul 05, 2002 at 23:54 UTC
    Fascinating post, Juerd++.

    Perl 6 regexp indeed seems impressive. If it was up to me to decide, I guess I'd surely switch from using EBNF to Perl 6 regexp rules. {grin} As you've already mentioned, they bear a lot of similiarity. However, I prefer Perl's way as it more closely resembles language structure.

    _____________________
    # Under Construction
Re: Parsing EBNF with Perl 5 to eventually parse XML with Perl 6
by Matts (Deacon) on Jul 08, 2002 at 11:25 UTC
    Very interesting.

    The only thing to watch out for is that XML can't be parsed using the grammar alone. The comments in the text of the spec are normative, not the grammar. If that wasn't the case I would have used one of the grammar generators (like Parse::YAPP, which can produce a standalone grammar) for XML::SAX::PurePerl.

    Still, good effort. If perl 6 ever gets these regexps actually implemented (I have my doubts about that) then it's certainly going to be interesting writing parsers.

      Well, doubt no more, PGE has them implemented (as part of Parrot), and you can run them with Pugs today. :-)
Re: Parsing EBNF with Perl 5 to eventually parse XML with Perl 6
by Juerd (Abbot) on Aug 05, 2002 at 16:09 UTC

    I'm not sure about the <!after> parts

    I am now. All foo <!after bar> have to be <!before bar> foo. But I have no idea how the existing script can be modified to do this. Order needs to be changed and the groups of nodes make it hard to do. If you think you can fix it, please do and post your fix.

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://179755]
Approved by FoxtrotUniform
Front-paged by cjf
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2014-04-18 04:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (461 votes), past polls