Beefy Boxes and Bandwidth Generously Provided by pair Networks Joe
Perl: the Markov chain saw
 
PerlMonks

Parsing EBNF with Perl 5 to eventually parse XML with Perl 6

by Juerd (Abbot)
 | Log in | Create a new user | The Monastery Gates | Super Search | 
 | Seekers of Perl Wisdom | Meditations | PerlMonks Discussion | 
 | Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | Tutorials | 
 | Poetry | Recent Threads | Newest Nodes | Donate | What's New | 

on Jul 05, 2002 at 22:20 UTC ( #179755=perlmeditation: print w/ replies, xml ) Need Help??

Updates - several code changes. There are probably more bugs to be fixed... (Geez, I'm fixing bugs in code that can't even be executed? :)

After reading Parsing with Perl 6, I wanted to write Perl 6 regexes too. Just for fun, and to get used to the new powerful regex language.

I once wrote a very simple xml-ish well formedness checking regex, but it uses PCRE with a patch, so it isn't really useful. Let's turn it into something even less useful, I thought. So I translated it to Perl 6:

rule xml :i { ^ \s* [ # Single tags like <foo/> \< \s* <[a-z:]>+ (?: \s*<[a-z:]>* \s* = \s* (?:' <[^']>* ' | " <[^"]>* ") )* \s* /\s* \> | # Tags in pairs like <foo>content</foo> \< \s* $1 := (<[a-z:]>+) [ \s*<[a-z:]>* \s* = \s* [ ' <[^']>* ' | " <[^"]>* " ] ]* \s* \> [ <[^<>]>* | <xml> ]* \< \s* / \s* $1 \s* \> ] \s* }
Now that was child's play. It's only a few bytes different from the piece of code I wrote before.

Then I thought that it would be nice to use the new powerful regexes to create a real XML grammar, so I checked out the specs. I found them at http://www.w3.org/TR/2000/REC-xml-20001006, and it has grammar in Extended Backus-Naur Form (name ::= definition). I started to translate:

document ::= prolog element Misc* Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [# +x10000-#x10FFFF] S ::= (#x20 | #x9 | #xD | #xA)+ etcetera, etcetera
into:
rule document { <prolog> <element> <Misc>* } rule Char { <[\x9\xA\xD\x20-\xD7FF\xE000\xFFFD\x10000-\x10FFFF]> } rule S { <[\x20\x9\xD\xA]>+ } rule NameChar { <Letter> | <Digit> | <[._:-]> | <CombiningChar> | <Ext +ender> } rule Name { [ <Letter | <[_:]> ] <NameChar>* } rule Names { <Name> [<S> <Name>]* } rule Nmtoken { <NameChar>+ } rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* } etcetera, etcetera

Darn it, the new regex language is too powerful. All I was doing was translating the EBNF to Perl6 regexes, without making real changes. That's no fun. So I created a Perl (5) script to do the work for me, using the available XML version of the document:

#!/usr/bin/perl -w use strict; use XML::TreeBuilder; my $tree = XML::TreeBuilder->new; $tree->parse_file('rec'); # http://www.w3.org/TR/2000/REC-xml-20001006.xml for ($tree->look_down(_tag => 'prod')) { print 'rule '; print $_->look_down(_tag => 'lhs')->content_list; print " {\n "; my $not = 0; my $p = 1; my @foo = $_->look_down(_tag => 'rhs'); for (map { $_->content_list } @foo) { if (ref) { if (my $t = $_->look_down(_tag => 'nt')) { print '<', $t->content_list, '>'; } $not = 0, print '>' if $not and $p == $not; next; } { s/\xC2\xA0/ /g; # Non breaking space s/\s+/ /g; s/^S//g and do { print '<S>'; $not = 0, print '>' if $not and $p == $not; next; }; s/^'([^']*)'// || s/^"([^"]*)"// and do { my $m = $1; if ($m =~ /[\@\$%<>:\\.]/) { $m =~ s/\\/\\\\/g; $m =~ s/'/\\'/g; print "<'$m'>"; } else { print $m; } $not = 0, print '>' if $not and $p == $not; redo; }; s/^\[([^]]*)\]// and do { my $m = $1; $m =~ s/#x/\\x/g; my $neg = (($m =~ s/^\^//) ? '-' : ''); print "<$neg\[$m]>"; $not = 0, print '>' if $not and $p == $not; redo; }; s/^#x([0-9A-Fa-f]+)// and do { print "\\x$1"; $not = 0, print '>' if $not and $p == $not; redo; }; s/^-// and do { $not = $p; print '<!after'; redo; }; s/^\(// and do { print '['; ++$p; redo; }; s/^\)// and do { print ']'; --$p; $not = 0, print '>' if $not and $p == $not; redo; }; s/^[|*+? ]+// and do { (my $x = $&) =~ tr/()/[]/; print $x; redo; }; next if not length; print "ERROR: $_\n"; } } print "\n}\n\n"; }
Okay, let's see: there's a tree that is walked, printing the new version on the fly. This is done in an ugly way, but I think that's allowed for quick hacks.

And here is what it magically spit out:

rule document { <prolog> <element> <Misc>* } rule Char { \x9 | \xA | \xD | <[\x20-\xD7FF]> | <[\xE000-\xFFFD]> | <[\x10000- +\x10FFFF]> } rule S { [\x20 | \x9 | \xD | \xA]+ } rule NameChar { <Letter> | <Digit> | <'.'> | - | _ | <':'> | <CombiningChar> | <Ex +tender> } rule Name { [<Letter> | _ | <':'>] [<NameChar>]* } rule Names { <Name> [<S> <Name>]* } rule Nmtoken { [<NameChar>]+ } rule Nmtokens { <Nmtoken> [<S> <Nmtoken>]* } rule EntityValue { " [<-[%&"]> | <PEReference> | <Reference>]* " | ' [<-[%&']> | <PER +eference> | <Reference>]* ' } rule AttValue { " [<-[<&"]> | <Reference>]* " | ' [<-[<&']> | <Reference>]* ' } rule SystemLiteral { [" <-["]>* "] | [' <-[']>* '] } rule PubidLiteral { " <PubidChar>* " | ' [<PubidChar> <!after '>]* ' } rule PubidChar { \x20 | \xD | \xA | <[a-zA-Z0-9]> | <[-'()+,./:=?;!*#@$_%]> } rule CharData { <-[<&]>* <!after [<-[<&]>* <']]>'> <-[<&]>*]> } rule Comment { <'<!--'> [[<Char> <!after ->] | [- [<Char> <!after ->]]]* <'-->'> } rule PI { <'<?'> <PITarget> [<S> [<Char>* <!after [<Char>* <'?>'> <Char>*]>] +]? <'?>'> } rule PITarget { <Name> <!after [[X | x] [M | m] [L | l]]> } rule CDSect { <CDStart> <CData> <CDEnd> } rule CDStart { <'<![CDATA['> } rule CData { [<Char>* <!after [<Char>* <']]>'> <Char>*]>] } rule CDEnd { <']]>'> } rule prolog { <XMLDecl>? <Misc>* [<doctypedecl> <Misc>*]? } rule XMLDecl { <'<?xml'> <VersionInfo> <EncodingDecl>? <SDDecl>? <S>? <'?>'> } rule VersionInfo { <S> version <Eq> [' <VersionNum> ' | " <VersionNum> "] } rule Eq { <S>? = <S>? } rule VersionNum { [<[a-zA-Z0-9_.:]> | -]+ } rule Misc { <Comment> | <PI> | <S> } rule doctypedecl { <'<!DOCTYPE'> <S> <Name> [<S> <ExternalID>]? <S>? [[ [<markupdecl> + | <DeclSep>]* ] <S>?]? <'>'> } rule DeclSep { <PEReference> | <S> } rule markupdecl { <elementdecl> | <AttlistDecl> | <EntityDecl> | <NotationDecl> | <P +I> | <Comment> } rule extSubset { <TextDecl>? <extSubsetDecl> } rule extSubsetDecl { [ <markupdecl> | <conditionalSect> | <DeclSep>]* } rule SDDecl { <S> standalone <Eq> [[' [yes | no] '] | [" [yes | no] "]] } rule LanguageID { <Langcode> [- <Subcode>]* } rule Langcode { <ISO639Code> | <IanaCode> | <UserCode> } rule ISO639Code { [<[a-z]> | <[A-Z]>] [<[a-z]> | <[A-Z]>] } rule IanaCode { [i | I] - [<[a-z]> | <[A-Z]>]+ } rule UserCode { [x | X] - [<[a-z]> | <[A-Z]>]+ } rule Subcode { [<[a-z]> | <[A-Z]>]+ } rule element { <EmptyElemTag>| <STag> <content> <ETag> } rule STag { <'<'> <Name> [<S> <Attribute>]* <S>? <'>'> } rule Attribute { <Name> <Eq> <AttValue> } rule ETag { <'</'> <Name> <S>? <'>'> } rule content { <CharData>? [[<element> | <Reference> | <CDSect> | <PI> | <Comment +>] <CharData>?]* } rule EmptyElemTag { <'<'> <Name> [<S> <Attribute>]* <S>? <'/>'> } rule elementdecl { <'<!ELEMENT'> <S> <Name> <S> <contentspec> <S>? <'>'> } rule contentspec { EMPTY | ANY | <Mixed> | <children> } rule children { [<choice> | <seq>] [? | * | +]? } rule cp { [<Name> | <choice> | <seq>] [? | * | +]? } rule choice { ( <S>? <cp> [ <S>? | <S>? <cp> ]+ <S>? ) } rule seq { ( <S>? <cp> [ <S>? , <S>? <cp> ]* <S>? ) } rule Mixed { ( <S>? #PCDATA [<S>? | <S>? <Name>]* <S>? )* | ( <S>? #PCDATA <S>? + ) } rule AttlistDecl { <'<!ATTLIST'> <S> <Name> <AttDef>* <S>? <'>'> } rule AttDef { <S> <Name> <S> <AttType> <S> <DefaultDecl> } rule AttType { <StringType> | <TokenizedType> | <EnumeratedType> } rule StringType { CDATA } rule TokenizedType { ID| IDREF| IDREFS| ENTITY| ENTITIES| NMTOKEN| NMTOKENS } rule EnumeratedType { <NotationType> | <Enumeration> } rule NotationType { NOTATION <S> ( <S>? <Name> [<S>? | <S>? <Name>]* <S>? ) } rule Enumeration { ( <S>? <Nmtoken> [<S>? | <S>? <Nmtoken>]* <S>? ) } rule DefaultDecl { #REQUIRED | #IMPLIED | [[#FIXED <S><AttValue>] } rule conditionalSect { <includeSect> | <ignoreSect> } rule includeSect { <'<!['> <S><extSubsetDecl> <']]>'> } rule ignoreSect { <'<!['> <S><ignoreSectContents>* <']]>'> } rule ignoreSectContents { <Ignore> [<'<!['> <ignoreSectContents> <']]>'> <Ignore>]* } rule Ignore { <Char>* <!after [<Char>* [<'<!['> | <']]>'>] <Char>*]> } rule CharRef { &# <[0-9]>+ ; | &#x <[0-9a-fA-F]>+ ; } rule Reference { <EntityRef> | <CharRef> } rule EntityRef { & <Name> ; } rule PEReference { <'%'> <Name> ; } rule EntityDecl { <GEDecl> | <PEDecl> } rule GEDecl { <'<!ENTITY'> <S> <Name> <S> <EntityDef> <S>? <'>'> } rule PEDecl { <'<!ENTITY'> <S> <'%'> <S> <Name> <S> <PEDef> <S>? <'>'> } rule EntityDef { <EntityValue> | [<ExternalID> <NDataDecl>?] } rule PEDef { <EntityValue> | <ExternalID> } rule ExternalID { SYSTEM <S> <SystemLiteral>| PUBLIC <S> <PubidLiteral> <S> <SystemL +iteral> } rule NDataDecl { <S> NDATA <S> <Name> } rule TextDecl { <'<?xml'> <VersionInfo>? <EncodingDecl> <S>? <'?>'> } rule extParsedEnt { <TextDecl>? <content> } rule extPE { <TextDecl>? <extSubsetDecl> } rule EncodingDecl { <S> encoding <Eq> [" <EncName> " | ' <EncName> ' ] } rule EncName { <[A-Za-z]> [<[A-Za-z0-9._]> | -]* } rule NotationDecl { <'<!NOTATION'> <S> <Name> <S> [<ExternalID> | <PublicID>] <S>? <'> +'> } rule PublicID { PUBLIC <S> <PubidLiteral> } rule Letter { <BaseChar> | <Ideographic> } rule BaseChar { # Large block manually removed. } rule Ideographic { <[\x4E00-\x9FA5]> | \x3007 | <[\x3021-\x3029]> } rule CombiningChar { # Large block manually removed. } rule Digit { # Large block manually removed. } rule Extender { # Large block manually removed. }
I'm not sure about the <!after> parts, and of course was not able to test any of the grammar output, but I do think it's nice to have XML grammar readily available this easy.

Oh, it's not finished. It needs grammar XML { ... } wrapped around it. I leave that as an excercise for the reader ;)

I wonder if in the future Perl 6 regexes are going to be used instead of EBNF...

Comment on Parsing EBNF with Perl 5 to eventually parse XML with Perl 6
Select or Download Code
Re: Parsing EBNF to eventually parse XML with Perl 6
by vladb (Vicar) on Jul 05, 2002 at 23:54 UTC
    Fascinating post, Juerd++.

    Perl 6 regexp indeed seems impressive. If it was up to me to decide, I guess I'd surely switch from using EBNF to Perl 6 regexp rules. {grin} As you've already mentioned, they bear a lot of similiarity. However, I prefer Perl's way as it more closely resembles language structure.

    _____________________
    # Under Construction
[reply]
[d/l]
Re: Parsing EBNF with Perl 5 to eventually parse XML with Perl 6
by Matts (Deacon) on Jul 08, 2002 at 11:25 UTC
    Very interesting.

    The only thing to watch out for is that XML can't be parsed using the grammar alone. The comments in the text of the spec are normative, not the grammar. If that wasn't the case I would have used one of the grammar generators (like Parse::YAPP, which can produce a standalone grammar) for XML::SAX::PurePerl.

    Still, good effort. If perl 6 ever gets these regexps actually implemented (I have my doubts about that) then it's certainly going to be interesting writing parsers.

[reply]
      Well, doubt no more, PGE has them implemented (as part of Parrot), and you can run them with Pugs today. :-)
[reply]
Re: Parsing EBNF with Perl 5 to eventually parse XML with Perl 6
by Juerd (Abbot) on Aug 05, 2002 at 16:09 UTC

    I'm not sure about the <!after> parts

    I am now. All foo <!after bar> have to be <!before bar> foo. But I have no idea how the existing script can be modified to do this. Order needs to be changed and the groups of nodes make it hard to do. If you think you can fix it, please do and post your fix.

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

[reply]
[d/l]
[select]

Back to Meditations


Login:
Password
remember me
What's my password?
Create A New User

Node Status
node history
Node Type: perlmeditation [id://179755]
Approved by FoxtrotUniform
Front-paged by cjf
help
Community Ads
Chatterbox
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users
Others romping around the Monastery: (10)
ikegami
GrandFather
atcroft
herveus
thezip
Eyck
ree
ssandv
gnosti
brap
As of 2009-11-21 04:34 GMT
Sections
The Monastery Gates
Seekers of Perl Wisdom
Meditations
PerlMonks Discussion
Categorized Q&A
Tutorials
Obfuscated Code
Perl Poetry
Cool Uses for Perl
Perl News
Information
PerlMonks FAQ
Guide to the Monastery
What's New at PerlMonks
Voting/Experience System
Tutorials
Reviews
Library
Perl FAQs
Other Info Sources
Find Nodes
Nodes You Wrote
Super Search
List Nodes By Users
Newest Nodes
Recently Active Threads
Selected Best Nodes
Best Nodes
Worst Nodes
Saints in our Book
Leftovers
The St. Larry Wall Shrine
Offering Plate
Awards
Craft
Snippets Section
Code Catacombs
Quests
Editor Requests
Buy PerlMonks Gear
PerlMonks Merchandise
Planet Perl
Perlsphere
Use Perl
Perl.com
Perl 5 Wiki
Perl Jobs
Perl Mongers
Perl Directory
Perl documentation
CPAN
Random Node
Voting Booth

Future historians will find that the material characteristic of the current era is...

Aluminium
Plastic
Oil
Water
Carbon dioxide
Copper
Iron
Silicon
Salt
Uranium
Hydrogen
Other

Results (726 votes), past polls