PerlMonks  

Embedding a mini-language for XML construction into Perl

by tmoertel (Chaplain)
on Nov 18, 2005 at 23:46 UTC (#510000=perlmeditation)

In this meditation, we will embed a mini-language for building XML documents into Perl. Our goal is to see how much syntax we can remove in pursuit of what Damian Conway calls "sufficiently advanced technologies." We want to make building XML just like writing native Perl:

    # html {
    #     head { title { text "Title" } };
    #     body {
    #         p { class_ "warning"; text "paragraph" }
    #     };
    # }

There is nothing particularly novel about this approach, and there are similar libraries for many programming languages. Our implementation, however, will stress Perlishness and simplicity. To eliminate clutter during the meditation, we will not make a module but instead expose the underlying code.

Here is our game plan. We will represent XML documents as trees of nested arrays and then render the trees as XML. A node in our tree will be either text (represented as a string) or an element (represented as a triple of the form [name, attributes, children_nodes]). Attributes will be pairs of the form [name, value]. (We will ignore namespaces, XML declarations, and other aspects of XML generation that don't add much to the meditation.)
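
As a small illustration of the representation just described, here is a hand-built tree for a single paragraph with mixed content. The helper count_elements is hypothetical (it is not part of the meditation) and only shows how the triples can be walked:

```perl
use strict;
use warnings;

# <p class="warning">Hello, <b>world</b>!</p> as nested arrays.
# Element: [name, attributes, children]; attribute: [name, value];
# text: a plain string.
my $p = [
    'p',
    [ [ 'class', 'warning' ] ],
    [ 'Hello, ', [ 'b', undef, ['world'] ], '!' ],
];

# Hypothetical helper: count the element nodes in a tree.
sub count_elements {
    my ($node) = @_;
    return 0 unless ref $node;               # text nodes are plain strings
    my ($name, $attrs, $children) = @$node;
    my $n = 1;                               # this element itself
    $n += count_elements($_) for @{ $children || [] };
    return $n;
}

print count_elements($p), "\n";   # prints 2 (the p and the b)
```

Note that an element with no attributes or no children carries undef in that slot, which is why the walker falls back to an empty list.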

To build a document, we will call functions that append elements, attributes, and text to the active node in the tree, redefining the active node in passing:

  • To add an element called name, we will call the function name and pass it a function that will construct its attributes and children.
  • To add an attribute called name, we will call the function name_ and pass it the attribute's value (mnemonic: the _ stands for the equals sign in name="val").
  • To add text, we will call the function text and pass it the text.
  • Finally, we will create one additional helper called doc that will create an empty document; we will use it to "root" documents created by the earlier functions.

To make it all seem more natural, we will use the (&) prototype on element-creating functions and doc. This lets us use braces to represent nesting when calling the functions:

    # doc {
    #     my_elem {
    #         # children go here
    #     };
    # };

Likewise, the attribute-creating functions and text get the ($) prototype. This lets us call them without having to use parentheses:

    # my_elem {
    #     text "some text";
    #     my_attr_ "value";
    # };
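
For readers who have not used prototypes before, the following stand-alone sketch shows the two effects in isolation. The names with_block and shout are made up for this illustration and do not appear in the meditation's code:

```perl
use strict;
use warnings;

# (&) lets the caller pass a bare block; it arrives as a code reference.
sub with_block(&) {
    my ($code) = @_;
    return $code->();
}

# ($) makes the function parse like a unary operator, so no parentheses
# are needed around its single argument.
sub shout($) {
    my ($s) = @_;
    return uc $s;
}

my $result = with_block { shout "hello" };   # no "sub", no parentheses
print "$result\n";                           # prints HELLO
```

Both prototypes only take effect for calls compiled after the functions are declared, which is why declaration order matters for this style of embedding.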

With the game plan in mind, let's work top down:

    our $__frag;  # points to fragment under active construction

    sub doc(&) {
        my ($content_fn) = @_;
        local $__frag = [undef,undef,undef];
        $content_fn->();
        $__frag->[2][0];
    }

    sub _elem {
        my ($elem_name, $content_fn) = @_;
        # an element is represented by the triple [name, attrs, children]
        my $elem = [$elem_name, undef, undef];
        do { local $__frag = $elem; $content_fn->() };
        push @{$__frag->[2]}, $elem;
    }

    sub _attr {
        my ($attr_name, $val) = @_;
        push @{$__frag->[1]}, [$attr_name, $val];
    }

    sub text($) {
        push @{$__frag->[2]}, @_;
    }

The functions _elem and _attr are helpers used by the following function, which lets us embed a custom XML vocabulary into Perl by creating the appropriate Perl functions for the vocabulary's elements and attributes:

    sub define_vocabulary {
        my ($elems, $attrs) = @_;
        eval "sub $_(&) { _elem('$_',\@_) }" for @$elems;
        eval "sub ${_}_(\$) { _attr('$_',\@_) }" for @$attrs;
    }

We can use the above function, for example, to embed a subset of XHTML into Perl:

    BEGIN {
        define_vocabulary(
            [qw( html head title body h1 h2 h3 p img br )],
            [qw( src href class style )]
        );
    }

(The use of BEGIN ensures that the embedded functions' prototypes are established before any remaining code is compiled.)

Let's try out our newly embedded vocabulary by dumping out the internal representation of a simple document:

    my $my_doc = doc {
        html {
            head { title { text "Title" } };
            body {
                p { class_ "warning"; text "paragraph" }
            };
        }
    };

    use Data::Dumper;
    $Data::Dumper::Indent = $Data::Dumper::Terse = 1;
    print Dumper $my_doc;

    # [
    #   'html',
    #   undef,
    #   [
    #     [
    #       'head',
    #       undef,
    #       [
    #         [
    #           'title',
    #           undef,
    #           [
    #             'Title'
    #           ]
    #         ]
    #       ]
    #     ],
    #     [
    #       'body',
    #       undef,
    #       [
    #         [
    #           'p',
    #           [
    #             [
    #               'class',
    #               'warning'
    #             ]
    #           ],
    #           [
    #             'paragraph'
    #           ]
    #         ]
    #       ]
    #     ]
    #   ]
    # ]

Good! That's just what we want.

All that is left for us to do is render the internal representation as XML. The simplicity of our internal representation makes this straightforward. Here's a renderer for XML::Writer:

    use XML::Writer;

    sub render_via_xml_writer {
        my $doc = shift;
        my $writer = XML::Writer->new(@_);  # extra args go to ->new()
        my $render_fn;
        $render_fn = sub {
            my $frag = shift;
            my ($elem, $attrs, $children) = @$frag;
            $writer->startTag( $elem, map {@$_} @$attrs );
            for (@$children) {
                ref() ? $render_fn->($_) : $writer->characters($_);
            }
            $writer->endTag($elem);
        };
        $render_fn->($doc);
        $writer->end();
    }

Now we can render our earlier document:

    render_via_xml_writer( $my_doc, DATA_MODE => 1, UNSAFE => 1 );

    # <html>
    # <head>
    # <title>Title</title>
    # </head>
    # <body>
    # <p class="warning">paragraph</p>
    # </body>
    # </html>

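
Because the tree is just nested arrays, a renderer does not have to go through XML::Writer. The following is a minimal sketch of a string-returning renderer, written only to show how little the representation demands. The names xml_escape and render_as_string are assumptions made for this example; it handles no namespaces or declarations and writes empty elements as an open/close pair:

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical stand-in for render_via_xml_writer: returns the XML as a string.
sub render_as_string {
    my ($node) = @_;
    return xml_escape($node) unless ref $node;      # text node
    my ($name, $attrs, $children) = @$node;
    my $attr_str = join '',
        map { sprintf ' %s="%s"', $_->[0], xml_escape($_->[1]) }
        @{ $attrs || [] };
    my $body = join '', map { render_as_string($_) } @{ $children || [] };
    return "<$name$attr_str>$body</$name>";
}

my $tree = [ 'p', [ [ 'class', 'warning' ] ], [ 'a < b' ] ];
print render_as_string($tree), "\n";   # <p class="warning">a &lt; b</p>
```

For real documents XML::Writer remains the safer choice, since it also checks well-formedness as it writes.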
In most cases we will render documents shortly after creating them. We can "huffmanize" this common case with another helper, which supplies the outer doc for us and then renders the resulting tree:

    sub render_doc(&) {
        my $docfn = shift;
        render_via_xml_writer(
            doc( \&$docfn ),
            DATA_MODE => 1,
            UNSAFE => 1
        );
    }

Our final example shows the fruits of our labors. We have successfully embedded a custom subset of XHTML into Perl. Now we can use it to create XML fragments with very little syntactic overhead. Further, because our embedding is "just Perl," we can freely mix code and fragments to do the work of template engines:

    render_doc {
        html {
            head { title { text "My grand document!" } };
            body {
                h1 { text "Heading" };
                p { class_ "first";              # attribute class="first"
                    text "This is the first paragraph!";
                    style_ "font: bold";         # another attr
                };
                # it's just Perl, so we can mix in other code
                for (2..5) {
                    p { text "Plus paragraph number $_." }
                }
            };
        };
    };

    # <html>
    # <head>
    # <title>My grand document!</title>
    # </head>
    # <body>
    # <h1>Heading</h1>
    # <p class="first" style="font: bold">This is the first paragraph!</p>
    # <p>Plus paragraph number 2.</p>
    # <p>Plus paragraph number 3.</p>
    # <p>Plus paragraph number 4.</p>
    # <p>Plus paragraph number 5.</p>
    # </body>
    # </html>

Thanks for taking the time to read this meditation! If you find anything about it unclear, or can think of a way to improve my writing, please let me know.

Cheers
Tom

Re: Embedding a mini-language for XML construction into Perl
by Juerd (Abbot) on Nov 19, 2005 at 00:26 UTC

    But why?

Re: Embedding a mini-language for XML construction into Perl
by ambrus (Abbot) on Nov 19, 2005 at 15:14 UTC

    I liked this meditation. However, there's one thing I didn't like: the eval call in define_vocabulary, which is unnecessary.

    Let me show the way to do the same without eval.

        sub define_vocabulary {
            no strict "refs";
            my($elems, $attrs) = @_;
            for (@$elems) {
                my $name = $_;
                *{$_} = sub(&) { _elem($name, @_) };
            }
            for (@$attrs) {
                my $name = $_;
                *{$_ . "_"} = sub($) { _attr($name, @_) };
            }
        }

      I had considered not using eval, but in this case I could see no practical advantage to the alternatives – and eval was more consistent with my goal of simplicity. For non-trivial uses, however, I agree that eval's costs (e.g., quoting clutter, parsing overhead, and security concerns) almost always outweigh its benefits.

      Still, if you think there is something inherently evil about eval that ought to eliminate it from all consideration, I would be interested in hearing your reasoning.

      Thanks for your comment.

      Cheers,
      Tom

        Lots of people misuse eval. They call eval more often than necessary, or compile insecure code with it. There are few legitimate applications where eval is really useful, and these misuses are very common, so I've grown to dislike eval (eval-string of course, not eval-block).

        Here, the speed problem doesn't apply, as the eval is run only a few times at program startup, so there's nothing wrong in principle with using eval in your application.

        However, I still feel that eval is too powerful for something as simple as creating a set of similar functions, which can be done without eval. I just imagine eval as a hairy monster that I don't want to allow in my house even when it's well controlled, does the washing-up and does no harm. Also, I wouldn't like people to think such things are only possible with eval, because that could lead to overuse of eval again.

        Still, if you think there is something inherently evil about eval that ought to eliminate it from all consideration, I would be interested in hearing your reasoning.

        Your eval version actually re-compiles the subroutines several times; if you use the symbol table version, this is only done once. While right now it might not seem a big issue, if these subroutines get more complex (input validation or something like that) it might become more of a cost.

        I could also see a usefulness for being able to define a vocabulary inside a package other than the current one. This is, of course, possible with the eval version, but using the symbol table version it would be easier to check for accidental overriding of methods.

        I guess my point is that while eval works just fine now, it will likely not scale very well, and since the symbol table approach is not that much more complex, it probably makes sense to use that and leave room to scale.

        -stvn
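
        The point about installing a vocabulary into another package and checking for accidental overrides could look roughly like the sketch below. It is an illustration only: define_elements_in, its $pkg parameter, the package name My::Vocab, and the die-on-clash policy are all assumptions made for this example, not code from the thread. The _elem helper is copied from the meditation so the block runs on its own:

```perl
use strict;
use warnings;

our $__frag;   # active fragment, as in the meditation

sub _elem {
    my ($elem_name, $content_fn) = @_;
    my $elem = [$elem_name, undef, undef];
    do { local $__frag = $elem; $content_fn->() };
    push @{$__frag->[2]}, $elem;
}

# Hypothetical variant of define_vocabulary (elements only): installs the
# functions into $pkg and refuses to clobber a sub that already exists there.
sub define_elements_in {
    my ($pkg, @names) = @_;
    no strict 'refs';
    for my $name (@names) {
        die "refusing to redefine ${pkg}::$name\n"
            if defined &{"${pkg}::$name"};
        *{"${pkg}::$name"} = sub (&) { _elem($name, @_) };
    }
}

BEGIN { define_elements_in('My::Vocab', qw( p br )) }

# A second attempt to install 'p' is caught instead of silently overriding it.
my $ok = eval { define_elements_in('My::Vocab', 'p'); 1 };
print $ok ? "installed\n" : "error: $@";
# prints: error: refusing to redefine My::Vocab::p
```

        The existence check is a one-liner here because the glob name is already in hand; with string eval the same check would need to be written into the generated source.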

        Here’s an idea: if you use assignment to globs, you can do it with local from within the doc function. That way, you could conjure the element-name functions into existence for the duration of document construction, and have them wink out of being as soon as the document is completed.

        Makeshifts last the longest.

Re: Embedding a mini-language for XML construction into Perl
by shmem (Canon) on Nov 21, 2005 at 01:54 UTC

    Thanks for this nifty piece of code.

    I have been playing around and made your code into a module - certainly one which pollutes the caller's namespace, since all entity and attribute functions have to be exported. I don't know how to keep the clear syntax and have it OO at the same time. Hmm...

    But apart from this, I suggest some changes in the rules of your game:

    • To add an element called name, we call the function NAME and pass it a function that will construct its attributes and children.
    • A dash in attribute names must be replaced by an underscore. Similar means must be provided for any character that is not allowed in perl subroutine names but is present in attribute names (e.g., the colon in xml:lang).

    The first change improves legibility and avoids clashes with perl core functions (e.g., tr/// vs. <tr>). The second is necessary per DTD.

    Oh yes. About eval vs. symbol table - I don't see much difference here. Recompiled every time? no, each entity/attribute sub is created once via eval, and done. What shows up via perl -MO=Deparse is that every block as an argument to a subroutine prototyped as & gets its own code reference at each call of the sub if it occurs in a different context, but that's regardless of the use of eval. All blocks in calls to p in this snippet

        # it's just Perl, so we can mix in other code
        for (2..5) {
            p { text "Plus paragraph number $_." }
        }

    will have the same CODE reference, but the next call to p will have its own.

    After all, eval is not evil. The hairy monster is -- perl, the father of perl's eval. What eval is used for is up to the programmer, and if people use eval to compile insecure code - then I guess the surrounding code isn't any better.

    --shmem

    Update

    hmm, this doesn't seem to be a minimal-cost interface because of the reasons stated above - because each block gets its own reference which doesn't get destroyed or re-used. Wrap render_doc { }; into a sub and call it 10000 times and you'll end up with +192 MB of memory...

    Update

    The problem seems to be the anonymous $render_fn which doesn't get deallocated. Changing the block to

        sub render_via_xml_writer {
            my $doc = shift;
            my $writer = XML::Writer->new(@_);  # extra args go to ->new()
            # my $render_fn;
            # $render_fn = sub {
            sub render_fn {
                my $frag = shift;
                my ($elem, $attrs, $children) = @$frag;
                $writer->startTag( $elem, map {@$_} @$attrs );
                for (@$children) {
                    # ref() ? $render_fn->($_) : $writer->characters($_);
                    ref() ? render_fn($_) : $writer->characters($_);
                }
                $writer->endTag($elem);
            };
            # $render_fn->($doc);
            render_fn($doc);
            $writer->end();
        }

    solves this issue. Sub in a sub? yes, render_fn has to see the my()-variables of render_via_xml_writer.
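
    A caveat and an alternative, offered as a sketch rather than a verdict on the fix above. The original leak is consistent with a reference cycle: the anonymous sub closes over $render_fn, which holds a reference to that same sub, so it is never freed. A named sub nested inside another sub avoids the cycle, but it binds to the outer lexicals of the first call only (perl warns that the variable "will not stay shared"), so later calls may keep using the first $writer. One way to keep the anonymous sub and still break the cycle is Scalar::Util::weaken. The example below uses a hypothetical collect_text function in place of the XML::Writer renderer so that it runs on its own:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical stand-in for the renderer: gathers all text in a tree.
sub collect_text {
    my ($doc) = @_;
    my @out;                        # per-call state, playing the role of $writer
    my $walk;
    my $strong = $walk = sub {
        my ($node) = @_;
        if (ref $node) { $walk->($_) for @{ $node->[2] || [] } }
        else           { push @out, $node }
    };
    weaken($walk);      # the closure now refers to itself only weakly
    $strong->($doc);    # $strong keeps the sub alive until we return
    return join '', @out;
}

my $doc = [ 'p', undef, [ 'Hello, ', [ 'b', undef, ['world'] ], '!' ] ];
print collect_text($doc), "\n";   # prints Hello, world!
```

    A simpler variant is to write undef $render_fn; after the last call, at the cost of leaking if an exception is thrown before that line is reached.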
Re: Embedding a mini-language for XML construction into Perl
by metaperl (Curate) on Nov 21, 2005 at 19:32 UTC
    Here is our game plan. We will represent XML documents as trees of nested arrays
    I know this was a gymnastics exercise, but if you want to see a CPAN implementation see new_from_lol() in HTML::Element
Simplifying the syntax further
by tmoertel (Chaplain) on Nov 21, 2005 at 21:15 UTC
    To further reduce the syntax burden, we can eliminate many calls to the text constructor by letting element constructors accept an optional argument after the block for text content (it becomes the third argument to _elem). In the common case, we no longer need to call text. (Of course, should we want to call text for clarity, we still can.)

    For example, the following fragment:

        html {
            head { title { text "Title" } };
            body {
                p { class_ "warning"; text "paragraph" }
            }
        };

    can be simplified to this:

        html {
            head { title {} "Title" };
            body {
                p { class_ "warning" } "paragraph"
            }
        };

    To effect the new syntax rules, we need only change the _elem and define_vocabulary functions from our original implementation. The changes are simple and marked with a hash-bang (#!):

        sub _elem {
            my ($elem_name, $content_fn, $text) = @_;  #! added $text arg
            # an element is represented by the triple [name, attrs, children]
            my $elem = [$elem_name, undef, undef];
            do {
                local $__frag = $elem;
                $content_fn->();
                text($text) if defined $text;  #! new line
            };
            push @{$__frag->[2]}, $elem;
        }

        sub define_vocabulary {
            my ($elems, $attrs) = @_;
            eval "sub $_(&@) { _elem('$_',\@_) }" for @$elems;  #! proto
            eval "sub ${_}_(\$) { _attr('$_',\@_) }" for @$attrs;
        }

    Can you spot any other syntax-reduction opportunities?

    Cheers,
    Tom

      That looks uglier and is less obvious. I’d prefer if it were possible to have the block’s value taken as its text content. This seemed tricky at first, because you don’t want to force users to put an explicit return;, undef;, ''; or whatever at the end of a block to avoid having the last expression of every block added as text content. After some reflection, however, it’s not tricky at all.

      There are only two cases: either you have a complex element with multiple children, be they sub-elements or full-on mixed content; or you have an element with nothing but text in it. These are clearly distinguishable: if the element has nothing but text in it, it won’t have any children yet when the block returns; if the element has complex content, it will already have explicitly constructed children when the block returns.

      The changes for this turn out even more trivial. Here’s the original _elem modified to meet this spec, with hashbangs:

          sub _elem {
              my ( $elem_name, $content_fn ) = @_;
              # an element is represented by the triple [name, attrs, children]
              my $elem = [ $elem_name, undef, undef ];
              my $ret = do { local $__frag = $elem; $content_fn->(); };  #! keep retval
              # add the block's value as text only if no children were built explicitly
              push @{ $elem->[2] }, $ret
                  if defined $ret and not @{ $elem->[2] || [] };  #! new line
              push @{ $__frag->[2] }, $elem;
          }

      The only case where you get strange behaviour is when the block for an empty element contains code, something like br { ++$breaks } – this would now have to be written as br { ++$breaks; undef }. But you can now say

          html {
              head { title { "Lorem ipsum" } };
              body {
                  # ...
              };
          }

      If you use text or if you construct any other element explicitly, the block’s return value will not interfere.

      Makeshifts last the longest.

        Good contribution! Your simplified text syntax is prettier and more intuitive.

        The corner case where an empty element contains code is a small blemish and an easy price to pay for the benefits of the simplified text syntax. Maybe we can even reduce the blemish by introducing another helper that declares a block to represent an empty content model:

        sub empty(&) { shift->(); undef }
        Then the corner-case becomes:
        doc { br { empty { ++$breaks } } }
        It's still not perfect, but maybe we can think of yet another improvement.

        Cheers,
        Tom

Node Type: perlmeditation [id://510000]
Approved by thor
Front-paged by stvn