http://www.perlmonks.org?node_id=441275

Easy XML production using Lazy Evaluation, Closures, Currying, and overloaded operators.

Introduction

This tutorial/meditation is the result of my search for a package that can write XML fragments the way CGI.pm writes HTML fragments using nested functions. The functions in this mythical package take a list of entity contents to enclose between XML open and close tags. They also take an optional hash reference specifying XML attributes. Some of the entity contents might be nested XML entities, which include their own tags, attributes, contents, sub-entities, etc. The resulting nesting of XML elements should correspond to nested function calls in the code creating the XML fragment.

Motivation

I posted this here because it is an interesting exploration of lazy evaluation, closures, and currying function arguments, with a splash of operator overloading for that last little bit of syntactic sugar. If you do not know what these things are, or you have never seen an example that demonstrates how useful these things can be, you might find the following code interesting. On the other hand, if you have more experience than I do, you might have suggestions for better ways to accomplish this.

Instructions

This code was designed to be downloaded in one file and run as you read. At a few key places in the code there are __END__ tokens. Perl stops executing the code when it reaches the first of these tokens. To execute the code that follows one of these tokens, comment out the token by placing a '#' in front of it (like this: #__END__) and run the program again. The code will then be executed up to the next __END__ token. It might be useful to run the code in the debugger to see exactly how things happen, but sometimes, due to the recursive nature of the code, this can cause more confusion than understanding. If you do use the debugger, watching the $tag variable may help you keep track of where you are in the recursion.

The usual caveats for running someone else's code on your machine apply. I hope you will be able to understand the code before you run it. (That is why I included the __END__ tokens and the writeup.) This works on my machine. I cannot guarantee it will work on yours.

Or, you can just read the node. I'm not the boss of you.

On to the code!

Utility Functions

We start with the usual sanity check pragmas and two utility functions. The first function, escape_ents, converts the characters not allowed in XML data into representations that are legal in XML. The second function, stringify_attribs, converts a hash into XML attributes to be used in a tag. There are probably more robust ways to accomplish these things, but this way works and is simple to understand.

use warnings;
use strict;

sub escape_ents {
    local $_ = shift;
    s/&/&amp;/g;
    s/</&lt;/g;
    s/>/&gt;/g;
    s/"/&quot;/g;    # "
    s/'/&apos;/g;
    return $_;
}

sub stringify_attribs {
    return join '',
        map { ' ' . escape_ents( $_ ) . '="' . escape_ents( $_[0]{$_} ) . '"' }
        sort keys %{ $_[0] };
}
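To see what the two utility functions produce, here is a small self-contained sketch. It repeats the two functions so it can be run on its own; the sample text and the attribute names (id, class) are made up for illustration.

```perl
use strict;
use warnings;

# Copies of the two utility functions, so this snippet runs by itself.
sub escape_ents {
    local $_ = shift;
    s/&/&amp;/g;    # must run first, or it would re-escape the entities below
    s/</&lt;/g;
    s/>/&gt;/g;
    s/"/&quot;/g;
    s/'/&apos;/g;
    return $_;
}

sub stringify_attribs {
    return join '',
        map { ' ' . escape_ents($_) . '="' . escape_ents( $_[0]{$_} ) . '"' }
        sort keys %{ $_[0] };
}

# Illustrative inputs only.
my $text  = escape_ents('Tom & "Jerry" <cat>');
my $attrs = stringify_attribs( { id => 7, class => 'a&b' } );

print "$text\n";           # Tom &amp; &quot;Jerry&quot; &lt;cat&gt;
print "<item$attrs>\n";    # <item class="a&amp;b" id="7">
```

Note that the attributes come out sorted by name, because stringify_attribs sorts the hash keys before building the string.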

The XML_elem function

This is where the XML is generated. This function is used for the rest of this discussion so spend a few moments studying what it does. It's not too complicated.

The first argument is the name of the XML element. It is stored in the variable $tag.

The array @results is where the resulting XML pieces are stored.

If the second argument is a hash reference then the reference gets stored in $attribs.

The first thing stored in the @results array is the opening XML tag. If there are attributes stringify_attribs converts them to XML attributes inside the opening tag.

If there are no other arguments the opening XML tag is turned into an empty element tag, returned, and the function exits.

If there are remaining arguments they are processed one by one. If an argument is a code ref (used to produce XML-sub-elements) then execute the code, and store the results in @results. If it is plain text, make sure it doesn't contain any illegal characters by calling escape_ents; indent it to visually represent how the XML elements nest inside each other; and store the result in the @results array.

Finally append the closing XML tag to the @results array and return the results to the caller. XML_elem leaves it up to the caller to concatenate all the results together.

Pretty simple stuff. Nothing too complicated here.

sub XML_elem {
    my $tag = shift;
    my @results;    # return value

    # An optional hash ref of attributes may follow the tag name.
    my $attribs = ref( $_[0] ) eq 'HASH' ? shift : {};

    push @results, "<${tag}" . stringify_attribs( $attribs ) . ">";

    if( @_ == 0 ) {    # handle an empty element
        $results[0] =~ s|>$| />|;
        return @results;
    }

    foreach my $arg ( @_ ) {
        if( ref( $arg ) eq 'CODE' ) {
            # a sub-element: run it and indent its lines one level
            push @results, map { "  $_" } $arg->();    # <--- ???
        }
        else {
            # plain text: escape it and indent it one level
            push @results, '  ' . escape_ents( $arg );
        }
    }

    push @results, "</${tag}>";
    return @results;
}
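XML_elem decides what to do with each argument by looking at what ref returns for it. The sketch below isolates that test. The helper classify_arg is hypothetical and exists only for this illustration; it also ignores the fact that XML_elem accepts a hash reference only directly after the tag name.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: report how an argument
# would be treated, based on the value ref() returns for it.
sub classify_arg {
    my $arg  = shift;
    my $type = ref $arg;    # '' for plain scalars, otherwise 'HASH', 'CODE', ...
    return 'attributes'  if $type eq 'HASH';
    return 'sub-element' if $type eq 'CODE';
    return 'text';
}

print classify_arg( { ID => 0 } ),        "\n";    # attributes
print classify_arg( sub { '<leaf />' } ), "\n";    # sub-element
print classify_arg('root stuff'),         "\n";    # text
```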

First Attempt

The following code fragment demonstrates how I hoped to use XML_elem to produce XML. By using XML_elem as an argument to itself I wanted to produce nested XML elements. I hoped this example would produce three nested XML elements. The outermost element should have had the tag name "root". Inside the "root" element should have been the "branch" element, and inside the "branch" element should have been the "sub-branch" element. There are some attributes, text contents, and entities to be converted into their XML representation included to complete the example.

The results should look like this:

<root ID="0">
  <branch>
    <sub_branch foo="2">
      some contents &amp; entities, &quot;&lt;&gt;&quot;
    </sub_branch>
    other contents
  </branch>
  root stuff
</root>

print "\n == code fragment 1 ============================================= \n";
print join "\n",
    XML_elem( 'root', { ID => 0 },
        XML_elem( 'branch',
            XML_elem( 'sub_branch', { foo => 2 },
                'some contents & entities, "<>"'
            ),
            'other contents',
        ),
        'root stuff',
    );

Can you see where I went wrong? Try the code and see what happens. (Or just read on.)

The First Results

The problem is that Perl evaluates the nested XML_elem calls from the inside out. The innermost XML_elem call is evaluated before the function it is an argument to is called. When the enclosing XML_elem function is called, all it gets is the results list. The contents of that list are all strings, even though some of them are strings containing XML tags. So the XML tags and entities produced while creating the "sub_branch" element are escaped while they are being interpolated into the "branch" element. Then those results are escaped again when they are interpolated into the "root" element. The result is a big mess. It is valid XML, but not the XML I wanted.
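The effect can be reproduced with a much smaller function. This is a minimal sketch, not the code from this node: wrap is a made-up stand-in for XML_elem that escapes its content and records the order in which it is called.

```perl
use strict;
use warnings;

my @order;    # records which call actually runs first

# Made-up stand-in for XML_elem: escape the content, then wrap it in a tag.
sub wrap {
    my ( $tag, $content ) = @_;
    push @order, $tag;
    $content =~ s/&/&amp;/g;
    $content =~ s/</&lt;/g;
    $content =~ s/>/&gt;/g;
    return "<$tag>$content</$tag>";
}

# The inner call is an ordinary argument, so Perl runs it first and the
# outer call only ever sees the resulting string.
my $xml = wrap( 'outer', wrap( 'inner', 'a & b' ) );

print join( ', ', @order ), "\n";    # inner, outer
print "$xml\n";    # <outer>&lt;inner&gt;a &amp;amp; b&lt;/inner&gt;</outer>
```

The inner tags arrive as plain text, so the outer call escapes them, and the already escaped ampersand is escaped a second time.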

The Problem

Since none of the arguments to XML_elem are function references, the line marked by "<--- ???" in XML_elem is never executed. The way XML_elem is designed, it identifies an argument as an XML sub-element by determining whether the argument is a code reference. If it is a code reference then XML_elem does not escape any entities produced by that argument. A sub-element's function is responsible for escaping its own entities.

Note: In code fragment 1 the list of results from XML_elem is joined together, separated by newlines, before the results are printed. XML_elem returns a list so we can, in theory, nest XML elements created in one part of a program inside XML generated in another part of the program. (It doesn't work at this point but it will soon.) The code fragments in this meditation create the XML fragment all in one place, but, using the final product, you could distribute the creation of the XML elements to wherever it is most convenient.

Lazy Evaluation

What is needed is a way to keep the nested XML_elem calls from being evaluated until the calling function needs the results. This is called lazy evaluation. In Perl we can delay a call by passing a code reference as the argument and invoking it inside the receiving function, instead of calling the function in the argument list.

The problem is, if we use a function reference, how do we pass arguments to the function when it is called via the reference deep inside nested calls to XML_elem? The solution is to use a closure to bind the subroutine reference and the arguments together into a function that can be called with no arguments.

The First Closure

This deceptively simple subroutine, called "ml" for make lazy, solves our problem. Each time it is called it packages the subroutine reference in the variable $sub and all the remaining arguments in the array @args, and returns a reference to an anonymous subroutine. When the anonymous subroutine is called it simply returns the result of calling the subroutine, via the reference stored in $sub, using the arguments stored in @args. Using this method we can delay the execution of XML_elem until the anonymous function that ml returns is executed.

Since $sub and @args are lexical variables, declared with "my", a new set is created each time ml is called. Additionally, as long as we keep a reference to each subroutine that ml returns, each pair of $sub and @args variables will be kept. Perl keeps distinct copies of $sub and @args together inside each anonymous subroutine, so the reference to the subroutine implicitly carries the proper $sub and @args variables. This is the magic of closures used to produce lazy evaluation.

Comment out the following __END__ token to see it work.

__END__

# ml (Make Lazy) takes a reference to a subroutine as its 1st arg. All
# remaining args are saved and used as arguments for the subroutine when
# the results of execution are desired. Returns a subroutine that behaves
# almost exactly as if the subroutine had been called when ml was called.
sub ml {
    my $sub  = shift;
    my @args = @_;
    return sub { return $sub->( @args ); };
}
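A small experiment shows both properties at once: nothing runs when ml is called, and each returned subroutine remembers its own arguments. This is a sketch that repeats ml so it runs on its own; shout and @calls are made-up names used only to observe when the delayed calls happen.

```perl
use strict;
use warnings;

# ml as described above: capture a code ref and its arguments now, run later.
sub ml {
    my $sub  = shift;
    my @args = @_;
    return sub { return $sub->(@args) };
}

my @calls;    # filled in only when a delayed call actually runs

# Made-up function to delay: log the arguments, return them upper-cased.
sub shout { push @calls, "@_"; return uc "@_" }

my $first  = ml( \&shout, 'hello', 'world' );
my $second = ml( \&shout, 'goodbye' );

print scalar(@calls), "\n";    # 0  (nothing has run yet)
print $second->(), "\n";       # GOODBYE
print $first->(),  "\n";       # HELLO WORLD
print "@calls\n";              # goodbye hello world
```

The two closures were created in one order and executed in the other, yet each one found its own @args.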

Second Attempt

This code fragment demonstrates ml in action. Notice how all the arguments that were passed to XML_elem in the first attempt are now passed to ml preceded by a reference to the XML_elem function.

print "\n == code fragment 2 ============================================= \n";
print join "\n",
    XML_elem( 'root', { ID => 0 },
        ml( \&XML_elem, 'branch',
            ml( \&XML_elem, 'sub_branch', { foo => 2 },
                'some contents & entities, "<>"'
            ),
            'other contents',
        ),
        'root stuff',
    );

Currying Arguments Part 1

This works. It produces the desired output, but the function arguments look cluttered. Some nesting of function calls is desired because it indicates which XML elements are nested inside each other. But the amount of nesting in fragment 2 seems clumsy, and excessive. The clumsiness is most prominent in the inconsistency of the indentation scheme. Poor code layout, of course, does not necessarily mean poor code, but the fact that there is almost no way to consistently lay out the code in a way that indicates the code's purpose is a hint that there might be a better design.

I changed the design by changing how I call ml. Code fragment 2 crams too many arguments of different kinds into ml. The first argument is a reference to a subroutine. The second argument is the name of the XML tag. The remaining arguments are the contents of the XML element, including an optional attribute hash. These are all mashed together in one argument list. You have to count arguments for each function to determine what is what. Until I learned there was another way, this argument mashing seemed normal. However, there is a way to break these arguments apart: to specify the subroutine reference and the XML tag name in one place and the XML element contents and attributes somewhere else. This is called currying arguments.

When we curry the arguments to ml we call one function with the XML_elem subroutine reference and the tag name in one place, then, somewhere else we supply the remaining XML entity contents.

Currying arguments requires a way to keep track of which functions have been called with which arguments, and what remaining arguments are needed. What happens is when the first arguments are supplied the function returns a reference to an anonymous function that takes the remaining arguments. In the code fragment below the anonymous function pointers are stored in variables whose names are the names of the XML tags. This makes it easy to track which arguments go with which functions.

Some languages implement currying automatically. If you don't supply enough arguments to a function, it curries those arguments and returns an anonymous function where you supply the rest of the arguments. With Perl you have to be a little more explicit. Fortunately, we can implement currying in Perl using our old friend the closure. Our closure simply stores the first two arguments to ml in lexical variables, and returns a subroutine which takes the remaining arguments and then calls ml.
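Stripped of the XML details, explicit currying in Perl can be sketched like this. The names curry and tagged are hypothetical and are not part of the code in this node; the sketch fixes any number of leading arguments rather than exactly two.

```perl
use strict;
use warnings;

# Hypothetical generic helper: fix some leading arguments now and
# return a closure that accepts the remaining ones later.
sub curry {
    my $sub   = shift;
    my @first = @_;
    return sub { return $sub->( @first, @_ ) };
}

# Made-up function to curry: wrap the content in a tag.
sub tagged {
    my ( $tag, @content ) = @_;
    return "<$tag>@content</$tag>";
}

my $em = curry( \&tagged, 'em' );    # the tag name is supplied here ...

print $em->('really'), "\n";         # <em>really</em>
print $em->( 'very', 'much' ), "\n"; # <em>very much</em>
```

The tag name is given once, where the closure is created, and the content is given separately at each call.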

Comment out the following __END__ token to see it work.

__END__

# ca4ml (Curry Arguments for ML)
sub ca4ml {
    my $sub = shift;
    my $tag = shift;
    return sub {
        my @args = @_;
        return ml( $sub, $tag, @args );
    }
}

print "\n == code fragment 3 ============================================= \n";
my $root       = ca4ml( \&XML_elem, 'root' );
my $branch     = ca4ml( \&XML_elem, 'branch' );
my $sub_branch = ca4ml( \&XML_elem, 'sub_branch' );

print join "\n",
    $root->( { ID => 0 },
        $branch->(
            $sub_branch->( { foo => 2 },
                'some contents & entities "<>"'
            ),
            'other contents',
        ),
        'root stuff',
    )->();

Currying Results

Doesn't code fragment 3 look much nicer than code fragment 2? It is much easier to see which XML attributes and contents go in which elements. The code which produces the XML mimics the XML layout almost exactly.

The only tricky part is the final anonymous function call indicated by the ")->()" in the last line of the code fragment. Remember, ml returns an anonymous function. If we don't execute the outermost anonymous function no XML is produced. In fact, XML_elem is not called until that anonymous subroutine is executed. Up to that point all we've done is assemble a hierarchy of anonymous functions and argument lists which contain more anonymous functions. We must execute the outermost anonymous function to actually produce the XML.

There is one other issue with this implementation that can be improved. Notice how the first argument to ca4ml is always a reference to the subroutine XML_elem. To do what we want to do, it always will be. Well, if we know for sure what the argument will be, we can just specify it inside ca4ml.

The final example shows this change in action. I changed the name from ca4ml to make_func. This function, and the anonymous functions it returns, are the only interface to this XML writing technique.

Comment out the following __END__ token to see it work.

__END__

sub make_func {
    my $tag = shift;
    return sub {
        my @args = @_;
        return ml( \&XML_elem, $tag, @args );
    }
}

print "\n == code fragment 4 ============================================= \n";
$root       = make_func( 'root' );
$branch     = make_func( 'branch' );
$sub_branch = make_func( 'sub_branch' );

print join "\n",
    $root->( { ID => 0 },
        $branch->(
            $sub_branch->( { foo => 2 },
                'some contents & entities "<>"'
            ),
            'other contents',
        ),
        'root stuff',
    )->();

__END__

Closing Comments

In the introduction I said I was looking for a package which wrote XML using this nested function paradigm. What we have so far is not a package, but it can easily be turned into a package with a rather novel interface. The title mentions overloaded operators. Notice that I haven't overloaded any operators yet. You need to have a package to overload operators.
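As a rough idea of where operator overloading could fit, here is a minimal sketch that is not the design from this node. It assumes a hypothetical package named LazyXML whose objects wrap a delayed subroutine and overload stringification, so that using the object as a string forces the delayed call instead of an explicit "->()".

```perl
use strict;
use warnings;

# LazyXML is a hypothetical package name, used only for this sketch.
package LazyXML;

# Stringifying an object runs the delayed code it wraps and joins the lines.
use overload '""' => sub {
    my $self = shift;
    return join "\n", $self->{code}->();
};

sub new {
    my ( $class, $code ) = @_;
    return bless { code => $code }, $class;
}

package main;

my $ran  = 0;    # counts how often the delayed code has run
my $lazy = LazyXML->new( sub { $ran++; return ( '<root>', '</root>' ) } );

print "forced yet? $ran\n";    # forced yet? 0
print "$lazy\n";               # prints <root> and </root> on separate lines
print "forced yet? $ran\n";    # forced yet? 1
```

The overload pragma has to be used inside a package, which is why a package is needed before any operators can be overloaded.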