nick has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on an XHTML parser, and for the most part I really like XML::Twig: it's very powerful yet still fairly easy to use. However, there is still some weirdness that I cannot figure out how to fix.

For example, if I load a valid XHTML document, add an element to it (such as a form tag), and then output the parsed data, I would expect the output to be valid XHTML as well, but it isn't. See example:

This is my valid XHTML document:
<html>
  <head>
    <title>example</title>
  </head>
  <body>
    <text>Foo Bar</text>
  </body>
</html>

Here is my code, which will insert a form tag into the body.
#!/usr/bin/perl -w
use strict;
use XML::Twig;

sub main() {
    my $filename = $ARGV[0] ? $ARGV[0] : die "specify input filename.\n";
    my @tags = ('input', 'textarea', 'checkbox');
    my $xml = XML::Twig->new(
        keep_spaces_in => [ 'pre' ],
        pretty_print => 'indented',
        twig_roots => {
            'body' => \&insert_form_tags,
        },
        twig_print_outside_roots => 1,
    );
    $xml->parsefile("$filename");
    $xml->print;
}

sub insert_form_tags() {
    my ($xml, $body) = @_;
    my $form_group = 'process_group';
    my $form_name = 'process_requirements';
    my $form = $body->insert(
        form => {
            method => "Post",
            action => "submit.cgi",
        },
    );
}

&main();

This is simple code to insert form tags, and it works: the form tags are inserted. However, the output XML::Twig generates is invalid XHTML. See output:
<html>
  <head>
    <title>example</title>
  </head>
</html>
<html>
  <body>
    <form action="submit.cgi" method="Post">
      <text>Foo Bar</text>
    </form>
  </body>
</html>

Notice that the html tag is closed and then re-opened right after the head and before the body. When I try to parse this output file with XML::Twig, it reports an error right at that spot. Removing the offending lines resolves the problem.

So my question is, how do I get XML::Twig to output valid XHTML after adding this form tag?

Thanks in advance for any help!

- Nick

Re: XML::Twig generating invalid XHTML output??
by Tanktalus (Canon) on Feb 14, 2005 at 17:26 UTC

    This is not quite how I normally use XML::Twig (I avoid the handlers, but that's just me ;-}), but here's a potential solution. Change your constructor to:

    my $xml = XML::Twig->new(
        keep_spaces_in => [ 'pre' ],
        pretty_print => 'indented',
        twig_roots => {
            'body' => \&insert_form_tags,
            'html' => sub {},
        },
        twig_print_outside_roots => 1,
    );

    That seemed to solve it here.

      Thanks very much for such a quick response. That seemed to do the trick for me as well. Any ideas why? What does triggering an empty subroutine for the html tags change?
      - Nick

        Could it be that you're supposed to use twig_handlers instead of twig_roots, since you don't want body to become a root node? The example from the module's documentation is: (irrelevant details omitted)

        my $twig=XML::Twig->new(
            twig_handlers => {
                title  => sub { $_->set_gi('h2') },  # Change title tags to h2.
                para   => sub { $_->set_gi('p') },   # Change para to p.
                hidden => sub { $_->delete; },       # Remove hidden elements.
                list   => \&my_list_process,         # Process list elements.
            },
            pretty_print => 'indented',              # Output will be nicely formatted.
        );

        The answer to your question of "why" depends on how deep of an answer you're looking for.

        On the deepest level, I'm not even going to pretend to look at the XML::Twig code (or any of the myriad levels of code under it). So, no, no ideas why from that level.

        From a higher, XML::Twig-is-a-black-box level, I got this answer by playing with the code. The first thing I did was comment out the twig_print_outside_roots => 1, line. That showed me why you put it in in the first place: all of a sudden, everything outside of the body tag stopped being printed. That gave me the idea to try adding a new root such that everything would be in it; then you don't have anything outside of the roots (so that line is no longer really needed... I think). Since 'html' is the root of everything, I figured it would be the appropriate root to use. At this point, the question is: what do we want to do with that root? The answer is simple: nothing. And that's what I told perl: do nothing.

        Does that help answer the question?

Re: XML::Twig generating invalid XHTML output??
by mirod (Canon) on Feb 14, 2005 at 18:52 UTC

    As noted before, you should be using either twig_roots/twig_print_outside_roots OR print at the end of the processing.

    I'll try to explain:

    You use twig_roots/twig_print_outside_roots to write filters: anything you are not interested in is output as you parse the document. The elements you are interested in are processed in the handler. But then you have to output them yourself, once processed, at the end of the handler, so that they are output at the proper time. So in your example you would have a $body->print at the end of insert_form_tags(), and you would remove the $xml->print at the end of your main code.
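
    The filter approach described above might look roughly like the sketch below. It is not taken from the original post: the filter_xhtml wrapper and the inline sample document are made up for illustration, and the document is parsed from a string rather than from a file so the example stands alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Filter-mode sketch. filter_xhtml is a hypothetical wrapper name,
# not something from the original post.
sub filter_xhtml {
    my ($xhtml) = @_;
    my $twig = XML::Twig->new(
        keep_spaces_in           => ['pre'],
        pretty_print             => 'indented',
        twig_roots               => { body => \&insert_form_tags },
        twig_print_outside_roots => 1,   # text outside <body> is printed as it is parsed
    );
    $twig->parse($xhtml);
    # Deliberately no $twig->print here: the outside text has already
    # been printed, and the handler prints <body> itself.
    return;
}

sub insert_form_tags {
    my ($twig, $body) = @_;
    # Wrap the existing children of <body> in a new <form> element.
    $body->insert(form => { method => 'Post', action => 'submit.cgi' });
    # Print the processed element from inside the handler, so it lands
    # at its proper place in the output stream.
    $body->print;
    return;
}

# Made-up sample input, equivalent to the document in the question.
filter_xhtml(
    '<html><head><title>example</title></head>'
  . '<body><text>Foo Bar</text></body></html>'
);
```

    The only changes relative to the code in the question are the print inside the handler and the missing print after parsing.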

    OTOH, if your documents are small (i.e. likely to fit in memory), then there is really no need to use twig_roots/twig_print_outside_roots. You can either load the entire document in memory and then go from there (the body would be $xml->root->first_child( 'body') or $xml->first_elt( 'body')), or use twig_handlers to process elements "in place" during the parsing and then output the entire document at the end (if memory is an issue, you can also use this mode and flush the twig at the end of the handler and then again at the end of the parsing).
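
    For small documents, the two in-memory options above could be sketched as follows. The function names and the sample document are assumptions made for illustration, and sprint is used instead of print only so the result can be returned as a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Option 1: load the whole document, then modify the tree.
# add_form_in_memory is a hypothetical name used for this sketch.
sub add_form_in_memory {
    my ($xhtml) = @_;
    my $twig = XML::Twig->new(
        keep_spaces_in => ['pre'],
        pretty_print   => 'indented',
    );
    $twig->parse($xhtml);
    my $body = $twig->first_elt('body')
        or die "no <body> element found\n";
    $body->insert(form => { method => 'Post', action => 'submit.cgi' });
    return $twig->sprint;    # the whole document, output once
}

# Option 2: process <body> "in place" with twig_handlers during the
# parse, then output the entire document at the end.
# add_form_with_handler is also a hypothetical name.
sub add_form_with_handler {
    my ($xhtml) = @_;
    my $twig = XML::Twig->new(
        keep_spaces_in => ['pre'],
        pretty_print   => 'indented',
        twig_handlers  => {
            # $_ is the element that triggered the handler
            body => sub {
                $_->insert(form => { method => 'Post', action => 'submit.cgi' });
            },
        },
    );
    $twig->parse($xhtml);
    return $twig->sprint;
}

# Made-up sample input, equivalent to the document in the question.
my $sample = '<html><head><title>example</title></head>'
           . '<body><text>Foo Bar</text></body></html>';

print add_form_in_memory($sample);
print add_form_with_handler($sample);
```

    Neither variant uses twig_roots, so nothing is printed during parsing and the document is output exactly once.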

    Does this make sense? If yes I will probably add this to the FAQ.