ateague has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE:
This appears to have been fixed at or before XML::Twig version 3.52. Thank you Mirod for your work!


Good evening!

I am using XML::Twig to transform some simple XML data with ActivePerl 5.020 x64.

I am turning data like this:
<?xml version="1.0" encoding="UTF-8"?> <shipment> <box> <packing> <slip>potato</slip> <slip>pear</slip> <slip>peach</slip> </packing> </box> <box> <packing> <slip>apple</slip> </packing> </box> </shipment>
and "flattening" (I think that is the correct term) into something that looks like this:
<?xml version="1.0" encoding="UTF-8"?> <shipment> <slip _TYPE="produce">potato</slip> <slip _TYPE="produce">pear</slip> <slip _TYPE="produce">peach</slip> <slip _TYPE="produce">apple</slip> </shipment>

The issue I am coming up against is that the final "</shipment>" tag gets duplicated whenever I process a document that has a "<box>...</box>" section without any <slip> sections in it.

Thank you for your time.

Sample code and output:

#!/usr/bin/perl use 5.020; use strict; use warnings; use XML::Twig; ## ## This works as expected ## my $working_xml = XML::Twig->new( twig_handlers => { '/shipment/box' => sub { _move(@_); 1; }, }, pretty_print => 'indented', )->safe_parse( <<"XML" <?xml version="1.0" encoding="UTF-8"?> <shipment> <box> <packing> <slip>potato</slip> <slip>pear</slip> <slip>peach</slip> </packing> </box> <box> <packing> <slip>apple</slip> </packing> </box> </shipment> XML ); say "\n\n---------------------\n\n"; ## ## This 'breaks', printing the final '</shipment>' tag twice ## my $broken_xml = XML::Twig->new( twig_handlers => { '/shipment/box' => sub { _move(@_); 1; }, }, pretty_print => 'indented', )->safe_parse( <<"XML" <?xml version="1.0" encoding="UTF-8"?> <shipment> <box> <packing> <slip>potato</slip> <slip>pear</slip> <slip>peach</slip> </packing> </box> <box> <packing> <note>nothing here!</note> </packing> </box> </shipment> XML ); # ---------- sub _move { foreach my $descendant ( $_[1]->descendants('slip') ) { $descendant->set_att('_TYPE' => 'produce'); # Move the <slip> out of the <box> section # # This appears to be the problem, even though # this should never be reached in the second # document (no <slip> descendants) $descendant->move('before', $_[1]); } $_[1]->delete; $_[0]->flush(); 1; }
Output:
<?xml version="1.0" encoding="UTF-8"?> <shipment> <slip _TYPE="produce">potato</slip> <slip _TYPE="produce">pear</slip> <slip _TYPE="produce">peach</slip> <slip _TYPE="produce">apple</slip> </shipment> --------------------- <?xml version="1.0" encoding="UTF-8"?> <shipment> <slip _TYPE="produce">potato</slip> <slip _TYPE="produce">pear</slip> <slip _TYPE="produce">peach</slip> </shipment> </shipment>
  • Comment on [SOLVED] Moving elements with XML::Twig: Document root closing tag is duplicated on output
  • Select or Download Code

Replies are listed 'Best First'.
Re: Moving elements with XML::Twig: Document root closing tag is duplicated on output
by Tanktalus (Canon) on Feb 19, 2016 at 22:12 UTC

    Normally, I'd expect flushing to be required outside of the twig handlers. What it kind of looks like is happening, besides being a probable bug in XML::Twig, is that you're flushing even when you're not changing anything. For example, updating _move to be:

    sub _move { my $did_something = 0; foreach my $descendant ( $_[1]->descendants('slip') ) { $did_something++; $descendant->set_att('_TYPE' => 'produce'); # Move the <slip> out of the <box> section # # This appears to be the problem, even though # this should never be reached in the second # document (no <slip> descendants) $descendant->move('before', $_[1]); } $_[1]->delete; $_[0]->flush() if $did_something; 1; }
    seems to resolve the problem. Also, adding $broken_xml->flush; to after the call to safe_parse does not bring back the problem. And I think that extra flush is something XML::Twig documents as being required in some circumstances anyway, IIRC.

      And I think that extra flush is something XML::Twig documents as being required in some circumstances anyway, IIRC.

      Correct. According to the XML::Twig docs, "if you use flush at any point during parsing, the document will be flushed one last time at the end of the parsing, to the proper filehandle."

      For example, updating _move to be:<snip...>

      That is essentially what I ended up doing in my final code to work around the problem:

      sub _move { # Skip this section if no "<slip>" descendants are found if ( !scalar $_[1]->descendants('slip') ) { $_[1]->delete; $_[0]->purge; } # "<slip>" descendants found, process as normal else { <...> $_[1]->delete; $_[0]->flush(); } 1; }

      This appears to work, but I don't know why it works, or even what the root problem is. It all feels very cargo-cult right now...

Re: Moving elements with XML::Twig: Document root closing tag is duplicated on output
by choroba (Archbishop) on Feb 20, 2016 at 03:58 UTC
    Just for comparison, the same task in xsh, a wrapper around XML::LibXML:
    open 1155670.xml ; for /shipment/box xmove packing/slip replace . ; xinsert attribute '_TYPE="produce"' into /shipment/slip ; for /shipment/slip[not(following::node()[1][self::text()])] xinsert text {"\n "} after . ; save :b ;

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thanks.

      I will certainly keep that in mind for future projects and experiment with it when I have the time. Always nice to have more tools in my tool belt.

Re: Moving elements with XML::Twig: Document root closing tag is duplicated on output -- solution
by Discipulus (Abbot) on Feb 20, 2016 at 22:44 UTC
    Even if i dunno exactly what happen with your code, i'm sure it is wrong. the way you set up the handler made me immediatly suspicious. In fact I never seen a syntax like your

    '/shipment/box' => sub { _move(@_); 1; }

    Even the SYSNOPSIS for the XML::Twig reports the example on how to call a subroutine to process the part intercepted by th handler:

    list    => \&my_list_process,

    The, perfectly correct and valid, style to call a sub with &subname is described in perlsub where is stated &NAME;       # Makes current @_ visible to called subroutine. If someone tell you is Perl4 rubbish.. let me know.. ;)

    So changing your

    # '/shipment/box' => sub { _move(@_); 1; } #wrong # to '/shipment/box' => \&_move,

    makes everything much clearer and Twiggy.

    For the rest I can say that, also you have an unneeded 1; as last thing in your sub. Seems more logical to have an exit condition if no slip are found (..ooops):

    sub _move { unless ($_[1]->descendants('slip')){ $_[0]->purge(); return; } foreach my $descendant ...

    This way the program runs as we expected:

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      There's nothing wrong with '/shipment/box' => sub { _move(@_); 1; }. This is perfectly valid code. It means, call _move with all my parameters, but always return 1 - and this is all wrapped up with a code reference (it's an anonymous, or unnamed, sub). That return code, if you read the XML::Twig documentation, is relevant: "If a handler returns a true value and other handlers apply, then the next applicable handler will be called." I'm not sure what other handler there is, but there may be something trying to flush again here that I'm not seeing. When you change the return value, you may be changing the calling of this other handler, and that may stop the flushing. However, since handlers are documented to have meaningful return codes, it's probably best if you provide said meaningful return code.

        thanks Tanktalus to have taken the time to clarify this matter to me.

        Anyway maybe mirod can explain the details. You know (or well maybe not..) that my Perl knoweledge is a little.. empyric and if by one hand i tend to try many times unusefull and dangerous things with Perl, by other hand i tend to play gently when using modules, trying to keep me more close that i can to what the module's docs say. Because i ignore what XML::Twig does under the wood. Someone can learn Perl, no one CPAN (at least not me).

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Pretty much what Tanktalus said earlier. This code is a vastly simplified example piece that reproduces the issue in question. I tried to trim out as much unnecessary cruft to make the example as small and concise as possible while still remaining relevant

      In answer to some of your questions:

      1. the way you set up the handler made me immediatly suspicious.

        the live code has some additional processing that happens after the twig is processed. It would look something like the following simplified example:

        sub { my $local_state = _move(@_); $global_state = _do_stuff_with_local_state($local_state->{SUMMARY}, \%VARS);  1; },
      2. The, perfectly correct and valid, style to call a sub with &subname

        Correct. The actual code passes extra variables to the twig processing subroutine in addition to the default TWIG and ELEMENT variables passed to the handler:

        sub { _move(@_, $local_state, \%OPTIONS, $warning_flag); 1; },
      3. For the rest I can say that, also you have an unneeded 1; as last thing in your sub.

        Pretty much what Tanktalus said earlier.

        The function called by the handler returns '1' in my simplified example (instead of variables) because it is an explicit 'true' return value

        The anonymous sub returns '1' because there are multiple twig handlers that may apply to a given twig:

        '/shipment/box' => sub { _move(@_); 1; }, '/shipment/box[@location='Vault_111']' => sub { _moveAgain(@_); 1; },

      Hope that clarifies things.

        yes clarify enough..

        I see you have more experience that me. I just tried to bypass the unwanted behaviour cleaning as much as possible your code.

        As for my first sentence i dunno what happen with your code. Anyway i msgd also the XML::Twig author about the possibility to spend some spare time with this problem.

        Anyway the general advise of keep things simple is still valid. A stream of XML and the parser involved is something that can block even rough stomachs, at least mine. I'm sure (at this point) that you read carefully the docs, but in the part dedicated to twig_handlers is declared:

        Warning: if you have used purge or flush on the twig the element might not be complete, some of its children might have been entirely flushed or purged, and the start tag might even have been printed (by flush) already, so changing its gi might not give the expected result.

        that might be relevant in your case.

        Have you noticed that changing $_[0]->flush(); to $_[0]->print(); in the sub does not provoke the unwanted behaviour?

        But in the docs is said that print must be used AFTER the parse, so I've commented it in the sub and I've added it after the safe_parse $working_xml->print(); and $broken_xml->print(); respectively: the bad doubled closing tag does not appears anymore.

        Share your further discoveries and good luck.

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.