PerlMonks  

XML::Twig and Processing Instructions

by eff_i_g (Curate)
on Mar 19, 2009 at 16:28 UTC ( [id://751778]=perlquestion )

eff_i_g has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

Given the s[ai]mple XML file:
<root>
  <element>One<?xpp qa?>Two</element>
</root>
I want to wrap "Two" in <?xpp bold?> and <?xpp /bold?>.

My approach is to create a twig_root of element, loop through its children, find "Two", then add the PIs.

The trouble is that I'm getting one child ("One<?xpp qa?>Two"), when I think I should be getting three ("One", "<?xpp qa?>", "Two"). Alas, the PI is being lumped in with the PCDATA.

Sure, I could parse the PI out of the PCDATA, but that doesn't seem right. Perhaps I've a misunderstanding, a bug, or have overlooked something in the XML::Twig docs?

Insights are appreciated.

P.S. The test code:
use strict;
use warnings;
use XML::Twig;

my $XML = XML::Twig->new(
    twig_roots => {
        'element' => sub {
            for my $child ($_->cut_children()) {
                $child->print();
                print "\n";
            }
        }
    },
    pretty_print => 'indented'
);
$XML->parse(*DATA);
print "\n";

__DATA__
<root>
  <element>One<?xpp qa?>Two</element>
</root>

Replies are listed 'Best First'.
Re: XML::Twig and Processing Instructions
by mirod (Canon) on Mar 19, 2009 at 16:45 UTC

    By default the PIs are "hidden" in the text (in fact, the text of the element doesn't even include the PI). To get 3 children, you need to pass the pi => 'process' option when you create the XML::Twig object; see the documentation for the pi option. You can then even set a handler on the PIs.
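    A minimal sketch of the difference this option makes, using the XML from the question. The variable names ($twig, $element, @kinds) are made up for illustration, and the document is parsed from a string rather than from DATA; only the pi => 'process' option comes from the reply above.

```perl
use strict;
use warnings;
use XML::Twig;

# With pi => 'process', each PI becomes its own node in the tree
# instead of being tucked away inside the surrounding text.
my $twig = XML::Twig->new( pi => 'process' );
$twig->parse('<root><element>One<?xpp qa?>Two</element></root>');

my $element = $twig->root->first_child('element');

# Describe each child: PIs expose a target and data, text nodes expose text.
my @kinds;
for my $child ( $element->children ) {
    if ( $child->is_pi ) {
        push @kinds, sprintf 'PI: target=%s data=%s', $child->target, $child->data;
    }
    else {
        push @kinds, sprintf 'text: %s', $child->text;
    }
}

print "$_\n" for @kinds;
```

    Checking is_pi first keeps the loop from having to guess at node names, and the same test could be used to decide where new PIs should be pasted.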

Re: XML::Twig and Processing Instructions
by eff_i_g (Curate) on Mar 19, 2009 at 17:07 UTC
    Excellent! Thank you mirod!

    Out of curiosity, why are PIs "hidden"? Does it make things easier on the processing (pun intended) side, since many folks may never even touch them?

      They are hidden because they are not described in the DTD (are they described in W3C Schemas?). So when you make assumptions about the kind of XML you're going to process based on the DTD, PIs (and comments) can trip you up, by splitting up text nodes or by showing up as a child or sibling when you don't expect it. Using XPath (or XPath-like navigation in XML::Twig) mitigates the risk, but doesn't eliminate it. So I thought it would be safer to get them out of the way. Especially as in the old days, when XML::DOM and XML::Parser were at the cutting edge of XML technology, I saw way too many examples in books and "serious" web sites that would not have dealt properly with random comments or PIs.

      This way, if you're concerned about PIs and/or comments you can access them, and otherwise you can safely ignore them. They will still be preserved as much as possible: comments or PIs before a start tag will follow the element if it is moved around, and they will be preserved properly even when outside the root or inside the text... if you want to be scared, look for cpi (comments and PIs) or extra_data (that's what I used to call them before I got lazy) in the source.

        Thanks for the explanation, mirod.

        I'm typically creating XML that dead-ends into a typesetting system (and may need its structure changed on a whim); therefore, DTDs are not created, and, as a result, my knowledge of them remains scant.

        For my fellow SOPW, here's what I ended up with (I changed the content of element so the example would make a little more sense):
        use strict;
        use warnings;
        use XML::Twig;

        my $XML = XML::Twig->new(
            pi           => 'process',
            pretty_print => 'indented',
            twig_roots   => {
                'element' => sub {
                    for my $child ($_->cut_children()) {
                        if ($child->is_pcdata()) {
                            for my $piece (split /(Warning:)/i, $child->trimmed_text()) {
                                my $pcdata = XML::Twig::Elt->new('#PCDATA' => $piece);
                                if ($piece =~ /Warning:/i) {
                                    my $b_start = XML::Twig::Elt->new('#PI');
                                    $b_start->set_target('xpp');
                                    $b_start->set_data('bold');
                                    my $b_end = XML::Twig::Elt->new('#PI');
                                    $b_end->set_target('xpp');
                                    $b_end->set_data('/bold');
                                    $b_start->paste('last_child', $_);
                                    $pcdata->paste('last_child', $_);
                                    $b_end->paste('last_child', $_);
                                }
                                else {
                                    $pcdata->paste('last_child', $_);
                                }
                            }
                        }
                        else {
                            $child->paste('last_child', $_);
                        }
                    }
                    $_->flush();
                }
            },
        );
        $XML->parse(*DATA);
        print "\n";

        __DATA__
        <root>
          <element>A sentence about the product<?xpp qa?>Warning: This may spontaneously combust.</element>
        </root>

Node Type: perlquestion [id://751778]
Approved by moritz
Front-paged by grinder