
Re: XML::Twig not finding an element's parent's text

by choroba (Cardinal)
on May 18, 2025 at 17:49 UTC


in reply to XML::Twig not finding an element's parent's text

When Twig is processing the bookmark element, it hasn't yet seen the text. It only knows the part of the parent up to the element itself. That's how SAX-like parsers work. You can try adding text before the bookmark element to verify Twig prints it out.

You can set the handler expression to text:h[text:bookmark] (i.e. "an h element with a bookmark child") instead and print the h directly instead of the parent.

If you need more refined navigation, switch to XML::LibXML.
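A minimal sketch of what that could look like with XML::LibXML, not a drop-in replacement for the original script. Because the whole tree is built before any lookup, the parent's text is complete regardless of where the bookmark sits. Two assumptions: the sample element is given an xmlns:text declaration (the snippets in this thread omit it, and XML::LibXML needs the prefix to be bound), and bookmark_contexts is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# ODF "text" namespace. Assumption: the real document declares it on an ancestor.
my $TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

my $source = <<"XML";
<text:h xmlns:text="$TEXT_NS" text:style-name="P900" text:outline-level="3">
Bar foo <text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
</text:h>
XML

# Hypothetical helper: returns [anchor name, whitespace-normalized parent text] pairs.
sub bookmark_contexts {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_string);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(text => $TEXT_NS);

    my @found;
    for my $bookmark ($xpc->findnodes('//text:bookmark')) {
        my $anchor = $bookmark->getAttributeNS($TEXT_NS, 'name');
        # The tree is fully parsed, so text after the bookmark is included.
        my $text = $bookmark->parentNode->textContent;
        $text =~ s/\s+/ /g;
        $text =~ s/^ | $//g;
        push @found, [$anchor, $text];
    }
    return @found;
}

for my $pair (bookmark_contexts($source)) {
    print "Anchor: $pair->[0]\n";
    print "Text: $pair->[1]\n";
}
```

The trade-off is memory: the entire document is held as a DOM, whereas Twig can process and discard parts as it goes.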

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Re^2: XML::Twig not finding an element's parent's text
by mldvx4 (Friar) on May 18, 2025 at 18:16 UTC

    Thanks! That helped.

    > You can try adding text before the bookmark element to verify Twig prints it out.

    That sounded promising but has no effect. Whether before or after the selected element, the text is not available.

    > You can set the handler expression to text:h[text:bookmark] (i.e. "an h element with a bookmark child") instead and print the h directly instead of the parent.

    The thing is that many kinds of elements may contain <text:bookmark text:name="..."/> so the search has to be on text:bookmark as far as I can tell. However, it looks like a slight modification is the way to go: If I select *[text:bookmark] and then go digging deeper from there, that could work:

    #!/usr/bin/perl

    use XML::Twig;
    use strict;
    use warnings;

    my $xml = XML::Twig->new(
        twig_handlers => { '*[text:bookmark]' => \&handler_bookmark } );
    #   twig_handlers => { 'text:bookmark' => \&handler_bookmark } );

    $xml->parse(\*DATA);

    print qq(\n-\n);
    $xml->print;

    exit(0);

    sub handler_bookmark {
        my( $twig, $bookmark)= @_;
        print qq(OK\n);
        print $bookmark->text;
        my @bmk = $bookmark->children('text:bookmark');
        foreach my $b (@bmk) {
            my $anchor = $b->att('text:name');
            print "Anchor: ", $anchor, "\n";
        }
    }

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <text:h text:style-name="P900" text:outline-level="3">
    Bar foo <text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
    </text:h>

    I'll test and get back in a day or so.

      > That sounded promising but has no effect. Whether before or after the selected element, the text is not available.

      I probably wasn't clear enough. This was not advice on how to solve the problem; it was an attempt to show you how Twig behaves.

      #!/usr/bin/perl
      use warnings;
      use strict;

      use XML::Twig;

      my $xml = XML::Twig->new(
          twig_handlers => { 'text:bookmark' => \&handler_bookmark } );

      $xml->parse(\*DATA);

      print qq(\n\n);
      # $xml->print;

      exit(0);

      sub handler_bookmark {
          my ($twig, $bookmark)= @_;
          $bookmark->parent->print;
      }

      __DATA__
      <?xml version="1.0" encoding="UTF-8"?>
      <text:h text:style-name="P900" text:outline-level="3">
      BEFORE<text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
      </text:h>

      Output:
      <text:h text:outline-level="3" text:style-name="P900"> BEFORE<text:bookmark text:name="_asdfqwerzxcv"/></text:h>
      See? "BEFORE" is there, while "Foo bar" is not.


        Thanks. Working with parent elements containing the target element was the way to go, rather than aiming directly at the target element itself. See the first handler below.

        There are a lot of handlers in this script, but here is the pair relevant to my original question:

        my $xml = XML::Twig->new(
            pretty_print => 'nsgmls',      # nsgmls for parsability
            output_encoding => 'UTF-8',
            twig_roots => { 'office:body' => 1 },
            twig_handlers => {
                # link anchors (text:bookmark) must be handled before
                # processing the internal links
                '*[text:bookmark]' => \&handler_bookmark,
        . . .

        $xml = XML::Twig->new(
            pretty_print => 'nsgmls',
            empty_tags => 'html',
            output_encoding => 'UTF-8',
            twig_roots => { 'office:body' => 1 },
            twig_handlers => {
                # links (text:a) must be handled separately from link targets
                'text:a' => \&handler_links,
        . . .

        sub handler_bookmark {
            my ($twig, $bookmark)= @_;
            my @bmk = $bookmark->children('text:bookmark');
            foreach my $bk (@bmk) {
                my $l = $bk->trimmed_text;
                my $t = $l;
                $t =~ s/\s/_/g;
                my $anchor = $bk->att('text:name');
                $bookmarks{$anchor}{'label'} = $l;
                $bookmarks{$anchor}{'target'} = $t;
                $bk->set_text("\n { ".$anchor.' }');
                $bk->parent->merge($bk);
            }
        }

        sub handler_links {
            my ($twig, $link)= @_;
            my $href = $link->att('xlink:href');
            $href =~ s/^\#//;
            my $l = $bookmarks{$href}{'label'};
            my $t = $bookmarks{$href}{'target'};
            if (! $l) {
                $l = $link->trimmed_text;
                $link->set_text("[$href $l]\n");
            } else {
                $link->set_text("[$t $l]\n");
            }
            $link->parent->merge($link);
        }
        . . .

        These two handler subroutines are each used in a separate parsing pass, for a total of two passes. Strangely, two passes seem to be faster than a single pass with all the handlers in one object. The first pass collects a hash of link targets and their labels. The second pass applies those to the links pointing at those targets.
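
The two-pass idea described above can be sketched in a self-contained form as follows. This is a simplified illustration, not the script from the post: the sample document is invented, the label is taken from the enclosing element's trimmed text (an assumption; the original handler reads it from the bookmark itself), and the merge step is left out.

```perl
use strict;
use warnings;
use XML::Twig;

# Invented sample input; namespace declarations are omitted as in the thread.
my $source = <<'XML';
<office:body>
<text:h>Chapter One <text:bookmark text:name="_ch1"/></text:h>
<text:p>See <text:a xlink:href="#_ch1">the first chapter</text:a>.</text:p>
</office:body>
XML

my %bookmarks;    # anchor name => label taken from the enclosing element

# Pass 1: record every bookmark together with its parent's trimmed text.
XML::Twig->new(
    twig_handlers => {
        '*[text:bookmark]' => sub {
            my ($twig, $elt) = @_;
            for my $bk ($elt->children('text:bookmark')) {
                $bookmarks{ $bk->att('text:name') } = $elt->trimmed_text;
            }
        },
    },
)->parse($source);

# Pass 2: rewrite each internal link using the labels gathered in pass 1.
my $second = XML::Twig->new(
    twig_handlers => {
        'text:a' => sub {
            my ($twig, $link) = @_;
            (my $href = $link->att('xlink:href') // '') =~ s/^#//;
            # Fall back to the link's own text when no bookmark matches.
            my $label = $bookmarks{$href} // $link->trimmed_text;
            $link->set_text("[$href $label]");
        },
    },
);
$second->parse($source);

my $result = $second->sprint;
print $result;
```

Splitting the work this way means a link can refer to a bookmark that appears later in the document, since every target is already in the hash before any link is rewritten.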
