PerlMonks  

Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element

by mldvx4 (Pilgrim)
on Nov 15, 2021 at 09:11 UTC (#11138822)

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Thanks for the previous.

I have a question about either HTML::TreeBuilder::XPath or HTML::Element, and the interaction between them. I would like to manipulate the content of an element while leaving all its children in place. I haven't found a way to do that, because replace_with() appears to escape the < and > signs automatically and unavoidably. The example below uses ~literal, but I've also tried creating a new element. Either way, the child elements within the selected element get escaped despite my best efforts. How would it be possible to do something like the following (using a different workflow if necessary) such that the tags for the child elements remain intact and unescaped?

    #!/usr/bin/perl

    use HTML::TreeBuilder::XPath;
    use HTML::Element;
    use warnings;
    use strict;

    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->no_space_compacting(1);

    $xhtml->parse_file(\*DATA)
        or die("Could not parse file handle for 'DATA' : $!\n");

    for my $item ($xhtml->findnodes('//div/ul/li')) {
        my $li = $item->as_XML;
        $li =~ s/^\s+//;
        # ... omitting rest of the stuff which happens to $li ...
        my $new = HTML::Element->new('~literal', 'text' => $li);
        $item->replace_with($new);
    }

    print $xhtml->as_XML_indented;

    $xhtml->delete;

    exit(0);

    __DATA__
    <html>
      <head>
        <title>Foo Bar</title>
      </head>
      <body>
        <div><a href=" http://foo.example.com/ ">Foo Bar</a>
          <ul>
            <li>
              foo foo foo foo <em>bar</em> foo foo foo foo foo
            </li></ul></div>
        <div><a href=" http://bar.example.com/ ">Bar Foo</a>
          <ul>
            <li>
              foo foo foo foo <em>bar</em> foo foo foo foo foo
              <ul>
                <li>alpha</li>
                <li>b<em>et</em>a</li>
                <li>gamma</li>
              </ul>
            </li></ul></div>
      </body>
    </html>

The output I get is as follows:

    <html>
      <head>
        <title>Foo Bar</title>
      </head>
      <body>
        <div><a href=" http://foo.example.com/ ">Foo Bar</a>
          <ul>&lt;li&gt; foo foo foo foo &lt;em&gt;bar&lt;/em&gt; foo foo foo foo foo &lt;/li&gt;
          </ul>
        </div>
        <div><a href=" http://bar.example.com/ ">Bar Foo</a>
          <ul>&lt;li&gt; foo foo foo foo &lt;em&gt;bar&lt;/em&gt; foo foo foo foo foo &lt;ul&gt;&lt;li&gt;alpha&lt;/li&gt;&lt;li&gt;b&lt;em&gt;et&lt;/em&gt;a&lt;/li&gt;&lt;li&gt;gamma&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
          </ul>
        </div>
      </body>
    </html>

The output I would like to get instead would look like this:

    <html>
      <head>
        <title>Foo Bar</title>
      </head>
      <body>
        <div><a href=" http://foo.example.com/ ">Foo Bar</a>
          <ul><li>foo foo foo foo <em>bar</em> foo foo foo foo foo </li>
          </ul>HTML::TreeBuilder::XPath
        </div>
        <div><a href=" http://bar.example.com/ ">Bar Foo</a>
          <ul><li>foo foo foo foo <em>bar</em> foo foo foo foo foo <ul><li>alpha</li><li>b<em>et</em>a</li><li>gamma</li></ul></li>
          </ul>
        </div>
      </body>
    </html>

I'm not sure if HTML::TreeBuilder::XPath can be made to work like that. If it can, what has to change?

Replies are listed 'Best First'.
Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element
by haukex (Bishop) on Nov 15, 2021 at 10:39 UTC
    # ... omitting rest of the stuff which happens to $li ...

    This is actually the important bit. A DOM tree is a tree of objects, so in your call to $item->replace_with($new), $new needs to be a tree of objects representing the HTML that you want to insert, not just a single text node. One would normally do this by directly manipulating the objects in the tree, or by building a new subtree to replace the old one. But you haven't told us what manipulations you wish to do, so it's difficult to make a more specific recommendation. Your expected output is identical to your input except for whitespace changes (and the insertion of "HTML::TreeBuilder::XPath", which I am guessing might be a mistake), but because whitespace is insignificant in many places in HTML/XML, I can't tell what manipulations you might want to make here.
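As a minimal sketch of the "tree of objects" point, the snippet below builds a small stand-in for one list from the question and replaces its <li> with a new element whose children are strings and HTML::Element objects. The variable names ($ul, $old_li, $new_li) and the sample content are made up for illustration; the trimming that the real script would do is assumed to have happened already when the strings are written out.

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical stand-in for one <ul> from the question, built by hand.
my $ul = HTML::Element->new_from_lol(
    [ 'ul', [ 'li', '  foo ', [ 'em', 'bar' ], ' foo  ' ] ]
);
my ($old_li) = $ul->content_list;

# The replacement is itself an element. Its children are plain strings
# (text nodes) and HTML::Element objects, never serialized markup.
my $new_li = HTML::Element->new('li');
$new_li->push_content(
    'foo ',
    $old_li->look_down( _tag => 'em' )->clone,
    ' foo',
);

$old_li->replace_with($new_li);
$old_li->delete;    # the detached old subtree is no longer needed

print $ul->as_XML;
```

Because the <em> is passed as an object, as_XML has no markup-in-text to escape, which is the difference from handing replace_with() a string that contains tags.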

Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element
by marto (Cardinal) on Nov 15, 2021 at 10:44 UTC

      Thanks haukex and marto. In this case, I just want to trim the unnecessary white space from the start and end of a few elements and attributes. The attributes are easy to work with so that is solved. However, I am not sure how to apply a substitution, s///, to an element containing more than just text.

        However, I am not sure how to apply a substitution, s///, to an element containing more than just text.

        The documentation of HTML::Element's content_refs_list gives an example of how to modify text nodes contained in an element and the documentation of HTML::Element::traverse shows how to use a recursive function to walk the tree. Putting those together:

        sub html_trim {
            my $elem = shift;
            for my $itemref ($elem->content_refs_list) {
                if ( ref $$itemref ) { html_trim($$itemref) }  # remove this for non-recursive
                else { $$itemref =~ s/^\s+|\s+$//g }
            }
        }

        for my $elem ($xhtml->findnodes('//div/ul/li')) {
            html_trim($elem)
        }
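One thing to keep in mind with per-text-node trimming is that it also removes the spaces next to inline children, so " foo <em>bar</em> foo " would come out as "foo<em>bar</em>foo". If only the outer edges of the element should be trimmed, a variant along the following lines might be closer to the wanted output. This is a sketch, not part of the reply above: the name html_trim_edges and the sample <li> are hypothetical, and only the first and last content items are touched.

```perl
use strict;
use warnings;
use HTML::Element;

# Trim whitespace only at the outer edges of an element's content, so the
# spaces between text and inline children such as <em> are preserved.
sub html_trim_edges {
    my $elem = shift;
    my @refs = $elem->content_refs_list;
    return unless @refs;

    # Leading whitespace: only if the first child is a text node.
    ${ $refs[0] } =~ s/^\s+// unless ref ${ $refs[0] };

    # Trailing whitespace: only if the last child is a text node.
    ${ $refs[-1] } =~ s/\s+$// unless ref ${ $refs[-1] };

    return;
}

# Hypothetical sample element for illustration.
my $li = HTML::Element->new_from_lol(
    [ 'li', "\n  foo foo ", [ 'em', 'bar' ], " foo foo\n" ]
);

html_trim_edges($li);
print $li->as_XML;
```

In the original script this would be called as html_trim_edges($elem) inside the findnodes('//div/ul/li') loop in place of html_trim($elem).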
Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element
by tangent (Vicar) on Nov 15, 2021 at 14:19 UTC
    Many of the tools used to parse HTML use HTML::Parser under the hood, and it is worthwhile knowing how it works. This script gathers up all the content of each list item, including other elements, into a variable. When it meets the closing list item tag, you can do what you need to the content before printing it out.
    use HTML::Parser;

    my $inside_li = 0;
    my $list_item = '';

    sub start {
        my ($tag, $text) = @_;
        if ($inside_li) {
            $list_item .= $text;
            return;
        }
        if ($tag eq 'li') {
            $inside_li = 1;
        }
        print $text;
    }

    sub text {
        my ($text) = @_;
        if ($inside_li) {
            $list_item .= $text;
            return;
        }
        print $text;
    }

    sub end {
        my ($tag, $text) = @_;
        if ($tag eq 'li') {
            $inside_li = 0;
            # do things to <li> content
            $list_item =~ s/^\s+//;
            print $list_item;
            $list_item = '';
        }
        if ($inside_li) {
            $list_item .= $text;
            return;
        }
        print $text;
    }

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [\&start, "tagname, text"],
        text_h      => [\&text,  "text"],
        end_h       => [\&end,   "tagname, text"],
        default_h   => [\&text,  "text"],
    );

    $parser->parse_file(\*DATA);
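The sample data in the question has a list nested inside an <li>, and a single flag treats the first inner </li> as the end of the outer item. A possible variation, sketched here under the assumption that only the outermost <li> should be post-processed, counts nesting depth instead. The names $depth, $out and emit are made up for this sketch, and the output is collected in a string rather than printed as it goes.

```perl
use strict;
use warnings;
use HTML::Parser;

my $depth     = 0;     # number of <li> elements currently open
my $list_item = '';    # raw content of the outermost open <li>
my $out       = '';    # collected output

# Route a chunk of source text to the <li> buffer or to the output.
sub emit {
    my ($text) = @_;
    if ($depth) { $list_item .= $text }
    else        { $out       .= $text }
}

sub start {
    my ($tag, $text) = @_;
    if ($tag eq 'li') {
        $depth++;
        if ($depth == 1) {    # outermost <li>: pass the tag itself through
            $out .= $text;
            return;
        }
    }
    emit($text);
}

sub end {
    my ($tag, $text) = @_;
    if ($tag eq 'li' && $depth > 0) {
        $depth--;
        if ($depth == 0) {
            # do things to the outermost <li> content
            $list_item =~ s/^\s+|\s+$//g;
            $out .= $list_item . $text;
            $list_item = '';
            return;
        }
    }
    emit($text);
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start, 'tagname, text' ],
    end_h       => [ \&end,   'tagname, text' ],
    default_h   => [ \&emit,  'text' ],    # text, comments, declarations
);

$parser->parse('<ul><li> a <ul><li>b</li></ul> </li></ul>');
$parser->eof;

print $out, "\n";
```

Inner <li> tags and their content are accumulated verbatim, so the substitution sees the whole outer item, nested list included.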
Re: Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element
by Anonymous Monk on Nov 15, 2021 at 12:45 UTC
