Inspecting each element in a tree, specifically HTML::Tree

hulot has asked for the wisdom of the Perl Monks concerning the following question:

I've only used Perl for 2 weeks, so apologies for my likely ignorance.

I want to write a script that downloads a webpage and renders part of that page in wiki format. I appreciate that there is an HTML::WikiConverter module, but I would like to implement this myself, partly because I only want to render some elements of the HTML. I will be using HTML::Tree.

The first step is to build the tree. That appears straightforward:

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use LWP::Simple;

# getstore returns the HTTP status code, so check it with is_success
my $status = getstore("http://www.guardian.co.uk", "guardian.htm");
die "Cannot get the page.\n" unless is_success($status);

my $tree = HTML::TreeBuilder->new();
$tree->parse_file("guardian.htm");

In pseudo (pseudo) code I wish to look at each element of the page. For each element, if the tag is one I'm interested in, then I wish to take the text of the element and render it to wiki format.

I just don't understand how to loop through all the elements. A discussion in the HTML::Tree documentation suggests a recursive method of accessing all the elements:

 {
    my $counter = 'x0000';
    sub give_id {
      my $x = $_[0];
      $x->attr('id', $counter++) unless defined $x->attr('id');
      foreach my $c ($x->content_list) {
        give_id($c) if ref $c; # ignore text nodes
      }
    };
    give_id($start_node);
  }

But I don't understand this code and can't adapt it.

Once I have a 'loop' method of looking at each element I propose processing them like this:

if ($element->tag =~ /^h[12]$/) {
    my $content = $element->as_text();
    # $outfile is assumed to be a filehandle already opened for writing
    print $outfile "====$content====\n";
}

I will have several elsif statements doing something similar with other tags.

My question, then, is how I can write a loop that allows me to look at each element in the tree. (The traverse method is deprecated.)


Replies are listed 'Best First'.
Re: Inspecting each element in a tree, specifically HTML::Tree by tobyink (Canon) on Jul 24, 2012 at 11:52 UTC
The recursive example is quite a simple and good one. If you don't understand it, then you should read up on the topic of recursion. It's a widely used concept in programming, and it can make things that would otherwise be very hard to program quite easy and concise. Developing a good understanding of what recursion is, how and why it works, and when it's a good idea to use it will help you not just solve your current problem, but become a better programmer. Wikipedia has a reasonably good article on recursion.

If this helps, here's a slightly rewritten version of your recursive function:

sub do_something_recursively {
    my ($element) = @_;

    # Here we will do something just to $element
    # without worrying about recursion at all.
    $element->attr(class => "processed");

    # Now we loop through each child element
    foreach my $child ($element->content_list) {
        # Skip text nodes
        next unless ref $child;

        # And we call this function on the child
        do_something_recursively($child);
    }
}

do_something_recursively($root_element);

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
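Adapting that recursive walk to the original question might look like the sketch below. It is only an illustration under some assumptions: the HTML is a small inline string rather than the downloaded page, and the `%marker` table and the `collect_wiki` name are hypothetical, not part of HTML::Tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical mapping from tag name to wiki heading marker.
my %marker = (h1 => '==', h2 => '===');

# Visit $element, then recurse into each child element (depth-first).
# Matching elements are appended to the array reference $out.
sub collect_wiki {
    my ($element, $out) = @_;

    my $mark = $marker{ $element->tag };
    push @$out, "$mark " . $element->as_text . " $mark" if defined $mark;

    foreach my $child ($element->content_list) {
        next unless ref $child;    # plain strings are text nodes
        collect_wiki($child, $out);
    }
    return $out;
}

# Assumed sample input standing in for the downloaded page.
my $html = '<html><body><h1>Top</h1><div><h2>Inner</h2><p>Text</p></div></body></html>';
my $tree  = HTML::TreeBuilder->new_from_content($html);
my $lines = collect_wiki($tree, []);
print "$_\n" for @$lines;
$tree->delete;    # free the tree when done
```

Collecting the lines into an array instead of printing inside the function keeps the traversal separate from the output step, so further tags can be handled by adding entries to the table.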
Re: Inspecting each element in a tree, specifically HTML::Tree by daxim (Curate) on Jul 24, 2012 at 16:32 UTC
With better tools it becomes easier. No explicit recursion here; I just say what I want.

use Web::Query 'wq';
my $w = wq('http://www.guardian.co.uk');
my @headings = $w->find('h1')->text;
# (
#     "Coulson and Brooks face phone hacking prosecution",
#     " Usain Bolt: 'A lot of legends have come before me - but this is my time' ",
# )
Re^2: Inspecting each element in a tree, specifically HTML::Tree by tobyink (Canon) on Jul 25, 2012 at 00:16 UTC
use 5.010;
use Web::Magic -quotelike => w;
w(http://www.guardian.co.uk/)
    -> querySelectorAll('h1 a')
    -> foreach(sub { say $_->textContent })
Re: Inspecting each element in a tree, specifically HTML::Tree by Anonymous Monk on Jul 25, 2012 at 03:44 UTC
hulot wrote: "I've only used Perl for 2 weeks, so apologies for my likely ignorance."

You're so close it's not even funny: just use look_down to examine every element. Read all about it in HTML::Tree / HTML::Tree::Scanning.

Or go the declarative route (XPath) with HTML::TreeBuilder::XPath / htmltreexpather.pl. For examples and walkthroughs, see Parsing HTML / Re^4: Parsing HTML, A regex question, and NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day.
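As a rough sketch of the look_down suggestion: the call below asks the tree for every h1 or h2 element and formats each one as in the original question. The inline HTML string is an assumed stand-in for the downloaded page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Assumed sample input standing in for the downloaded page.
my $html = '<html><body><h1>Top</h1><p>Skip me</p><h2>Inner</h2></body></html>';
my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element at or under $tree.
# A regex criterion is matched against the named attribute; _tag is the tag name.
my @headings = $tree->look_down(_tag => qr/^h[12]$/);

# Format each heading the way the question proposes.
my @wiki = map { '====' . $_->as_text . '====' } @headings;
print "$_\n" for @wiki;
$tree->delete;    # free the tree when done
```

This avoids writing the recursion by hand, at the cost of one look_down call (or one more alternative in the regex) per kind of element to be rendered.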
Re^2: Inspecting each element in a tree, specifically HTML::Tree by hulot (Initiate) on Jul 25, 2012 at 19:07 UTC
Thanks for all the useful comments. I have got to grips with this instance of recursion and got the original HTML::Tree method to work. I will also take a look at Web::Query.
