http://www.perlmonks.org?node_id=983379

hulot has asked for the wisdom of the Perl Monks concerning the following question:

I've only used Perl for 2 weeks, so apologies for my likely ignorance.

I want to write a script that downloads a webpage and renders part of that page in wiki format. I appreciate that there is an HTML::WikiConverter module, but I would like to implement this myself, partly because I only want to render some elements of the HTML. I will be using HTML::Tree.

The first step is to build the tree. That appears straightforward:

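#!/usr/bin/perl -w
use HTML::Tree;
use LWP::Simple;
use strict;

# getstore returns the HTTP status code (always a true value),
# so check it with is_success rather than "or die"
my $status = getstore("http://www.guardian.co.uk", "guardian.htm");
die "Cannot get the page.\n" unless is_success($status);

my $tree = HTML::TreeBuilder->new();
$tree->parse_file("guardian.htm");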

In pseudo (pseudo) code I wish to look at each element of the page. For each element, if the tag is one I'm interested in, then I wish to take the text of the element and render it to wiki format.

I just don't understand how to loop through all the elements. A discussion in the HTML::Tree documentation suggests a recursive method of accessing all the elements:

{
  my $counter = 'x0000';
  sub give_id {
    my $x = $_[0];
    $x->attr('id', $counter++) unless defined $x->attr('id');
    foreach my $c ($x->content_list) {
      give_id($c) if ref $c; # ignore text nodes
    }
  };
  give_id($start_node);
}

But I don't understand this code and can't adapt it.

Once I have a 'loop' method of looking at each element I propose processing them like this:

my $tag = $element->tag;
if ($tag eq 'h1' or $tag eq 'h2') {
    my $content = $element->as_text();
    print $outfile "====$content====\n";   # $outfile: a filehandle already opened for writing
}

I will have several elsif statements doing something similar with other tags.

My question, then, is how can I write a loop that allows me to look at each element in the tree? (The traverse method is deprecated.)

Re: Inspecting each element in a tree, specifically HTML::Tree
by tobyink (Canon) on Jul 24, 2012 at 11:52 UTC

    The recursive example is quite a simple and good one. If you don't understand it then you should read up on the topic of recursion. It's quite a widely used concept in programming and it can make things that would otherwise be very hard to program, quite easy and concise. Developing a good understanding of what recursion is, how/why it works, and when it's a good idea to use it will help you not just solve your current problem, but become a better programmer.

    Wikipedia has a reasonably good article on recursion.

    If this helps, here's a slightly rewritten version of your recursive function:

    sub do_something_recursively {
        my ($element) = @_;

        # Here we will do something just to $element
        # without worrying about recursion at all.
        $element->attr('class', "processed");

        # Now we loop through each child element
        foreach my $child ($element->content_list) {
            # Skip text nodes
            next unless ref $child;

            # And we call *this function* on the child
            do_something_recursively($child);
        }
    }

    do_something_recursively($root_element);
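    Applied to the original question, the same pattern might look like the sketch below. It assumes HTML::TreeBuilder (from the HTML-Tree distribution) is installed. The %wiki_markup table, the render_wiki name and the sample HTML are made up for illustration, and the ==== markup is simply copied from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical table: tag name => wiki markup wrapped around the text.
# Both headings use "====" because that is what the question used.
my %wiki_markup = (
    h1 => '====',
    h2 => '====',
);

# Visit $element and all of its descendants, returning one wiki
# line for every element whose tag appears in %wiki_markup.
sub render_wiki {
    my ($element) = @_;
    my @lines;

    # Do something with this element only
    my $markup = $wiki_markup{ $element->tag };
    push @lines, $markup . $element->as_text . $markup if defined $markup;

    # Then recurse into the child elements
    foreach my $child ($element->content_list) {
        next unless ref $child;    # skip text nodes
        push @lines, render_wiki($child);
    }
    return @lines;
}

# Sample input (an assumption); a real script would use parse_file instead
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><h1>Title</h1><p>Intro <b>text</b></p><h2>Section</h2></body></html>'
);
my @wiki = render_wiki($tree);
print "$_\n" for @wiki;
$tree->delete;    # free the tree once finished with it
```

    Adding another tag is then one more entry in the table rather than another elsif branch.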
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Inspecting each element in a tree, specifically HTML::Tree
by daxim (Curate) on Jul 24, 2012 at 16:32 UTC
    With better tools it becomes easier. No explicit recursion here, I just say what I want.

    use Web::Query 'wq';
    my $w = wq('http://www.guardian.co.uk');
    my @headings = $w->find('h1')->text;
    # (
    #     "Coulson and Brooks face phone hacking prosecution",
    #     " Usain Bolt: 'A lot of legends have come before me - but this is my time' ",
    # )
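    HTML::Tree can also do the walking for you, which may suit someone who wants to stay with that module. The sketch below uses HTML::Element's look_down, which in list context returns every matching element at or below the starting node. The sample HTML is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input (an assumption); a real script would parse the downloaded file
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><h1>Title</h1><p>Intro</p><h2>Section</h2></body></html>'
);

# _tag is the pseudo-attribute holding the tag name; a qr// value
# matches any element whose tag name matches the pattern.
my @headings = $tree->look_down(_tag => qr/^h[12]$/);

my @text = map { $_->as_text } @headings;
print "$_\n" for @text;

$tree->delete;    # free the tree once finished with it
```

    This trades the explicit recursion for a declarative query, at the cost of a separate look_down call (or a broader pattern) for each group of tags.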
      use 5.010;
      use Web::Magic -quotelike => w;
      w(http://www.guardian.co.uk/)
          -> querySelectorAll('h1 a')
          -> foreach(sub { say $_->textContent })
Re: Inspecting each element in a tree, specifically HTML::Tree
by Anonymous Monk on Jul 25, 2012 at 03:44 UTC

      Thanks for all the useful comments.

      I have got to grips with this instance of recursion and got the original HTML::Tree method to work. I will also take a look at Web::Query.