PerlMonks  

Inspecting each element in a tree, specifically HTML::Tree

by hulot (Initiate)
on Jul 24, 2012 at 11:36 UTC (#983379)
hulot has asked for the wisdom of the Perl Monks concerning the following question:

I've only used Perl for 2 weeks, so apologies for my likely ignorance.

I want to write a script that downloads a webpage and renders part of that page in wiki format. I appreciate that there is an HTML::WikiConverter module, but I would like to implement this myself, partly because I only want to render some elements of the HTML. I will be using HTML::Tree.

The first step is to build the tree. That appears straightforward:

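    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use LWP::Simple;

    # getstore returns the HTTP status code, which is always true,
    # so check it with is_success rather than "or die".
    my $status = getstore("http://www.guardian.co.uk", "guardian.htm");
    die "Cannot get the page (HTTP status $status).\n" unless is_success($status);

    # parse_file is a method on the tree object, not a plain function.
    my $tree = HTML::TreeBuilder->new();
    $tree->parse_file("guardian.htm");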

In pseudo (pseudo) code I wish to look at each element of the page. For each element, if the tag is one I'm interested in, then I wish to take the text of the element and render it to wiki format.

I just don't understand how to loop through all the elements. A discussion in the HTML::Tree documentation suggests a recursive method of accessing all the elements:

    {
        my $counter = 'x0000';
        sub give_id {
            my $x = $_[0];
            $x->attr('id', $counter++) unless defined $x->attr('id');
            foreach my $c ($x->content_list) {
                give_id($c) if ref $c;    # ignore text nodes
            }
        };
        give_id($start_node);
    }

But I don't understand this code and can't adapt it.

Once I have a 'loop' method of looking at each element I propose processing them like this:

    if ($element->tag eq 'h1' or $element->tag eq 'h2') {
        my $content = $element->as_text();
        print {$outfile} "====$content====\n";    # $outfile: a filehandle opened elsewhere
    }

I will have several elsif statements doing something similar with other tags.
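As an alternative to a long elsif chain, the tag-to-markup mapping could be kept in a hash. This is only a sketch: the `%wiki_marker` hash and the `render_wiki` helper are hypothetical names, the h1/h2 markers follow the snippet above, and the h3 marker is an assumption added for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: HTML tag name => wiki heading marker.
# h1 and h2 use the marker from the snippet above; h3 is an assumption.
my %wiki_marker = (
    h1 => '====',
    h2 => '====',
    h3 => '===',
);

# Return the wiki line for a tag and its text.
# Returns nothing (undef in scalar context) for tags we do not render.
sub render_wiki {
    my ($tag, $text) = @_;
    my $marker = $wiki_marker{$tag};
    return unless defined $marker;
    return "$marker$text$marker\n";
}

print render_wiki('h1', 'Front page');
```

Adding support for another tag then means adding one hash entry rather than another elsif branch.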

My question then is how can I write a loop that allows me to look at each element in the tree. (The traverse method is deprecated.)

Re: Inspecting each element in a tree, specifically HTML::Tree
by tobyink (Abbot) on Jul 24, 2012 at 11:52 UTC

    The recursive example is quite a simple and good one. If you don't understand it then you should read up on the topic of recursion. It's quite a widely used concept in programming and it can make things that would otherwise be very hard to program, quite easy and concise. Developing a good understanding of what recursion is, how/why it works, and when it's a good idea to use it will help you not just solve your current problem, but become a better programmer.

    Wikipedia has a reasonably good article on recursion.

    If this helps, here's a slightly rewritten version of your recursive function:

    sub do_something_recursively {
        my ($element) = @_;

        # Here we will do something just to $element
        # without worrying about recursion at all.
        # (HTML::Element sets attributes through attr.)
        $element->attr('class', 'processed');

        # Now we loop through each child element
        foreach my $child ($element->content_list) {
            # Skip text nodes
            next unless ref $child;

            # And we call *this function* on the child
            do_something_recursively($child);
        }
    }

    do_something_recursively($root_element);

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
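To connect the recursive pattern above to the original question, here is a minimal sketch that collects h1 and h2 headings as wiki lines. The sample markup, the `render_headings` name and the `@wiki_lines` array are assumptions made for illustration; the tree is built from a string so the example does not need a network connection.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup standing in for the downloaded page (an assumption).
my $html = '<html><body><h1>Top story</h1><p>Body text</p>'
         . '<div><h2>Second story</h2></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @wiki_lines;

# Look at $element, then recurse into each child element (depth first).
sub render_headings {
    my ($element) = @_;

    my $tag = $element->tag;
    if ($tag eq 'h1' or $tag eq 'h2') {
        push @wiki_lines, '====' . $element->as_text . '====';
    }

    foreach my $child ($element->content_list) {
        next unless ref $child;    # skip text nodes
        render_headings($child);
    }
}

render_headings($tree);
print "$_\n" for @wiki_lines;

$tree->delete;    # release the tree once finished
```

The per-element work sits at the top of the function, so further tags can be handled there without touching the loop that does the recursion.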
Re: Inspecting each element in a tree, specifically HTML::Tree
by daxim (Chaplain) on Jul 24, 2012 at 16:32 UTC
    With better tools it becomes easier. No explicit recursion here, I just say what I want.

    use Web::Query 'wq';
    my $w = wq('http://www.guardian.co.uk');
    my @headings = $w->find('h1')->text;
    # (
    #     "Coulson and Brooks face phone hacking prosecution",
    #     " Usain Bolt: 'A lot of legends have come before me - but this is my time' ",
    # )

      use 5.010;
      use Web::Magic -quotelike => w;
      w(http://www.guardian.co.uk/)
          -> querySelectorAll('h1 a')
          -> foreach(sub { say $_->textContent })

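For comparison, HTML::Tree itself can also find elements without hand-written recursion, through the `look_down` method of HTML::Element. The sketch below is not the Web::Query or Web::Magic approach shown above; the sample markup and the variable names are assumptions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup standing in for the downloaded page (an assumption).
my $html = '<html><body><h1>Top story</h1><p>Body text</p>'
         . '<h2>Second story</h2></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element in the subtree.
# The _tag criterion is matched against the element's tag name.
my @headings = $tree->look_down(_tag => qr/^h[12]$/);

my @titles = map { $_->as_text } @headings;
print "$_\n" for @titles;

$tree->delete;    # release the tree once finished
```

This suits the case where only a few tags matter; a recursive walk is still the better fit when every element needs to be inspected in turn.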
Re: Inspecting each element in a tree, specifically HTML::Tree
by Anonymous Monk on Jul 25, 2012 at 03:44 UTC

      Thanks for all the useful comments.

      I have got to grips with this instance of recursion and got the original HTML::Tree method to work. I will also take a look at Web::Query.
