hulot has asked for the wisdom of the Perl Monks concerning the following question:
I've only used Perl for 2 weeks, so apologies for my likely ignorance.
I want to write a script that downloads a webpage and renders part of that page in wiki format. I appreciate that there is an HTML::WikiConverter module, but I would like to implement this myself, partly because I only want to render some elements of the HTML. I will be using HTML::Tree.
The first step is to build the tree. That appears straightforward:
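#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use LWP::Simple;

# getstore returns an HTTP status code, so check it with is_success
my $status = getstore("http://www.guardian.co.uk", "guardian.htm");
die "Cannot get the page.\n" unless is_success($status);

# build the tree from the saved file
my $tree = HTML::TreeBuilder->new();
$tree->parse_file("guardian.htm");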
In pseudo (pseudo) code I wish to look at each element of the page. For each element, if the tag is one I'm interested in, then I wish to take the text of the element and render it to wiki format.
I just don't understand how to loop through all the elements. A discussion in the HTML::Tree documentation suggests a recursive method of accessing all the elements:
{
    my $counter = 'x0000';
    sub give_id {
        my $x = $_[0];
        $x->attr('id', $counter++) unless defined $x->attr('id');
        foreach my $c ($x->content_list) {
            give_id($c) if ref $c; # ignore text nodes
        }
    };
    give_id($start_node);
}
But I don't understand this code and can't adapt it.
Once I have a 'loop' method of looking at each element I propose processing them like this:
if ($element->tag eq 'h1' or $element->tag eq 'h2') {
    my $content = $element->as_text();
    print outfile "====$content====\n";
}
I will have several elsif statements doing something similar with other tags.
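As an alternative to a long elsif chain, the tag-to-markup mapping could be kept in a hash and looked up per element. The following is only a sketch: the `%wiki_markup` table and the `wiki_line` helper are hypothetical names, the `====` delimiter for h1 and h2 is taken from the snippet above, and the `'''` entry for `b` is an illustrative assumption rather than a rule of any particular wiki.

```perl
use strict;
use warnings;

# Hypothetical table: tag name => delimiter wrapped around the element text.
# Only the h1/h2 entries come from the question; the 'b' entry is assumed.
my %wiki_markup = (
    h1 => '====',
    h2 => '====',
    b  => "'''",
);

# Return the wiki text for a tag, or nothing if the tag is not of interest.
sub wiki_line {
    my ($tag, $text) = @_;
    my $mark = $wiki_markup{$tag};
    return unless defined $mark;
    return "$mark$text$mark";
}

print wiki_line('h1', 'Headline'), "\n";
```

Adding support for another tag then means adding one hash entry instead of another elsif branch.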
My question then is how can I write a loop that allows me to look at each element in the tree? (The traverse method is deprecated.)
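One way to visit every element is to adapt the recursive pattern from the documentation: handle the current element, then call the same routine on each child that is itself an element. The sketch below is not the only approach; the `visit` routine, the `@lines` array, and the inline sample HTML are made up for illustration, and the tree is built from a string rather than the downloaded file so the example stands alone.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup standing in for the downloaded page (an assumption).
my $html = '<html><body><h1>Top story</h1><p>Some text</p>'
         . '<h2>Second story</h2></body></html>';
my $tree = HTML::TreeBuilder->new_from_content($html);

my @lines;    # collected wiki lines

# Process one element, then recurse into its element children.
sub visit {
    my ($element) = @_;
    my $tag = $element->tag;
    if ($tag eq 'h1' or $tag eq 'h2') {
        push @lines, '====' . $element->as_text . '====';
    }
    for my $child ($element->content_list) {
        # text nodes are plain strings, not objects, so skip them
        visit($child) if ref $child;
    }
}

visit($tree);
print "$_\n" for @lines;
$tree->delete;    # free the tree once finished
```

If only a handful of tags matter, calling `look_down` with a `_tag` criterion may be simpler than walking the whole tree by hand.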
Replies are listed 'Best First'.
Re: Inspecting each element in a tree, specifically HTML::Tree
by tobyink (Canon) on Jul 24, 2012 at 11:52 UTC
Re: Inspecting each element in a tree, specifically HTML::Tree
by daxim (Curate) on Jul 24, 2012 at 16:32 UTC
by tobyink (Canon) on Jul 25, 2012 at 00:16 UTC
Re: Inspecting each element in a tree, specifically HTML::Tree
by Anonymous Monk on Jul 25, 2012 at 03:44 UTC
by hulot (Initiate) on Jul 25, 2012 at 19:07 UTC