PerlMonks  

Re: Possible to treat an HTML::TreeBuilder object as a filehandle?

by Util (Priest)
on Feb 13, 2014 at 21:40 UTC (#1074900)


in reply to Possible to treat an HTML::TreeBuilder object as a filehandle?

You are barking up the wrong tree.

You could make that approach work, but taking data that has already been parsed (by HTML::TreeBuilder in this case), dumping it to an unparsed format (via as_HTML), and then reparsing it (via regexes) is a red flag.

Even if it were not a bad idea in general, as_HTML does not always output the one-tag-per-line format that your code would need.

Your task is complicated by the UL and LI tags not occurring within the SPAN tag. By the time you are processing an LI tag, the author in the previous SPAN tag cannot be reached directly, since the SPAN comes before the LI but is not a parent of it.
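To make that relationship concrete, here is a minimal sketch that inspects a cut-down version of the markup. The one-line HTML string is made up for illustration, and I am assuming TreeBuilder places the SPAN, the filler text, and the UL side by side under the implicit BODY. It is an alternative way to look at the structure, not the approach used in the full program in this post.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical, cut-down input: one author SPAN followed by one UL.
my $tree = HTML::TreeBuilder->new_from_content(
    '<span>Author_name</span> filler <ul><li>book 1</li></ul>'
);

my $li = $tree->look_down( _tag => 'li' );
my $ul = $li->parent;

# lineage() lists the ancestors of the LI, nearest first.
# No SPAN should appear among them.
my @ancestors = map { $_->tag } $li->lineage;

# left() in list context returns the earlier siblings of the UL,
# plain text included, so keep only element objects tagged 'span'.
my @spans  = grep { ref $_ and $_->tag eq 'span' } $ul->left;
my $author = @spans ? $spans[-1]->as_trimmed_text : undef;

print "ancestors: @ancestors\n";
print 'author: ', ( defined $author ? $author : '(none)' ), "\n";
```

Walking sideways like this works, but it ties the code to the exact sibling layout, which is why remembering the most recent author while iterating is simpler.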

Your impulse to iterate over the tags is good. The "my $author;" line would have to be outside the while() loop, though.
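A minimal sketch of the scoping point, in plain Perl with no modules. The @lines data and the AUTHOR/BOOK prefixes are made up for illustration; they only stand in for whatever your loop actually reads.

```perl
use strict;
use warnings;

# Hypothetical input: an author line followed by that author's books.
my @lines = (
    'AUTHOR Author_name',
    'BOOK book 1',
    'BOOK book 2',
    'AUTHOR New_Author',
    'BOOK book 3',
);

my %book_author;
my $author;    # declared once, so the value survives between iterations

while ( defined( my $line = shift @lines ) ) {
    if ( $line =~ /^AUTHOR (.+)/ ) {
        $author = $1;
    }
    elsif ( $line =~ /^BOOK (.+)/ ) {
        # With "my $author;" inside the loop body, $author would be
        # a fresh, undefined variable here on every book line.
        $book_author{$1} = $author;
    }
}

print "$_ => $book_author{$_}\n" for sort keys %book_author;
```

Moving the declaration inside the while() block makes every book map to undef, because the variable is recreated on each pass.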

find_by_tag_name() accepts multiple tag names and, in list context, returns every matching element in document order, so it will do what you need.

Working, tested code:

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dumper; $Data::Dumper::Sortkeys = 1;

my $tree = HTML::TreeBuilder->new;
$tree->parse( <<'END_OF_HTML' );
<span> Author_name </span>
__filler__
<ul>
    <li> book 1 by Author_name </li>
    <li> book 2 by Author_name </li>
</ul>
<span> New_Author </span>
__filler__
<ul>
    <li> book 1 by new </li>
</ul>
END_OF_HTML
$tree->eof;

# Uncomment to show that as_HTML is a bad fit for this task.
# open my $fh , '<', \( $tree->as_HTML('', ' ') ) or die;
# print $_ while <$fh>;
# exit;

my @tags = $tree->find_by_tag_name( qw( span li ) );

my $current_author;
my %book_author;
my %author_books_HoA;
for my $t (@tags) {
    my $tag_name = $t->tag;
    if ( $tag_name eq 'span' ) {
        $current_author = $t->as_trimmed_text;
    }
    elsif ( $tag_name eq 'li' ) {
        next unless $t->parent->tag eq 'ul';
        my $book_title = $t->as_trimmed_text;
        warn if exists $book_author{$book_title};
        $book_author{$book_title} = $current_author;
        push @{ $author_books_HoA{$current_author} }, $book_title;
    }
    else {
        die "Unexpected tag $tag_name"
    }
}

print Dumper \%book_author, \%author_books_HoA;

Output:

$VAR1 = {
          'book 1 by Author_name' => 'Author_name',
          'book 1 by new' => 'New_Author',
          'book 2 by Author_name' => 'Author_name'
        };
$VAR2 = {
          'Author_name' => [
                             'book 1 by Author_name',
                             'book 2 by Author_name'
                           ],
          'New_Author' => [
                            'book 1 by new'
                          ]
        };
