PerlMonks  

Possible to treat an HTML::TreeBuilder object as a filehandle?

by jms53 (Monk)
on Feb 11, 2014 at 09:03 UTC (#1074367, perlquestion)
jms53 has asked for the wisdom of the Perl Monks concerning the following question:

I'm using HTML::TreeBuilder to extract data from an HTML table. The obvious solution would be to use HTML::TableExtract, which would work if my data were organized nicely.
My data fits entirely in one cell, with titles in span tags, as follows:

<span> Author_name </span> __filler__
<ul>
  <li> book 1 by Author_name </li>
  <li> book 2 by Author_name </li>
</ul>
<span> New_Author </span> __filler__
<ul>.....

By using the look_down method, I was able to create a list of authors and a list of books. What I have not been able to do is assign each book to its author, in order to sort, for example, by number of published books.

I thought of doing something as follows:

my %publications = $tree->look_down( _tag => 'span');
open my $fh , "<", $tree->as_HTML;
while (<$fh>) {
    my $author;
    next unless ($_ =~ /(?:span|li)/;
    $_ =~ /span/ ? $author =~ /\>(.+)\<\/span/
                 : push @publications{$author}, /li\>(.+)\<\/li/;
}

Maybe I'm barking up the wrong tree? Thank you for the input!

J -

Re: Possible to treat an HTML::TreeBuilder object as a filehandle? (perldoc -f open)
by Anonymous Monk on Feb 11, 2014 at 09:16 UTC

    perldoc -f open

    $ perl -le " open my($fh), q{<}, \q{hi}; print readline $fh; "
    hi

    open $handle, $mode, \$string;
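A minimal sketch of that three-argument form, assuming the HTML is already held in a scalar. The `$html` string here is a hand-written stand-in for whatever `$tree->as_HTML` returns, not real output from the module. The detail that matters is the backslash: `open` wants a reference to the scalar, whereas passing the string itself makes Perl look for a file with that name.

```perl
use strict;
use warnings;

# Stand-in for the string returned by $tree->as_HTML (assumption:
# any scalar holding text behaves the same way here).
my $html = "<span>Author_name</span>\n<li>book 1</li>\n<li>book 2</li>\n";

# Note the backslash: open() needs a REFERENCE to the scalar.
# Without it, Perl would try to open a file named after the string's contents.
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";

my @lines;
while ( my $line = <$fh> ) {
    chomp $line;
    push @lines, $line;
}
close $fh;

print scalar(@lines), " lines read\n";
print "$_\n" for @lines;
```

Reading from the handle after `close` would produce the "readline() on closed filehandle" warning, so all reads have to happen before that point.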

      The filehandle no longer kills my script; however, I'm no closer to treating the HTML tree as a filehandle. I'm getting the following output:

      readline () on closed filehandle $fh at project12.pl on line 48.
      readline () on closed filehandle $fh at project12.pl on line 48.
      readline () on closed filehandle $fh at project12.pl on line 48.
      readline () on closed filehandle $fh at project12.pl on line 48.
      J -
        What program?
        $ perl -wle " open my($fh), q{<}, \q{hi}; print readline $fh; close $fh; print readline $fh; "
        hi
        readline() on closed filehandle $fh at -e line 1.
Re: Possible to treat an HTML::TreeBuilder object as a filehandle?
by Util (Priest) on Feb 13, 2014 at 21:40 UTC

    You are barking up the wrong tree.

    You could make that approach work correctly, but taking data that has already been parsed (by HTML::TreeBuilder in this case), dumping it to an unparsed format (via as_HTML), and reparsing it (via regexes), is a red flag.

    Even if it were not a bad idea in general, as_HTML does not always output the one-tag-per-line format that your code would need.

    Your task is complicated by the UL and LI tags not occurring within the SPAN tag. By the time you are processing an LI tag, the author in the preceding SPAN tag cannot be reached directly from it, since the SPAN comes before the LI but is not a parent of it.

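    Your impulse to iterate over the tags is good. The "my $author;" line would have to be outside the while() loop, though; declared inside the loop, the variable is fresh on every iteration, so the author is forgotten before any LI is processed.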
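The sketch below illustrates only the scoping point; it is not an endorsement of line-by-line regex parsing, which the paragraphs above argue against. It assumes a hypothetical input with exactly one tag per line (`$text` and `%books_by_author` are made-up names), which real as_HTML output is not guaranteed to match.

```perl
use strict;
use warnings;

# Hypothetical one-tag-per-line input; real as_HTML output may differ.
my $text = join "\n",
    '<span>Author_name</span>',
    '<li>book 1 by Author_name</li>',
    '<li>book 2 by Author_name</li>',
    '<span>New_Author</span>',
    '<li>book 1 by new</li>',
    '';

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my $author;            # declared OUTSIDE the loop, so it survives iterations
my %books_by_author;

while ( my $line = <$fh> ) {
    if ( $line =~ m{<span>(.+?)</span>} ) {
        $author = $1;                        # remember the most recent author
    }
    elsif ( $line =~ m{<li>(.+?)</li>} ) {
        next unless defined $author;         # skip any book seen before an author
        push @{ $books_by_author{$author} }, $1;
    }
}
close $fh;

for my $name ( sort keys %books_by_author ) {
    printf "%s: %d book(s)\n", $name, scalar @{ $books_by_author{$name} };
}
```

Moving `my $author;` inside the `while` block would leave it undefined whenever an LI line is reached, so no book would ever be recorded.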

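    find_by_tag_name() accepts multiple tag names and, in list context, returns the matching elements in document order, so each LI arrives after the SPAN for its author. That is what you need.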

    Working, tested code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use Data::Dumper; $Data::Dumper::Sortkeys = 1;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse( <<'END_OF_HTML' );
    <span> Author_name </span>
    __filler__
    <ul>
        <li> book 1 by Author_name </li>
        <li> book 2 by Author_name </li>
    </ul>
    <span> New_Author </span>
    __filler__
    <ul>
        <li> book 1 by new </li>
    </ul>
    END_OF_HTML
    $tree->eof;

    # Uncomment to show that as_HTML is a bad fit for this task.
    # open my $fh , '<', \( $tree->as_HTML('', ' ') ) or die;
    # print $_ while <$fh>;
    # exit;

    my @tags = $tree->find_by_tag_name( qw( span li ) );

    my $current_author;
    my %book_author;
    my %author_books_HoA;
    for my $t (@tags) {
        my $tag_name = $t->tag;
        if ( $tag_name eq 'span' ) {
            $current_author = $t->as_trimmed_text;
        }
        elsif ( $tag_name eq 'li' ) {
            next unless $t->parent->tag eq 'ul';
            my $book_title = $t->as_trimmed_text;
            warn if exists $book_author{$book_title};
            $book_author{$book_title} = $current_author;
            push @{ $author_books_HoA{$current_author} }, $book_title;
        }
        else {
            die "Unexpected tag $tag_name"
        }
    }

    print Dumper \%book_author, \%author_books_HoA;

    Output:

    $VAR1 = {
              'book 1 by Author_name' => 'Author_name',
              'book 1 by new' => 'New_Author',
              'book 2 by Author_name' => 'Author_name'
            };
    $VAR2 = {
              'Author_name' => [
                                 'book 1 by Author_name',
                                 'book 2 by Author_name'
                               ],
              'New_Author' => [
                                'book 1 by new'
                              ]
            };
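Since the original goal was to sort by number of published books, here is one possible follow-up step, sketched under the assumption that the data already sits in a hash of arrays shaped like %author_books_HoA in the output above. The hash is rebuilt by hand so the snippet runs on its own; `@by_count` is a name introduced for illustration.

```perl
use strict;
use warnings;

# Same shape as the %author_books_HoA structure dumped above,
# rebuilt by hand so this snippet is self-contained.
my %author_books_HoA = (
    'Author_name' => [ 'book 1 by Author_name', 'book 2 by Author_name' ],
    'New_Author'  => [ 'book 1 by new' ],
);

# Most prolific author first; ties are broken alphabetically
# so the resulting order is deterministic.
my @by_count = sort {
    scalar( @{ $author_books_HoA{$b} } ) <=> scalar( @{ $author_books_HoA{$a} } )
        or $a cmp $b
} keys %author_books_HoA;

for my $author (@by_count) {
    printf "%-12s %d\n", $author, scalar @{ $author_books_HoA{$author} };
}
```

Swapping `$a` and `$b` in the numeric comparison would list the least prolific authors first instead.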
