PerlMonks  

Possible to treat an HTML::TreeBuilder object as a filehandle?

by jms53 (Monk)
on Feb 11, 2014 at 09:03 UTC (#1074367=perlquestion)
jms53 has asked for the wisdom of the Perl Monks concerning the following question:

I'm using HTML::TreeBuilder to extract data from an HTML table. The obvious solution would be to use HTML::TableExtract, which would work if my data were organized nicely.
My data fits entirely in one cell, with titles in span tags, as follows:

    <span> Author_name </span> __filler__
    <ul>
        <li> book 1 by Author_name </li>
        <li> book 2 by Author_name </li>
    </ul>
    <span> New_Author </span> __filler__
    <ul>.....

By using the look_down method, I was able to create a list of authors and a list of books. What I have not been able to do is assign each book to its author, so that I can sort, for example, by number of published books.

I thought of doing something as follows:

    my %publications = $tree->look_down( _tag => 'span');
    open my $fh , "<", $tree->as_HTML;
    while (<$fh>) {
        my $author;
        next unless ($_ =~ /(?:span|li)/;
        $_ =~ /span/ ? $author =~ /\>(.+)\<\/span/
                     : push @publications{$author}, /li\>(.+)\<\/li/;
    }

Maybe I'm barking up the wrong tree? Thank you for the input!

J -

Replies are listed 'Best First'.
Re: Possible to treat an HTML::TreeBuilder object as a filehandle? (perldoc -f open)
by Anonymous Monk on Feb 11, 2014 at 09:16 UTC

    perldoc -f open

    $ perl -le " open my($fh), q{<}, \q{hi}; print readline $fh; "
    hi

    open $handle, $mode, \$string;

      The filehandle doesn't kill my script anymore; however, I'm no closer to treating the HTML tree as a filehandle. I'm getting the following output:

      readline () on closed filehandle $fh at project12.pl on line 48.
      readline () on closed filehandle $fh at project12.pl on line 48.
      readline () on closed filehandle $fh at project12.pl on line 48.
      readline () on closed filehandle $fh at project12.pl on line 48.
      J -
        What program?
        $ perl -wle " open my($fh), q{<}, \q{hi}; print readline $fh; close $fh; print readline $fh; "
        hi
        readline() on closed filehandle $fh at -e line 1.
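A minimal sketch of the `open $handle, $mode, \$string` form from the reply above, with the return value of open checked so that a failed open dies immediately instead of surfacing later as "readline() on closed filehandle". The helper name lines_from_string and the sample text are made up for illustration; they are not from the original posts.

```perl
use strict;
use warnings;

# Hypothetical helper: read a string line by line through an
# in-memory filehandle.
sub lines_from_string {
    my ($text) = @_;

    # Opening a reference to a scalar gives a read handle over its contents.
    # Checking the result means a failed open is reported here, at the source.
    open my $fh, '<', \$text
        or die "Cannot open in-memory filehandle: $!";

    my @lines;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

# Sample input, invented for this sketch.
my @lines = lines_from_string("<span> Author_name </span>\n<li> book 1 </li>\n");
print "$_\n" for @lines;
```

Note that the third argument is a reference to the string, not the string itself; passing the string directly makes open treat it as a file name.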
Re: Possible to treat an HTML::TreeBuilder object as a filehandle?
by Util (Priest) on Feb 13, 2014 at 21:40 UTC

    You are barking up the wrong tree.

    You could make that approach work correctly, but taking data that has already been parsed (by HTML::TreeBuilder in this case), dumping it to an unparsed format (via as_HTML), and reparsing it (via regexes), is a red flag.

    Even if it were not a bad idea in general, as_HTML does not always output the one-tag-per-line format that your code would need.

    Your task is complicated by the UL and LI tags not occurring within the SPAN tag. By the time you are processing an LI tag, the author in the previous SPAN tag cannot be directly accessed, since the SPAN comes before the LI but is not a parent of it.

    Your impulse to iterate over the tags is good. The "my $author;" line would have to be outside the while() loop, though.
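The scoping point can be shown without any HTML parsing. In this rough sketch the list of [tag, text] pairs is a hypothetical stand-in for the tags in document order, and group_books is an invented name; the only thing it is meant to illustrate is that the author variable is declared once, before the loop, so its value survives from a span to the li entries that follow it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed tags: [ tag name, text ] pairs
# in document order.
my @tags = (
    [ span => 'Author_name' ],
    [ li   => 'book 1 by Author_name' ],
    [ li   => 'book 2 by Author_name' ],
    [ span => 'New_Author' ],
    [ li   => 'book 1 by new' ],
);

sub group_books {
    my (@pairs) = @_;
    my %books_for;
    my $author;    # declared outside the loop, so it persists between iterations

    for my $pair (@pairs) {
        my ( $tag, $text ) = @$pair;
        if ( $tag eq 'span' ) {
            $author = $text;    # remember the most recent author
        }
        elsif ( $tag eq 'li' && defined $author ) {
            push @{ $books_for{$author} }, $text;
        }
    }
    return \%books_for;
}

my $books_for = group_books(@tags);
print "$_: ", scalar @{ $books_for->{$_} }, "\n" for sort keys %$books_for;
```

Had `my $author;` been inside the loop body, it would be reset to undef on every iteration and no book could ever be matched to an author.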

    find_by_tag_name() accepts multiple tag names, and so will do what you need.

    Working, tested code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use Data::Dumper; $Data::Dumper::Sortkeys = 1;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse( <<'END_OF_HTML' );
    <span> Author_name </span> __filler__
    <ul>
        <li> book 1 by Author_name </li>
        <li> book 2 by Author_name </li>
    </ul>
    <span> New_Author </span> __filler__
    <ul>
        <li> book 1 by new </li>
    </ul>
    END_OF_HTML
    $tree->eof;

    # Uncomment to show that as_HTML is a bad fit for this task.
    # open my $fh , '<', \( $tree->as_HTML('', ' ') ) or die;
    # print $_ while <$fh>;
    # exit;

    my @tags = $tree->find_by_tag_name( qw( span li ) );

    my $current_author;
    my %book_author;
    my %author_books_HoA;
    for my $t (@tags) {
        my $tag_name = $t->tag;
        if ( $tag_name eq 'span' ) {
            $current_author = $t->as_trimmed_text;
        }
        elsif ( $tag_name eq 'li' ) {
            next unless $t->parent->tag eq 'ul';
            my $book_title = $t->as_trimmed_text;

            warn if exists $book_author{$book_title};
            $book_author{$book_title} = $current_author;

            push @{ $author_books_HoA{$current_author} }, $book_title;
        }
        else {
            die "Unexpected tag $tag_name"
        }
    }

    print Dumper \%book_author, \%author_books_HoA;

    Output:

    $VAR1 = {
              'book 1 by Author_name' => 'Author_name',
              'book 1 by new' => 'New_Author',
              'book 2 by Author_name' => 'Author_name'
            };
    $VAR2 = {
              'Author_name' => [
                                 'book 1 by Author_name',
                                 'book 2 by Author_name'
                               ],
              'New_Author' => [
                                'book 1 by new'
                              ]
            };
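The original question mentioned sorting by number of published books. One possible way to do that, starting from a hash of arrays shaped like %author_books_HoA in the output above, is sketched here. The helper name authors_by_book_count and the alphabetical tie-break are assumptions added for illustration, not part of the reply.

```perl
use strict;
use warnings;

# Same shape and data as %author_books_HoA in the output above.
my %author_books_HoA = (
    'Author_name' => [ 'book 1 by Author_name', 'book 2 by Author_name' ],
    'New_Author'  => [ 'book 1 by new' ],
);

# Hypothetical helper: author names, most books first,
# ties broken alphabetically.
sub authors_by_book_count {
    my ($books_for) = @_;
    return sort {
        # An array in numeric context yields its element count.
        @{ $books_for->{$b} } <=> @{ $books_for->{$a} }
            or $a cmp $b
    } keys %$books_for;
}

for my $author ( authors_by_book_count( \%author_books_HoA ) ) {
    printf "%-12s %d\n", $author, scalar @{ $author_books_HoA{$author} };
}
```

Keeping the books in a hash of arrays keyed by author is what makes this a one-line sort; the flat %book_author hash would need to be inverted first.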

