PerlMonks  

HTML::TreeBuilder scan for first table

by mazdajai (Novice)
on Jan 21, 2016 at 23:07 UTC

mazdajai has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find a way to get the headers of the first table in the HTML below. If I use look_down to search for td elements whose class matches HeaderTitle, it also picks up the headers of the second table. Any suggestions? Expected output:
    $VAR1 = [
        'Status',
        'Results',
        'Schedule Start',
        'Actual Start',
        'Schedule Name',
        'Node Name',
        'Domain Name',
    ];
Current output:
    $VAR1 = [
        'á',
        'Status',
        'Results',
        'Schedule Start',
        'Actual Start',
        'Schedule Name',
        'Node Name',
        'Domain Name',
        'á',
        'Node Name',
        'Node Version',
        'OS Platform',
        'OS Version',
        'Activity',
        'Bytes Transferred'
    ];
Code:
    use HTML::TreeBuilder;
    use Data::Dumper;
    use 5.16.0;

    my $h = HTML::TreeBuilder->new;
    $h->parse_content( do{ local $/; <DATA> } );

    my @headers = map @{ $_->content }, ($h->look_down (
        _tag  => 'td',
        class => qr/HeaderTitle\b?/ ,
    ) ) ;

    print Dumper \@headers;

    __DATA__
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <meta name="GENERATOR" content="TSM Operational Reporting">
    <meta name="ProgId" content="FrontPage.Editor.Document">
    <title>TSM Operational Reporting</title>
    </head>
    <DIV class=HeaderBar>Daily Report TSM 24 hour Report for TSM2T generated at 2016-01-18 09:00:14 on NJ covering 2016-01-17 09:00:14 to 2016-01-18 09:00:13</DIV>
    <body>
    <table border="0" width="100%%">
    <DIV class=FooterBar>Server name: <a href="http://10.1.2.2:1980"> TSM</a>, platform: Linux/ppc64, version: 7.1.3.0, date/time: 01/18/2016 08:55:01</DIV>
    <tr><td width="100%"><p>
    <DIV class=HeaderBar>Client Schedules</DIV>
    <TABLE class=HeaderFrame height=100 cellSpacing=0 cols=3 cellPadding=0 width="100%" border=0 align="left">
      <TR vAlign=top height=100>
        <TD vAlign=top width="100%" height="100">
          <DIV style="overflow: auto; width: "100%"; height: 200; valign: top">
          <TABLE cellSpacing=0 cols=4 cellPadding=0 width="100%" border=0 height="100">
            <TR height=25 nowrap>
              <TD class=HeaderTitleNoVLine height="14" width="10">&nbsp;</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Status</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Results</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Schedule Start</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Actual Start</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Schedule Name</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Node Name</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Domain Name</TD></TR>
            <TR class=AltLight height=22>
              <TD class=AltLightNoVline align=middle height="17" width="10"> </TD>
              <TD class=AltLight align=left height="17">Completed</TD>
              <TD class=AltLight align=left height="17">Successful</TD>
              <TD class=AltLight align=left height="17">2016-01-17-17.00</TD>
              <TD class=AltLight align=left height="17">2016-01-17-17.09</TD>
              <TD class=AltLight align=left height="17">NJDLYBACKUP_5PM</TD>
              <TD class=AltLight align=left height="17">APX23</TD>
              <TD class=AltLight align=left height="17">ST15_DOMAIN</TD></TR>
            <TR class=AltDark height=22>
              <TD class=AltLightNoVline align=middle height="17" width="10"> </TD>
              <TD class=AltLight align=left height="17">Missed</TD>
              <TD class=AltLight align=left height="17">Successful</TD>
              <TD class=AltLight align=left height="17">2016-01-17-17.00</TD>
              <TD class=AltLight align=left height="17">2016-01-17-17.09</TD>
              <TD class=AltLight align=left height="17">NJDLYBACKUP_5PM</TD>
              <TD class=AltLight align=left height="17">APX24</TD>
              <TD class=AltLight align=left height="17">ST15_DOMAIN</TD></TR>
          </TABLE>
          </DIV></TD>
      </TR></TABLE>
    </td>
    </tr>
    <tr><td width="100%"><p>
    <DIV class=HeaderBar>Node Activity Summary</DIV>
    <TABLE class=HeaderFrame height=100 cellSpacing=0 cols=3 cellPadding=0 width="100%" border=0 align="left">
      <TR vAlign=top height=100>
        <TD vAlign=top width="100%" height="100">
          <DIV style="overflow: auto; width: "100%"; height: 200; valign: top">
          <TABLE cellSpacing=0 cols=4 cellPadding=0 width="100%" border=0 height="100">
            <TR height=25 nowrap>
              <TD class=HeaderTitleNoVLine height="14" width="10">&nbsp;</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Node Name</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Node Version</TD>
              <TD class=HeaderTitle noWrap align=left height="14">OS Platform</TD>
              <TD class=HeaderTitle noWrap align=left height="14">OS Version</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Activity</TD>
              <TD class=HeaderTitle noWrap align=left height="14">Bytes Transferred</TD></TR>
            <TR class=AltLight height=22>
              <TD class=AltLightNoVline align=middle height="17" width="10"> </TD>
              <TD class=AltLight align=left height="17">RDFXDB11</TD>
              <TD class=AltLight align=left height="17">7.1.0.0</TD>
              <TD class=AltLight align=left height="17">WinNT</TD>
              <TD class=AltLight align=left height="17">6.01</TD>
              <TD class=AltLight align=left height="17">BACKUP</TD>
              <TD class=AltLight align=left height="17">105,806,011,655</TD></TR>
            <TR class=AltDark height=22>
          </TABLE>
          </DIV></TD>
      </TR></TABLE>

Replies are listed 'Best First'.
Re: HTML::TreeBuilder scan for first table
by jeffa (Bishop) on Jan 21, 2016 at 23:40 UTC

    Try another module like HTML::TableExtract?

    use strict;
    use warnings;
    use Data::Dumper;
    use HTML::TableExtract;

    my $te = HTML::TableExtract->new( decode => 0 );
    $te->parse( do{ local $/; <DATA> } );
    my ($first) = $te->tables;
    print Dumper (($first->rows)[0]);

    __DATA__
    *insert your HTML here*

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: HTML::TreeBuilder scan for first table
by Mr. Muskrat (Canon) on Jan 21, 2016 at 23:29 UTC

    How is look_down supposed to know that you don't want the second table when it is inside the first and also matches the search criteria?

    Update: Removed HTML and updated statement.

      The desired table is nested in an outer table.

      IMO the way to go is to get all the tables into an array (using look_down). The array will have one entry for each table and, as far as I know, they come back in document order. So choose index 1: $tables[0] is the outer table and $tables[1] is the first table nested inside it.

      use HTML::TreeBuilder;
      use Data::Dumper;
      use 5.16.0;

      my $h = HTML::TreeBuilder->new;
      $h->parse_content( do{ local $/; <DATA> } );

      my @tables = $h->look_down('_tag' => 'table');

      # The desired table is the first table nested in the outer table.
      # $tables[0] is the outer table and $tables[1] is the first nested one.
      my $table = $tables[1];

      # content() returns an array ref, so use as_text() to get plain strings.
      # An exact class match skips the HeaderTitleNoVLine spacer cell.
      my @headers = map { $_->as_text }
          $table->look_down( _tag => 'td', class => 'HeaderTitle' );

      print Dumper \@headers;

      __DATA__
      *insert the HTML from the question here*
Re: HTML::TreeBuilder scan for first table ( HTML::TreeBuilder::XPath )
by Anonymous Monk on Jan 22, 2016 at 00:32 UTC
      Thanks for the suggestions, everyone. I will try them. Is there an easier way to inspect the tree elements in TreeBuilder or TableExtract? For example, there are online parsers where you can test a regex; I am curious whether there is anything similar that can help debug when an element is not being retrieved as expected?

        Is there an easier way to inspect the tree elements in TreeBuilder or TableExtract?

        Which name is mentioned in the code?

        For example, there are online parsers where you can test a regex; I am curious whether there is anything similar that can help debug when an element is not being retrieved as expected?

        The *xpather*s help you craft xpaths you can use to retrieve the stuff you want

        When the html changes significantly, you run the *xpather*s to craft new xpaths
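
On inspecting the tree itself: HTML::Element, the class behind every node that HTML::TreeBuilder builds, has a dump method that prints an indented outline of the parsed tree, plus address and depth methods that report where a node sits. A minimal sketch, assuming a small made-up two-table document (the $html below is hypothetical, not the TSM report from the question):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical, simplified stand-in for the report HTML in the question.
my $html = <<'HTML';
<html><body>
<table class="HeaderFrame"><tr><td>
  <table><tr><td class="HeaderTitle">Status</td></tr></table>
</td></tr></table>
<table class="HeaderFrame"><tr><td>
  <table><tr><td class="HeaderTitle">Node Name</td></tr></table>
</td></tr></table>
</body></html>
HTML

my $h = HTML::TreeBuilder->new_from_content($html);

# Print an indented outline of the whole parsed tree to STDOUT.
$h->dump;

# List every table in document order with its position in the tree.
my @tables = $h->look_down( _tag => 'table' );
for my $i ( 0 .. $#tables ) {
    my $t = $tables[$i];
    printf "tables[%d] address=%s depth=%d class=%s\n",
        $i, $t->address, $t->depth, $t->attr('class') // '(none)';
}
```

Running this against the real report should make it easier to see which index or class to target before writing the look_down or XPath expression.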

      Sorry - Guess I got lost in the formatting of the post I replied to. I may not have seen XPath formatted this way before.

      > Or even all in one xpath expression
      > my @headers = $tree->findvalues( q{
      >     (
      >         //table[ @class = 'HeaderFrame' ]
      >     )[1]
      >     //td[ @class = 'HeaderTitle' ]
      > } );

      It's mostly my fault, but maybe the version below is easier to follow, and possibly more familiar and compact?

      my @headers = $tree->findvalues( '(//table[@class="HeaderFrame"])[1]//td[@class="HeaderTitle"]' );
      Ron
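
For anyone who wants to try the one-line XPath above in isolation, here is a self-contained sketch. It assumes HTML::TreeBuilder::XPath is installed and uses a small made-up document (the $html below is hypothetical, trimmed to two HeaderFrame tables) in place of the full report:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical, simplified stand-in for the report HTML in the question.
my $html = <<'HTML';
<html><body>
<table class="HeaderFrame"><tr><td>
  <table><tr>
    <td class="HeaderTitleNoVLine">&nbsp;</td>
    <td class="HeaderTitle">Status</td>
    <td class="HeaderTitle">Results</td>
  </tr></table>
</td></tr></table>
<table class="HeaderFrame"><tr><td>
  <table><tr>
    <td class="HeaderTitle">Node Name</td>
    <td class="HeaderTitle">Activity</td>
  </tr></table>
</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# (...)[1] picks the first HeaderFrame table in document order;
# @class="HeaderTitle" is an exact match, so the NoVLine spacer cell is skipped.
my @headers = $tree->findvalues(
    '(//table[@class="HeaderFrame"])[1]//td[@class="HeaderTitle"]'
);

print join( ', ', @headers ), "\n";
```

Because the class test is an exact string comparison, there is no need for a regex on the class attribute here.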
Re: HTML::TreeBuilder scan for first table
by codiac (Beadle) on Jan 24, 2016 at 22:59 UTC
    Did you read the POD for this method?
    look_down
    
      @elements = $h->look_down( ...criteria... );
      $first_match = $h->look_down( ...criteria... );
    
    This starts at $h and looks thru its element descendants (in pre-order), looking for elements matching the criteria you specify. In list context, returns all elements that match all the given criteria; in scalar context, returns the first such element (or undef, if nothing matched).
    
    my $table = $h->look_down('_tag', 'table');
    my @rows = $table->look_down(_tag => 'td', class => qr/HeaderTitle\b?/);
      Yup, I did read the POD. I was trying to do this with a single look_down call, but obviously I cannot do that.
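
One thing to watch with the scalar-context approach: as noted earlier in the thread, the desired table is nested in an outer table, so the first table that look_down finds is the wrapper, which contains both sets of headers. A minimal sketch of narrowing the first call instead, assuming the class=HeaderFrame attribute from the posted HTML and a small made-up document (the $html below is hypothetical):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical, simplified stand-in: one outer table wrapping two HeaderFrame tables.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td>
    <table class="HeaderFrame"><tr>
      <td class="HeaderTitle">Status</td>
      <td class="HeaderTitle">Results</td>
    </tr></table>
  </td></tr>
  <tr><td>
    <table class="HeaderFrame"><tr>
      <td class="HeaderTitle">Node Name</td>
      <td class="HeaderTitle">Activity</td>
    </tr></table>
  </td></tr>
</table>
</body></html>
HTML

my $h = HTML::TreeBuilder->new_from_content($html);

# Scalar context returns the first match in pre-order: here, the outer wrapper.
my $outer = $h->look_down( _tag => 'table' );
my @all   = map { $_->as_text }
    $outer->look_down( _tag => 'td', class => 'HeaderTitle' );

# Adding a class criterion makes the first match the first HeaderFrame table.
my $frame = $h->look_down( _tag => 'table', class => 'HeaderFrame' )
    or die "no HeaderFrame table found";
my @first = map { $_->as_text }
    $frame->look_down( _tag => 'td', class => 'HeaderTitle' );

print "outer table:       @all\n";
print "first HeaderFrame: @first\n";
```

This still takes two look_down calls, but the second one is scoped to a single subtree, so headers from later tables cannot leak in.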

Node Type: perlquestion [id://1153329]
Front-paged by Corion