PerlMonks  

Extract HTML Table rows

by kalyanrajsista (Scribe)
on Dec 23, 2009 at 13:57 UTC ( #814093=perlquestion )

kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to extract table rows from the following HTML. My code produces the desired output, but is there another way to map the table rows to key-value pairs, or to an array of arrays holding the td elements under each tr?

    <html><head><title>Person Profile</title></head>
    <center>
    <font size=5><b>Profile</b></font>
    <table cellspacing="1" cellpadding="1">
    <tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
    <tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
    <tr> <td class="rlab">Company:</td> <td class="l">Idea</a></td> </tr>
    </tr>
    <tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
    </table>
    </body></html>

I'm trying the following code

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Parse html content using html-treebuilder:
    my $root = HTML::TreeBuilder->new();
    $root->parse($html);
    $root->eof();

    my @tables = $root->look_down(_tag => 'table');
    while (@tables) {
        my $node = shift @tables;
        if (ref $node) {
            unshift @tables, $node->content_list;
        }
        else {
            print $node, "\n";
        }
    }
    $root = $root->delete;

OUTPUT is

    ---------- Perl ----------
    Short Name:
    John
    Long Name:
    John Abraham
    Company:
    Idea
    Currency:
    EUR

    Output completed (0 sec consumed) - Normal Termination

Replies are listed 'Best First'.
Re: Extract HTML Table rows
by bobf (Monsignor) on Dec 23, 2009 at 14:44 UTC

    I have found HTML::TableExtract to be easy to use in simple cases:

        use strict;
        use warnings;
        use HTML::TableExtract;

        my $content;
        {
            local $/ = undef;    # slurp mode
            $content = <DATA>;
        }

        my $te = HTML::TableExtract->new();
        $te->parse( $content );

        foreach my $ts ( $te->tables() ) {
            foreach my $row ( $ts->rows() ) {
                print join ( "\t", @$row ), "\n";
            }
        }

        __DATA__
        <html><head><title>Person Profile</title></head>
        <center>
        <font size=5><b>Profile</b></font>
        <table cellspacing="1" cellpadding="1">
        <tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
        <tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
        <tr> <td class="rlab">Company:</td> <td class="l">Idea</a></td> </tr>
        </tr>
        <tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
        </table>
        </body></html>

Re: Extract HTML Table rows
by suaveant (Parson) on Dec 23, 2009 at 14:40 UTC
    There are modules specifically for handling HTML tables...

    HTML::TableExtract
    HTML::TableParser

                    - Ant

Re: Extract HTML Table rows
by wfsp (Abbot) on Dec 23, 2009 at 15:16 UTC
    The modules recommended by suaveant and bobf are a good bet. If you want to use HTML::TreeBuilder, the following is one way to do it.
        #! /usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;
        $Data::Dumper::Indent=1;
        use HTML::TreeBuilder;

        my $t = HTML::TreeBuilder->new_from_file(*DATA);
        my ($table) = $t->look_down(_tag => q{table});
        my @rows = $table->look_down(_tag => q{tr});

        my %db;
        for my $row (@rows){
            my $key   = $row->look_down(class => q{rlab})->as_text;
            my $value = $row->look_down(class => q{l})->as_text;
            $db{$key} = $value;
        }

        for my $key (keys %db){
            printf qq{%s -> %s\n}, $key, $db{$key};
        }

        __DATA__
        <html><head><title>Person Profile</title></head>
        <center>
        <font size=5><b>Profile</b></font>
        <table cellspacing="1" cellpadding="1">
        <tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
        <tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
        <tr> <td class="rlab">Company:</td> <td class="l">Idea</a></td> </tr>
        </tr>
        <tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
        </table>
        </body></html>

    Output:

        Company: -> Idea
        Long Name: -> John Abraham
        Currency: -> EUR
        Short Name: -> John
    I've assumed that:
    • there is one table,
    • each row has two columns, each with a class as in your sample data.
    You would probably want to include some error checking to confirm those assumptions though.
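
As a rough sketch of the error checking suggested above, the following wraps the HTML::TreeBuilder approach in a helper that dies unless exactly one table is found and skips, with a warning, any row that does not have exactly two td cells. It returns an array of arrays (one [label, value] pair per tr), which also answers the original question and can be turned into a hash in one line. The name extract_rows, the shortened sample HTML, and the choice to skip malformed rows rather than die are assumptions made for illustration; they are not part of the original replies.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns a list of
# [label, value] array refs, one per well-formed <tr>.
sub extract_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Assumption 1: the document contains exactly one table.
    my @tables = $tree->look_down(_tag => 'table');
    if (@tables != 1) {
        my $found = @tables;
        $tree->delete;
        die "expected exactly one table, found $found\n";
    }

    my @rows;
    for my $tr ($tables[0]->look_down(_tag => 'tr')) {
        # Assumption 2: every row has exactly two <td> cells.
        my @cells = $tr->look_down(_tag => 'td');
        if (@cells != 2) {
            warn "skipping row with " . @cells . " cell(s)\n";
            next;
        }
        # Copy the text out and trim it before the tree is deleted.
        push @rows, [
            map { my $text = $_->as_text; $text =~ s/^\s+|\s+$//g; $text } @cells
        ];
    }

    $tree->delete;    # free the parse tree
    return @rows;
}

# Shortened sample data, with one deliberately incomplete row.
my $html = <<'HTML';
<table cellspacing="1" cellpadding="1">
<tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
<tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
<tr> <td class="rlab">Incomplete row</td> </tr>
</table>
HTML

my @rows    = extract_rows($html);                  # array of arrays
my %profile = map { $_->[0] => $_->[1] } @rows;     # key-value pairs

printf "%s -> %s\n", @$_ for @rows;
```

Keeping the rows as an array of arrays preserves the order of the table, which a plain hash does not. Building the hash from that array gives both views without parsing twice.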

Node Type: perlquestion [id://814093]
Approved by toolic