HTML Table Parse

by Agyeya (Hermit)
on May 13, 2004 at 10:03 UTC

Agyeya has asked for the wisdom of the Perl Monks concerning the following question:

I have a web page that shows a calendar. The calendar is an HTML table nested inside another table. The days are shown in three different colours: red, blue and black, each colour indicating a different status. There are no "th" header cells in the tables.
I tried using HTML::Parser, but could not get the proper output. What I want is to check the colour of each day and record it in a MySQL table that holds the day and its colour.
A sample of the HTML code for a day is given below:
<td align="Center" bgcolor="White" width="14%"><font face="Verdana" color="Black" size="1">2</font></td>
This refers to the second day of the month, '2', and has the color "Black".
So how do I find the attribute value for each day, and how do I go through the days in increasing order?
Coz if I search for the text '1', it might turn up 1, 10..19, 21, etc.

Replies are listed 'Best First'.
Re: HTML Table Parse
by matija (Priest) on May 13, 2004 at 11:13 UTC
    use HTML::TokeParser;

    # $tablesource is the name of the file holding the saved page
    $p = HTML::TokeParser->new($tablesource)
        || die "Can't open $tablesource: $!";
    # ....
    while (my $token = $p->get_token) {
        if (($$token[0] eq "S") && (lc $$token[1] eq "td")) {
            $intd = 1;
        } elsif (($$token[0] eq "E") && (lc $$token[1] eq "td")) {
            $intd = 0;
        } elsif (($$token[0] eq "S") && (lc $$token[1] eq "font")) {
            # a start token is ["S", $tag, $attr, $attrseq, $text],
            # so the attribute hash is element 2
            $color = $$token[2]->{color};
        } elsif (($$token[0] eq "E") && (lc $$token[1] eq "font")) {
            $color = $defaultcolor;
        } elsif ($$token[0] eq "T") {
            next unless $intd;
            $text = $$token[1];
            $color{$text} = $color;
        }
    }
    At this point you have a hash called %color where the key is the number from the table, and the value is the color.
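
    To walk the days in increasing order, as the question asks, one option is a numeric sort on the hash keys. This is only a minimal sketch: the %color contents below are made-up sample data standing in for whatever the parser collected, and the grep that drops non-numeric keys is an assumption about what stray text might end up in the hash.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the %color hash built by the parser.
my %color = (10 => 'Red', 2 => 'Black', 1 => 'Blue', 21 => 'Black');

# Keep only keys that are whole numbers, then sort numerically (<=>),
# since the default string sort would put 10 before 2.
my @days = sort { $a <=> $b } grep { /^\d+$/ } keys %color;

for my $day (@days) {
    print "Day $day is $color{$day}\n";
}
```

    Each day and colour pair visited in the loop could then be passed to a prepared INSERT statement (for example through DBI) to fill the MySQL table.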
Re: HTML Table Parse
by cees (Curate) on May 13, 2004 at 12:43 UTC

    Another module that you might be able to use is Template::Extract. The data would need to be structured fairly consistently for this to work, but it can be an easy way to pull data out of an HTML page. From the perldocs:

    This module adds template extraction functionality to the Template toolkit. It can take a rendered document and its template together, and get the original data structure back, effectively reversing the Template::process function.

    Here is something to get you started:

    use Template::Extract;
    use Data::Dumper;

    my $obj = Template::Extract->new;

    # Heredoc bodies and their '.' terminators must start at column 0 when run.
    my $template = << '.';
    <table>[% FOREACH record %]<td align="Center" bgcolor="[% color %]" width="14%"><font face="Verdana" color="Black" size="1">[% day %]</font></td>[% ... %][% END %]</table>
    .

    my $document = << '.';
    <table><td align="Center" bgcolor="White" width="14%"><font face="Verdana" color="Black" size="1">2</font></td>
    <td align="Center" bgcolor="Red" width="14%"><font face="Verdana" color="Black" size="1">3</font></td></table>
    .

    print Data::Dumper::Dumper(
        $obj->extract($template, $document)
    );

    That prints out the following:

    $VAR1 = {
        'record' => [
            {
                'day' => '2',
                'color' => 'White'
            },
            {
                'day' => '3',
                'color' => 'Red'
            }
        ]
    };

    - Cees

Re: HTML Table Parse
by ambrus (Abbot) on May 13, 2004 at 10:14 UTC

    Coz if i search for the text '1' , it might turn up with 1, 10..19,21,etc

    /\b1\b/, see perlre
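
    To see the difference the word boundaries make, here is a small sketch. The list of cell texts is hypothetical, chosen to include the values the question worries about.

```perl
use strict;
use warnings;

# Hypothetical cell texts as they might come out of the calendar.
my @cells = qw(1 10 11 19 21 31 2);

my @loose = grep { /1/ }     @cells;  # any text containing the digit 1
my @exact = grep { /\b1\b/ } @cells;  # 1 as a whole word only

print "loose: @loose\n";
print "exact: @exact\n";
```

    When the cell text has already been isolated, an anchored match such as /^1$/ or a numeric comparison is another option.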

Re: HTML Table Parse
by TilRMan (Friar) on May 13, 2004 at 10:34 UTC

    I admit I can't get my head around HTML::Parser yet, but perhaps you can cheat and use a regular expression.

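    # FH is assumed to be a filehandle already opened on the saved page
    while (<FH>) {
        # [^"]* keeps the colour capture inside its own quotes,
        # even when several cells share one line
        if (m[width="14%"><font \S+ color="([^"]*)" \S+>(\d+)<]) {
            my ($color, $day) = ($1, $2);
            print "Day $day is $color\n";
        }
    }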

    IMO, when the HTML you are parsing is being generated consistently, a regexp will be reliable enough and easy to adapt in the future. For safety, you can check that you found 30ish days.
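
    The count check could look something like the sketch below. The two input lines are made up in the shape of the sample cell from the question, the %color_of hash and the 28 to 31 range are assumptions for illustration, and the pattern uses ([^"]*) rather than (.*) for the colour so the capture cannot run past the closing quote.

```perl
use strict;
use warnings;

# Two made-up lines in the same shape as the sample cell from the question.
my @lines = (
    '<td align="Center" bgcolor="White" width="14%"><font face="Verdana" color="Black" size="1">2</font></td>',
    '<td align="Center" bgcolor="White" width="14%"><font face="Verdana" color="Red" size="1">3</font></td>',
);

my %color_of;
for (@lines) {
    if (m[width="14%"><font \S+ color="([^"]*)" \S+>(\d+)<]) {
        $color_of{$2} = $1;    # day number => font colour
    }
}

# A real month should yield 28 to 31 entries; anything else suggests
# the page layout changed and the pattern no longer matches.
my $found     = keys %color_of;
my $plausible = ($found >= 28 && $found <= 31) ? 1 : 0;
print "Found $found days", ($plausible ? "\n" : " (unexpected count)\n");
```

    With only two sample lines the check reports an unexpected count, which is the behaviour you would want to see if the real page stopped matching.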

Node Type: perlquestion [id://353023]
Approved by gellyfish