<?xml version="1.0" encoding="windows-1252"?>
<node id="280493" title="4Re: How do I extract text from an HTML page?" created="2003-08-03 15:02:28" updated="2005-07-30 20:21:26">
<type id="11">
note</type>
<author id="18800">
jeffa</author>
<data>
<field name="doctext">
Two suggestions: test before you post (and then test some
more) and don't use regexes to parse HTML. Granted, this
is trivial HTML to parse, but the more you use parsers, the
better you get at it. Also, the more you use templating
modules, the better you get at them. I hate to just outright
solve the problem, but I did. Here is a link that you will have to click to see my solution, so any readers have
been warned.
&lt;p&gt;
&lt;tt&gt;&amp;lt;blink&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;a href="?node_id=280493&amp;displaytype=displaycode"&gt;WARNING *SPOILERS* CLICK AT OWN RISK!&lt;/a&gt;&lt;br/&gt;
&lt;tt&gt;&amp;lt;/blink&gt;&lt;/tt&gt;
&lt;/p&gt;
&lt;!--
&lt;code&gt;
#!/usr/bin/perl -T

use strict;
use warnings;
use CGI;
use LWP::Simple;
use HTML::Template;
use HTML::TokeParser::Simple;

my $query    = CGI-&gt;new;
my $template = HTML::Template-&gt;new(
   filehandle =&gt; \*DATA,
   associate  =&gt; $query,
);

my %student = get_students();
my $student = $query-&gt;param('student') || '';

if (exists $student{$student}) {
   $template-&gt;param(%{$student{$student}});
} elsif ($query-&gt;param('go')) {
   $template-&gt;param(error=&gt;"$student does not exist. Try again.",);
}

print $query-&gt;header, $template-&gt;output;

sub get_students {
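   # get() returns undef on failure, so stop here rather than parse nothing
   my $html   = get('http://jamiel.jalme.com/finalgrades.html')
      or die "Could not fetch the grades page\n";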
   my $parser = HTML::TokeParser::Simple-&gt;new(\$html);

   my (%student,$token,$name,$grade);
   while ($token = $parser-&gt;get_token) {
      if ($token-&gt;is_start_tag('b')) {
         # the text token right after &lt;b&gt; is the student's name
         $token = $parser-&gt;get_token or last;
         $name = $token-&gt;as_is;
      } elsif ($token-&gt;is_end_tag('b')) {
         # the text token right after &lt;/b&gt; ends with the numeric grade
         $token = $parser-&gt;get_token or last;
         ($grade) = $token-&gt;as_is =~ /(\d+)$/;
         $student{$name} = {
            full_name  =&gt; $name,
            grade      =&gt; $grade,
            first_name =&gt; (split /\s+/, $name)[0],
         };
      }
   }
   return %student;
}

__DATA__
&lt;tmpl_unless full_name&gt;

&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Student Grade Lookup&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;tmpl_if error&gt;&lt;p&gt;&lt;tmpl_var escape=HTML name="error"&gt;&lt;/p&gt;&lt;/tmpl_if&gt;
&lt;form&gt;
Enter name of student:
&lt;input type="text" name="student" /&gt;&lt;br/&gt;
&lt;input type="submit" name="go" value="Enter" /&gt;
&lt;/form&gt;
&lt;/body&gt;
&lt;/html&gt;

&lt;tmpl_else&gt;

&lt;html&gt;
&lt;head&gt;
&lt;title&gt;CONGRATULATIONS TO THE PARTICIPANT &lt;tmpl_var first_name&gt;&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;h1&gt;&lt;tmpl_var full_name&gt;&lt;/h1&gt; Has just gotten a "&lt;tmpl_var grade&gt;" as a
grade for this Seminar. Congratulations!!!&lt;br&gt;
&lt;/body&gt;
&lt;/html&gt;

&lt;/tmpl_unless&gt;
&lt;/code&gt;
--&gt;
&lt;p&gt;
It's a lot more code than you posted, but it does a lot
more as well. ;)
&lt;/p&gt;
&lt;p&gt;jeffa&lt;/p&gt;
&lt;font size=1&gt;
&lt;pre&gt;
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(&lt;a href="http://jeffa.perlmonk.org/tripdid.mp3"&gt;the triplet paradiddle with high-hat&lt;/a&gt;)
&lt;/pre&gt;&lt;/font&gt;</field>
<field name="root_node">
280396</field>
<field name="parent_node">
280473</field>
</data>
</node>
