Life in the land of OOP, and I'm confused.

jcwren has asked for the wisdom of the Perl Monks concerning the following question:

Thanks to ase (who I can't put in brackets, I get a server error), I've been playing with the HTML::TableExtract module. This is a really slick little module for extracting table data from HTML pages. However, it has a minor drawback for what I'm trying to do. If there is any HTML data between the <TD> and </TD> tags, it gets stripped. I would like it to return the HTML between the tags, and I've figured out how to do that. Unfortunately, I can't figure out how to access the data I've stored. Below is a model of what's happening:

HTML::TableExtract is a sub-class of HTML::Parser.
jcwExtract is a sub-class of HTML::TableExtract.
jcwExtract overrides the start() method of HTML::TableExtract, and successfully pulls the data I want, storing it in a private hash (the classic $self->{sometag} = thatdata).
HTML::TableExtract has an internal package called HTML::TableExtract::TableState.
HTML::TableExtract::TableState has a method called _add_text() that adds the text found in the table to a 'TableState object'.

I need to override the _add_text() method in the HTML::TableExtract::TableState class, which I can do with 'sub HTML::TableExtract::TableState::_add_text'. This is dirty, but works (with a warning). I'd rather subclass the HTML::TableExtract::TableState package, and invoke the parent _add_text() routine with a $self->SUPER::_add_text() call. However, since the HTML::TableExtract::TableState package is internal to the HTML::TableExtract module, and HTML::TableExtract explicitly does a '$ts = new HTML::TableExtract::TableState()', I don't know how to accomplish the goal.
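The "dirty" override described above can be modelled without the real module. What follows is only a minimal sketch: `Demo::TableState` is a hypothetical stand-in, not the actual HTML::TableExtract::TableState code. It uses a glob assignment rather than a named `sub`, so that the original `_add_text` can be saved first and called from the replacement, which gets close to what `SUPER::` would give. The `no warnings 'redefine'` line silences the warning.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the module's internal class (hypothetical; not the real code).
package Demo::TableState;

sub new { my $class = shift; return bless { text => '' }, $class }

sub _add_text {
    my ($self, $txt) = @_;
    $self->{text} .= $txt;
    return $self;
}

package main;

# Keep a reference to the original method, then replace it in place.
my $orig_add_text = \&Demo::TableState::_add_text;
{
    no warnings 'redefine';    # suppress the "Subroutine ... redefined" warning
    *Demo::TableState::_add_text = sub {
        my ($self, $txt) = @_;
        # Custom behaviour goes here; this sketch just brackets the text
        # and hands it to the original method.
        return $self->$orig_add_text("[$txt]");
    };
}

my $ts = Demo::TableState->new;
$ts->_add_text('cell one');
print $ts->{text}, "\n";
```

This still patches someone else's package globally, so every user of the class in the same process sees the new behaviour.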

The _add_text() that I provide needs to access the data I've stored in the jcwExtract module. I could do this if I figured out how to access the parent's parent data (HTML::TableExtract::TableState -> HTML::TableExtract -> jcwExtract), but it feels unclean. I'd rather figure out how to subclass the HTML::TableExtract::TableState module and override the _add_text() method.

I would post the code, but it's a little lengthy, so instead, here's a link to it. It's difficult to boil down to a short test case, but I'll try to add some more to it in a bit. I'll be happy to try any suggestions anyone has as to how to pull this off...

--Chris

Re: Life in the land of OOP by merlyn (Sage) on Aug 02, 2000 at 18:14 UTC
The author of `HTML::TableExtract` apparently did not make the interface "pluggable", allowing you to optimize behavior on not only the class you're inheriting, but on classes it also creates and uses. This is typical, in my observation. Unless a class is designed very very very carefully, it's generally not cleanly subclassable for all needs. For this particular case, you'll probably have to override any method that refers by name to `HTML::TableExtract::TableState` to create a new class of your choosing. And yes, that'll require cutting and pasting code for the parts that didn't change. Sucks, doesn't it? What's missing is a method like:

```perl
sub createTableState {
    my $self = shift;
    return HTML::TableExtract::TableState->new(@_);
}

sub initialize_some_stuff {
    my $self = shift;
    # blah blah
    $self->{state} = $self->createTableState;
    # blah blah
}
```

Then you could override just the thing that makes the child object to make one of your objects. Write the author and maybe they'll put that in.

-- Randal L. Schwartz, Perl hacker
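To make the factory-method idea above concrete, here is a self-contained sketch. Every class in it (`Demo::Extract`, `Demo::TableState`, `My::Extract`, `My::TableState`) and the `start_table` method are hypothetical stand-ins; the real HTML::TableExtract does not necessarily provide a `createTableState` hook. It only shows that once the parent builds its helper object through an overridable method, the helper can be subclassed and `SUPER::_add_text` works as usual.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the real classes (hypothetical names and bodies).
package Demo::TableState;
sub new { my $class = shift; return bless { text => '' }, $class }
sub _add_text { my ($self, $txt) = @_; $self->{text} .= $txt; return $self }

package Demo::Extract;
sub new { my $class = shift; return bless {}, $class }

# The factory method: the only place that names the state class.
sub createTableState {
    my $self = shift;
    return Demo::TableState->new(@_);
}

sub start_table {
    my $self = shift;
    $self->{state} = $self->createTableState;
    return $self->{state};
}

# A subclass of the state class can now call SUPER:: normally.
package My::TableState;
our @ISA = ('Demo::TableState');
sub _add_text {
    my ($self, $txt) = @_;
    return $self->SUPER::_add_text(uc $txt);   # upper-casing stands in for real work
}

# The extractor subclass overrides just the factory.
package My::Extract;
our @ISA = ('Demo::Extract');
sub createTableState {
    my $self = shift;
    return My::TableState->new(@_);
}

package main;

my $state = My::Extract->new->start_table;
$state->_add_text('cell');
print ref($state), ': ', $state->{text}, "\n";
```

Nothing in `Demo::Extract` had to be copied into the subclass, which is the point of keeping the class name behind one small method.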
(jcwren) RE: Re: Life in the land of OOP by jcwren (Prior) on Aug 03, 2000 at 07:24 UTC
Using the crasser method of simply providing an HTML::TableExtract::TableState::_add_text() method, how would I go about accessing the data of the jcwExtract class? I can figure out how to access the parent class data (HTML::TableExtract), but not its parent (jcwExtract). Any thoughts? --Chris
Re: Life in the land of OOP by eak (Monk) on Aug 02, 2000 at 20:18 UTC
Here is a quick and dirty parser using HTML::TokeParser, which is an alternative interface to HTML::Parser.

```perl
#!/usr/bin/perl -w
#
use HTML::TokeParser;

my $p = HTML::TokeParser->new("index.html")
    or die "Can't open index.html: $!";

# Walk the token stream; at each <td> start tag, print the text
# that follows, up to the next <td> start tag.
while (my $token = $p->get_token) {
    if ($token->[0] eq 'S' and $token->[1] eq 'td') {
        print $p->get_text('td') . "\n";
    }
}
```
Re: Life in the land of OOP by ase (Monk) on Aug 03, 2000 at 13:44 UTC
This may seem overly simple, but wouldn't it be possible to add the non-tag stripping mode as a parser option to HTML::TableExtract, so that you wouldn't have to subclass and override at all? Just a stray thought, -ase
Re: Life in the land of OOP, and I'm confused. by mojotoad (Monsignor) on Nov 07, 2002 at 16:22 UTC
This is old news, but HTML::TableExtract has had a 'keep_html' parameter since version 1.06 (current ver 1.08). merlyn's comments regarding pluggability are spot on. Mea culpa... Matt

