http://www.perlmonks.org?node_id=25730

jcwren has asked for the wisdom of the Perl Monks concerning the following question:

Thanks to ase (who I can't put in brackets, I get a server error), I've been playing with the HTML::TableExtract module. This is a really slick little module for extracting table data from HTML pages. However, it has a minor drawback for what I'm trying to do. If there is any HTML data between the <TD> and </TD> tags, it gets stripped. I would like it to return the HTML between the tags, and I've figured out how to do that. Unfortunately, I can't figure out how to access the data I've stored. Below is a model of what's happening:

I need to override the _add_text() method in the HTML::TableExtract::TableState class, which I can do with 'sub HTML::TableExtract::TableState::_add_text'. This is dirty, but works (with a warning). I'd rather subclass the HTML::TableExtract::TableState package, and invoke the parent _add_text() routine with a $self->SUPER::_add_text() call. However, since the HTML::TableExtract::TableState package is internal to the HTML::TableExtract module, and HTML::TableExtract explicitly does a '$ts = new HTML::TableExtract::TableState()', I don't know how to accomplish the goal.
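A minimal sketch of the "dirty" override described above, reworked so the replacement can still call the original (a manual substitute for SUPER::_add_text). My::Lib::TableState is a hypothetical stand-in, not the real HTML::TableExtract::TableState, and the warning being silenced is assumed to be Perl's usual "Subroutine redefined" warning.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's internal class; the real
# HTML::TableExtract::TableState is implemented differently.
package My::Lib::TableState;

sub new { my $class = shift; return bless { text => [] }, $class }

sub _add_text {
    my ($self, $text) = @_;
    push @{ $self->{text} }, $text;
    return;
}

package main;

{
    # Keep a reference to the original so the replacement can still call it.
    my $orig = \&My::Lib::TableState::_add_text;

    no warnings 'redefine';    # silences "Subroutine ... redefined"
    *My::Lib::TableState::_add_text = sub {
        my ($self, $text) = @_;
        return $self->$orig(uc $text);    # custom step, then the original
    };
}

my $state = My::Lib::TableState->new;
$state->_add_text('hello');
print "@{ $state->{text} }\n";
```

This changes the method for every user of the class in the running program, which is why it is less clean than a real subclass.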

The _add_text() that I provide needs to access the data I've stored in the jcwExtract module. If I can figure out how to access the parent's parent data (HTML::TableExtract::TableState -> HTML::TableExtract -> jcwExtract), I can do this, but it feels unclean. I'd rather figure out how to subclass the HTML::TableExtract::TableState module and override the _add_text() method.

I would post the code, but it's a little lengthy, so instead, here's a link to it. It's difficult to boil down to a short test case, but I'll try to add some more to it in a bit. I'll be happy to try any suggestions anyone has as to how to pull this off...

--Chris
  • Comment on Life in the land of OOP, and I'm confused.

Re: Life in the land of OOP
by merlyn (Sage) on Aug 02, 2000 at 18:14 UTC
    The author of HTML::TableExtract apparently did not make the interface "pluggable", which would let you customize the behavior not only of the class you're inheriting from, but also of the classes it creates and uses.

    This is typical, in my observation. Unless a class is designed very very very carefully, it's generally not cleanly subclassable for all needs.

    For this particular case, you'll probably have to override any method that refers by name to HTML::TableExtract::TableState, so that it creates an object of a class of your choosing instead. And yes, that'll require cutting and pasting code for the parts that didn't change. Sucks, doesn't it?

    What's missing is a method like:

    sub createTableState {
        my $self = shift;
        return HTML::TableExtract::TableState->new(@_);
    }

    sub initialize_some_stuff {
        my $self = shift;
        # ... other setup ...
        $self->{state} = $self->createTableState;
        # ... more setup ...
    }
    Then you could override just the method that makes the child object, so that it makes one of your objects instead. Write the author and maybe they'll put that in.
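    To show how such a hook would be used from the subclass side, here is a self-contained sketch. Demo::Extract and Demo::TableState are hypothetical stand-ins for the real classes, and start_table is an invented method name; only createTableState comes from the suggestion above.

```perl
use strict;
use warnings;

package Demo::TableState;    # hypothetical stand-in for the internal class
sub new       { my $class = shift; return bless { text => '' }, $class }
sub _add_text { my ($self, $t) = @_; $self->{text} .= $t; return }

package Demo::Extract;       # hypothetical stand-in for the parser class
sub new { my $class = shift; return bless {}, $class }

# The factory method: the only place that names the state class.
sub createTableState { my $self = shift; return Demo::TableState->new(@_) }

sub start_table {
    my $self = shift;
    $self->{state} = $self->createTableState;
    return $self->{state};
}

package My::TableState;      # subclass of the state class
our @ISA = ('Demo::TableState');
sub _add_text {
    my ($self, $t) = @_;
    return $self->SUPER::_add_text("[$t]");    # decorate, then delegate
}

package My::Extract;         # subclass of the parser; overrides only the factory
our @ISA = ('Demo::Extract');
sub createTableState { my $self = shift; return My::TableState->new(@_) }

package main;

my $ts = My::Extract->new->start_table;
$ts->_add_text('cell');
print ref($ts), ": $ts->{text}\n";
```

    With the hook in place, the subclass overrides one short method and inherits everything else, so no code has to be cut and pasted.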

    -- Randal L. Schwartz, Perl hacker

      Using the crasser method of simply providing a HTML::TableExtract::TableState::_add_text() method, how would I go about accessing the data of the jcwExtract class?

      I can figure out how to access the parent class data (HTML::TableExtract), but not its parent (jcwExtract). Any thoughts?
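      One possible way to reach the outer object's data without depending on the module's internals is a package variable that the subclass localizes for the duration of the parse. This is only a sketch: Stub::Extract, Stub::TableState and Wrapped::Extract are hypothetical stand-ins (Wrapped::Extract plays the role of jcwExtract), and the prefix field is invented for illustration.

```perl
use strict;
use warnings;

package Stub::TableState;    # hypothetical stand-in for the internal class
sub new       { my $class = shift; return bless { cells => [] }, $class }
sub _add_text { my ($self, $t) = @_; push @{ $self->{cells} }, $t; return }

package Stub::Extract;       # hypothetical stand-in for the parser class
sub new { my ($class, %opt) = @_; return bless { %opt }, $class }
sub parse {
    my ($self, @chunks) = @_;
    my $ts = Stub::TableState->new;    # hard-coded class name, as in the question
    $ts->_add_text($_) for @chunks;
    return $ts;
}

package Wrapped::Extract;    # plays the role of jcwExtract
our @ISA = ('Stub::Extract');
our $CURRENT;                # the object whose parse() is currently running

sub parse {
    my $self = shift;
    local $CURRENT = $self;  # visible to all code called from here, restored on exit
    return $self->SUPER::parse(@_);
}

package main;

{
    my $orig = \&Stub::TableState::_add_text;
    no warnings 'redefine';
    *Stub::TableState::_add_text = sub {
        my ($self, $t) = @_;
        my $owner  = $Wrapped::Extract::CURRENT;          # the outer object, if any
        my $prefix = $owner ? $owner->{prefix} : '';
        return $self->$orig($prefix . $t);
    };
}

my $result = Wrapped::Extract->new(prefix => '>> ')->parse('a', 'b');
print join(', ', @{ $result->{cells} }), "\n";
```

      Because the variable is set with local, it is undefined again once parse() returns, so state objects used outside a Wrapped::Extract parse fall back to the unmodified behavior.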

      --Chris

Re: Life in the land of OOP
by eak (Monk) on Aug 02, 2000 at 20:18 UTC
    Here is a quick and dirty parser using HTML::TokeParser, which is an alternative interface to HTML::Parser.
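    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new("index.html")
        or die "Can't open index.html: $!";

    # Walk the token stream; at each <td> start tag, print the text up to its </td>.
    while (my $token = $p->get_token) {
        if ($token->[0] eq 'S' and $token->[1] eq 'td') {
            print $p->get_text('/td'), "\n";
        }
    }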
Re: Life in the land of OOP
by ase (Monk) on Aug 03, 2000 at 13:44 UTC
    This may seem overly simple, but wouldn't it be possible to add the non-tag-stripping mode as a parser option to HTML::TableExtract, so that you wouldn't have to subclass and override at all?
    Just a stray thought,
    -ase
Re: Life in the land of OOP, and I'm confused.
by mojotoad (Monsignor) on Nov 07, 2002 at 16:22 UTC
    This is old news, but HTML::TableExtract has had a 'keep_html' parameter since version 1.06 (current ver 1.08).

    merlyn's comments regarding pluggability are spot on. Mea culpa...

    Matt