http://www.perlmonks.org?node_id=39387


in reply to RE: RE: A grammar for HTML matching
in thread A grammar for HTML matching

I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout. I don't see what about your approach makes it less likely that a minor revision to a site would break something.

It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas of how this approach will be implemented, I encourage you to do it. Perhaps I'd understand once I see an actual example or some code.

-dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...

  • Comment on RE: RE: RE: A grammar for HTML matching

Replies are listed 'Best First'.
RE: RE: RE: RE: A grammar for HTML matching
by mcelrath (Novice) on Nov 01, 2000 at 11:48 UTC
    I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for

    Oh, HTML::Parser works. It's just painfully slow. I started using HTML::Parser for this application, and the minimum time to traverse just about any document is about 1 second. As the complexity of the matching code grew, it got even slower. HTML::Parser would be a more appropriate solution if it could only called start() for some set of specified tags. Or for tags with an attrib that matches some regex. But now I'm getting away from HTML::Parser and starting to specify the grammar I'm interested in. By comparison, by using a regex and then growing it to include appropriate tags, I can do these matches in 0.01 seconds (sometimes ;).

    Another way to look at this is that writing these HTML matching rules is simply much easier and faster than writing a HTML::Parser script. (Ok, so maybe some of you uber-hackers can whip up HTML::Parser in seconds ;) These matcher scripts are far shorter than the HTML::Parser based script to implement them, and don't require knowledge of perl.

    -bsm, who has looked at tchrist's proxy, and just looked again, and it's faster than I had thought (HTML::Parser based). But man it mangles pages. But speed is only tangentially related to this idea. I didn't want to debate the merits of HTML::Parser, but rather see if this HTML matching idea has merit. Even if it's implemented using HTML::Parser.