
RE: RE: RE: A grammar for HTML matching

by dchetlin (Friar)
on Nov 01, 2000 at 10:48 UTC

in reply to RE: RE: A grammar for HTML matching
in thread A grammar for HTML matching

I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout. I don't see what about your approach makes it less likely that a minor revision to a site would break something.

It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas of how this approach will be implemented, I encourage you to do it. Perhaps I'd understand once I see an actual example or some code.

-dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...

RE: RE: RE: RE: A grammar for HTML matching
by mcelrath (Novice) on Nov 01, 2000 at 11:48 UTC
    I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for

    Oh, HTML::Parser works. It's just painfully slow. I started out using HTML::Parser for this application, and the minimum time to traverse just about any document was about 1 second. As the complexity of the matching code grew, it got even slower. HTML::Parser would be a more appropriate solution if it could call start() only for some set of specified tags, or only for tags with an attribute that matches some regex. But at that point I'm getting away from HTML::Parser and starting to specify the grammar I'm interested in. By comparison, by starting with a regex and then growing it to include the appropriate tags, I can do these matches in 0.01 seconds (sometimes ;).
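To make the regex idea concrete, here is a minimal sketch of "only look at tags whose attribute matches some regex", using nothing but core Perl. It is an illustration under assumptions, not the actual matcher: `match_tags`, its argument order, and the sample HTML are all made up for this example, and the pattern deliberately ignores hard cases such as a `>` inside an attribute value or tags inside comments.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of FilterProxy or HTML::Parser):
# return every <$tag ...> start tag whose $attr value matches $value_re.
sub match_tags {
    my ($html, $tag, $attr, $value_re) = @_;
    my @hits;

    # Scan only for the tag of interest; no callback fires for other tags.
    while ($html =~ /(<\Q$tag\E\b[^>]*>)/gi) {
        my $start_tag = $1;

        # Extract the attribute value: double-quoted, single-quoted, or bare.
        if ($start_tag =~ /\s\Q$attr\E\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i) {
            my $value = defined $1 ? $1 : defined $2 ? $2 : $3;
            push @hits, $start_tag if $value =~ $value_re;
        }
    }
    return @hits;
}

# Made-up sample page for the sketch.
my $html = <<'HTML';
<html><body>
<p>Hello <a href="/home">home</a></p>
<img src="http://ads.example.com/banner.gif" width="468">
<img src="/images/logo.png">
</body></html>
HTML

# Find <img> tags whose src points at an "ads." host.
my @ads = match_tags($html, 'img', 'src', qr/ads\./);
print "$_\n" for @ads;
```

On the sample page this picks out only the banner image and never touches the `<p>`, `<a>`, or the second `<img>`. That is the trade-off being described: the regex is cheap because it skips everything it was not asked about, but it is also only as correct as the pattern is careful.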

    Another way to look at this is that writing these HTML matching rules is simply much easier and faster than writing an HTML::Parser script. (OK, so maybe some of you uber-hackers can whip up an HTML::Parser script in seconds ;) The matching rules are far shorter than the HTML::Parser-based script needed to implement them, and they don't require knowledge of Perl.

    -bsm, who has looked at tchrist's proxy, and just looked again, and it's faster than I had thought (HTML::Parser based). But man it mangles pages. But speed is only tangentially related to this idea. I didn't want to debate the merits of HTML::Parser, but rather see if this HTML matching idea has merit. Even if it's implemented using HTML::Parser.
