
RE: RE: RE: A grammar for HTML matching

by dchetlin (Friar)
on Nov 01, 2000 at 10:48 UTC (#39387)

in reply to RE: RE: A grammar for HTML matching
in thread A grammar for HTML matching

I guess I'm still not seeing it -- I've never run into an application of that kind that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout, or what about it makes it less likely that a minor revision to a site would break something.

It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas for how to implement this approach, I encourage you to go ahead. Perhaps I'd understand once I see an actual example or some code.

-dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...


Replies are listed 'Best First'.
RE: RE: RE: RE: A grammar for HTML matching
by mcelrath (Novice) on Nov 01, 2000 at 11:48 UTC
    I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for

    Oh, HTML::Parser works. It's just painfully slow. I started out using HTML::Parser for this application, and the minimum time to traverse just about any document was about 1 second. As the complexity of the matching code grew, it got even slower. HTML::Parser would be a more appropriate solution if it called start() only for some specified set of tags, or only for tags with an attribute that matches some regex. But at that point I'm moving away from HTML::Parser and toward specifying the grammar I'm interested in. By comparison, by starting with a regex and then growing it to include the appropriate tags, I can do these matches in 0.01 seconds (sometimes ;).
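    As a rough illustration of the regex-first idea (this is not FilterProxy's code or its rule syntax; the helper name match_tags and the sample HTML are made up for this sketch), a matcher that looks only at one tag type and tests one attribute against a regex might look like this in core Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: return every <$tag ...> start tag in $html whose
# $attr value matches $value_re. Regex-based, so it is quick but naive:
# it ignores comments and scripts, and breaks on '>' inside attribute values.
sub match_tags {
    my ($html, $tag, $attr, $value_re) = @_;
    my @hits;
    while ($html =~ /(<\Q$tag\E\b[^>]*>)/gi) {
        my $start_tag = $1;
        # Pull out the attribute value: double-quoted, single-quoted, or bare.
        next unless $start_tag =~
            /\b\Q$attr\E\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;
        my $value = defined $1 ? $1 : defined $2 ? $2 : $3;
        push @hits, $start_tag if $value =~ $value_re;
    }
    return @hits;
}

# Made-up sample page: one ad image, one ordinary image, one link.
my $html = <<'HTML';
<p>News <img src="/ads/banner.gif" width="468"> and
<img src="/images/logo.gif"> plus <a href="/ads/click">link</a></p>
HTML

# Only <img> tags are examined, and only those whose src starts with /ads/.
my @ads = match_tags($html, 'img', 'src', qr{^/ads/});
print "$_\n" for @ads;
```

    The point of the sketch is only that the work is restricted to the tags of interest instead of firing a callback for every token; a real rule language would need to deal with the cases the comments list as unhandled.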

    Another way to look at this is that writing these HTML matching rules is simply much easier and faster than writing an HTML::Parser script. (OK, so maybe some of you uber-hackers can whip up an HTML::Parser script in seconds ;) The matching rules are far shorter than the HTML::Parser-based script needed to implement them, and they don't require knowledge of Perl.

    -bsm, who has looked at tchrist's proxy, and just looked again, and it's faster than I had thought (HTML::Parser based). But man it mangles pages. But speed is only tangentially related to this idea. I didn't want to debate the merits of HTML::Parser, but rather see if this HTML matching idea has merit. Even if it's implemented using HTML::Parser.
