PerlMonks  

RE: A grammar for HTML matching

by dchetlin (Friar)
on Nov 01, 2000 at 09:56 UTC


in reply to A grammar for HTML matching

It's not clear to me exactly what problem you're trying to solve, but the idea is interesting. It's true that HTML::Parser and friends can be slow. However, I'm sure you're aware of the difficulty of parsing markup correctly. Basically, for something like this there's a tradeoff between speed and generality; my guess is that you could put together something much faster that served your particular purpose here, but it wouldn't scale or solve much else.

In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

Finally, if you haven't seen HTML::TreeBuilder, take a look at that. My guess is that you know about it and would consider it too slow as well, but just in case, there it is.

-dlc


RE: RE: A grammar for HTML matching
by mcelrath (Novice) on Nov 01, 2000 at 10:24 UTC
    In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

    Well, both. The idea is that I only care about a small part of the total document. I don't want to have to examine all the irrelevant parts of the document just to get to the part I'm interested in. The benefits of this are speed and invariance to document layout. If you know the summary for the book follows <p>Summary you can ignore the rest of the document. I want to respect document structure within the segment I'm interested in, but disregard the rest.
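To make the `<p>Summary` case concrete, here is a minimal sketch in plain Perl. It is not the proposed matcher syntax and not FilterProxy code; `extract_summary` and the sample page are hypothetical, and it assumes the summary paragraph contains no nested `<p>`. It jumps to the marker with `index`, then reads only as far as the next paragraph boundary.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the paragraph that starts with
# "<p>Summary", without examining the structure of the rest of the page.
sub extract_summary {
    my ($html) = @_;

    # Jump straight to the marker with a plain substring search.
    my $marker = '<p>Summary';
    my $start  = index($html, $marker);
    return if $start < 0;

    # Continue matching from just past the marker.
    pos($html) = $start + length $marker;

    # Take everything up to the next <p>, </p>, or end of input.
    # Assumption: the summary itself contains no nested <p>.
    return unless $html =~ /\G(.*?)(?=<\/?p\b|\z)/gcsi;
    my $body = $1;

    $body =~ s/<[^>]+>//g;         # drop inline tags such as <b> or <i>
    $body =~ s/^[\s:]+|\s+$//g;    # trim whitespace and a leading colon
    return $body;
}

my $page = <<'HTML';
<html><body><table><tr><td>navigation, ads, etc.</td></tr></table>
<p>Summary: A <i>short</i> description of the book.</p>
<p>Reviews follow here.</p></body></html>
HTML

print extract_summary($page), "\n";
```

Nothing outside the matched paragraph is inspected, so changes to the navigation or the table layout do not affect the result as long as the `<p>Summary` marker survives.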

    HTML::TreeBuilder is a subclass of HTML::Parser, and while this idea could be implemented using it, the idea is that it doesn't have to be.

    The applications I have in mind for this are:

    1. Ad-filtering by stripping selected portions of HTML for my pet project FilterProxy. As I've developed this, my mechanisms for specifying how to find the piece of HTML containing the ad have gone through many revisions and have been pretty convoluted (but I'm evolving toward the syntax in my original message). This "HTML matcher" idea would be perfect.
    2. Scripts which extract data from web sites without using the entire web page. For instance, ShowTimes, an app for the Palm Pilot, downloads movie theatres, times, and plot summaries for movies from several websites (yahoo.com for movies, imdb for summaries). But every time the site(s) go through minor revisions the script breaks. I'm sure there are others who have a custom script to grab a specific piece of data from a web page. Wouldn't it be convenient to just specify a "matcher" like I've outlined?
      I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout. I don't see what about your approach makes it less likely that a minor revision to a site would break something.

      It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas of how this approach will be implemented, I encourage you to do it. Perhaps I'd understand once I see an actual example or some code.

      -dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...

        I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for

        Oh, HTML::Parser works. It's just painfully slow. I started using HTML::Parser for this application, and the minimum time to traverse just about any document is about 1 second. As the complexity of the matching code grew, it got even slower. HTML::Parser would be a more appropriate solution if it could call start() only for some set of specified tags, or for tags with an attribute that matches some regex. But now I'm getting away from HTML::Parser and starting to specify the grammar I'm interested in. By comparison, by using a regex and then growing the match to include the appropriate tags, I can do these matches in 0.01 seconds (sometimes ;).
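One way to read "use a regex, then grow the match to include the appropriate tags" is sketched below in plain Perl. This is an illustration only, not FilterProxy's implementation: `strip_ad_links`, the ad pattern, and the sample page are hypothetical, and the sketch handles only `<a ...>...</a>` elements and assumes that lowercasing the page does not change character offsets (true for ASCII markup).

```perl
use strict;
use warnings;

# Hypothetical helper: find each hit for $ad_re, grow it outward to the
# enclosing <a ...>...</a> element, and remove that whole element.
sub strip_ad_links {
    my ($html, $ad_re) = @_;
    my $lc  = lc $html;    # lowercase copy for case-insensitive tag searches
    my $out = '';
    my $pos = 0;           # everything before $pos is already copied or removed

    while ($html =~ /$ad_re/g) {
        my ($hit_start, $hit_end) = ($-[0], $+[0]);

        # Grow left to the nearest "<a " and right to the nearest "</a>".
        my $open  = rindex($lc, '<a ', $hit_start);
        my $close = index($lc, '</a>', $hit_end);
        next if $open < $pos || $close < 0;

        # Skip the hit if that anchor was closed before the hit began,
        # i.e. the hit is not actually inside the <a> element.
        next if index($lc, '</a>', $open) < $hit_start;

        # Copy the text before the element, then skip over the element.
        $out .= substr($html, $pos, $open - $pos);
        $pos = $close + length '</a>';
        pos($html) = $pos;
    }
    return $out . substr($html, $pos);
}

my $page = '<p>Hello <a href="/about">about</a> '
         . '<A HREF="http://ads.example.com/click?1"><img src="banner.gif"></A>'
         . ' <a href="/news">news</a></p>';

print strip_ad_links($page, qr/ads\.example\.com/), "\n";
```

Only the neighbourhood of each regex hit is examined, which is where the speed would come from; the price is that nested or malformed anchors are not handled the way a real parser would handle them.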

        Another way to look at this is that writing these HTML matching rules is simply much easier and faster than writing an HTML::Parser script. (Ok, so maybe some of you uber-hackers can whip up HTML::Parser in seconds ;) These matcher scripts are far shorter than the HTML::Parser-based scripts needed to implement the same thing, and don't require knowledge of Perl.

        -bsm, who has looked at tchrist's proxy, and just looked again, and it's faster than I had thought (HTML::Parser based). But man it mangles pages. But speed is only tangentially related to this idea. I didn't want to debate the merits of HTML::Parser, but rather see if this HTML matching idea has merit. Even if it's implemented using HTML::Parser.
