PerlMonks  

Re: Re^2: Cleanning HTML - New/better module for that - test please! ;-P

by thpfft (Chaplain)
on Apr 27, 2003 at 18:59 UTC ( [id://253511] )


in reply to Re^2: Cleanning HTML - New/better module (regexes for html)
in thread Cleanning HTML - New/better module for that - test please! ;-P

It is true, of course, that it would be very difficult to recreate HTML::Parser in pure Perl without using any regexes, though it does not follow that recreating HTML::Parser in pure Perl is a good idea.

It is also true that the factors you describe are orthogonal, but only if you restrict the phrase 'use pattern matching' to its most drily correct application. In more informal usage it is common to talk of 'using regexes' as one way of parsing HTML and 'using the parser' as another, better way. I speak from chastening experience here.

So, to clarify, you are advising the OP to write his own parser in Perl using plenty of regexes, and to restrict himself to only the most exact usage of words and operators? Which doesn't seem very perly, but I'm only a lowly bishop and easily muddled :)

Replies are listed 'Best First'.
Re^4: Cleanning HTML - New/better module (out of hand dismissal?)
by Aristotle (Chancellor) on Apr 27, 2003 at 19:16 UTC

    Whatever your rank is or mine doesn't have anything to do with it.

    I'm not saying anything about any of the OP's points either - yes, he would probably be better off using HTML::Parser. (There are reasons against this too, sometimes. Depends on too many factors to discuss here, I'll just assume you know what I mean.)

    What I was pointing out is that you saw pattern matching and assumed he was 'using regexes' as in common parlance. But pattern matching can (and pretty much has to) be used for a proper parser too, so before you throw out blanket statements like "don't use regexes for parsing HTML" please have a look at what he's actually doing.

    (His parser is defective - there are really three modes in *ML: text, tags, and attribute values. You have to parse the value assigned to an attribute separately from the tag and attribute names, mainly because right angle brackets appearing inside an attribute value don't terminate a tag. gmpassos' code doesn't take this into account.)
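
    To make the three-mode point concrete, here is a minimal sketch. It is neither gmpassos' code nor how HTML::Parser works internally; tokenize_html is a hypothetical helper invented for this illustration. It assumes attribute values are quoted whenever they contain '>', and it ignores comments, CDATA and script content entirely.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: split a string into text and
# tag tokens. Quoted attribute values are matched as a separate mode, so a
# '>' inside them does not end the tag.
sub tokenize_html {
    my ($html) = @_;
    my @tokens;
    pos($html) = 0;
    while (pos($html) < length $html) {
        if ($html =~ /\G([^<]+)/gc) {
            # text mode: everything up to the next '<'
            push @tokens, [ text => $1 ];
        }
        elsif ($html =~ /\G(<(?:[^>"']|"[^"]*"|'[^']*')*>)/gc) {
            # tag mode, with quoted values swallowed whole
            push @tokens, [ tag => $1 ];
        }
        else {
            # a '<' that does not start a complete tag: keep it as text
            $html =~ /\G(.)/gcs;
            push @tokens, [ text => $1 ];
        }
    }
    return @tokens;
}

my @tokens = tokenize_html('<p>see <img src="a.png" alt=">"> here</p>');
print "$_->[0]: $_->[1]\n" for @tokens;
```

    With this input the whole img tag, including alt=">", comes out as a single tag token. A naive pattern such as /<[^>]*>/ would instead stop at the '>' inside the alt value.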

    Makeshifts last the longest.

      I know. Just teasing. But I did look at the original code quite closely, before I decided it was of the kind that merited what you describe as a blanket response. I came to that conclusion partly because the classic failure to deal with <img ...alt=">"> - that you mention - reveals a lack of acquaintance with the debate. For the record, I think gmpassos' code is rather good, as that sort of solution goes; certainly better than anything I managed before it was forcefully put to me that better mechanisms existed already. I should probably have said that, but I didn't really feel entitled to pass judgement on the quality, just the approach.

      btw, I don't know if you've read the discussion I linked to in the first post. It will put this one in useful context.
