
Re: How to use Regular Expressions with HTML

by Anonymous Monk
on Aug 16, 2003 at 12:50 UTC ( #284328=note )

in reply to How to use Regular Expressions with HTML

As we all know, the canonical example of what not to do with regular expressions is to parse HTML.

It always bugs me when I see people say this. It's one of those self-defeating generalizations that just confuse things, because people observe that, taken literally, it often isn't true.

If I have a static piece of HTML, especially one that is machine-generated and/or simply structured, I can easily munge it and extract what I need with a regex or two and a bit of logic. This will take far less time than using HTML::Parser or HTML::TokeParser or HTML::TreeBuilder or (your tokenizer here).
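A minimal sketch of that kind of one-off extraction, assuming a hypothetical machine-generated report in which every data row looks exactly like `<tr><td>name</td><td>number</td></tr>`. The sample markup and the `extract_rows` name are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical machine-generated report. The markup is assumed to be
# exactly this regular, which is the only reason a regex is safe here.
my $html = <<'HTML';
<table>
<tr><td>alpha</td><td>10</td></tr>
<tr><td>beta</td><td>25</td></tr>
</table>
HTML

# Collect name => value pairs from each two-cell row.
sub extract_rows {
    my ($text) = @_;
    my %value_for;
    while ( $text =~ m{<tr><td>([^<]*)</td><td>(\d+)</td></tr>}g ) {
        $value_for{$1} = $2;
    }
    return \%value_for;
}

my $rows = extract_rows($html);
print "$_ = $rows->{$_}\n" for sort keys %$rows;
```

The pattern leans entirely on the generator never changing its output; a stray attribute or line break inside a row would make it silently skip that row.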

On the other hand, it is very difficult to parse any arbitrary page using the same approach. In fact, it is usually trivial to reverse-engineer a regex-based parser and construct an HTML snippet that will break it.
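To illustrate how little it takes, here is a sketch of a naive tag stripper and one input that defeats it. The `strip_tags` helper is hypothetical and deliberately simplistic; it assumes a `>` never appears inside a tag, which HTML does not guarantee:

```perl
use strict;
use warnings;

# A naive tag stripper: assumes no '>' ever appears inside a tag.
sub strip_tags {
    my ($html) = @_;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return $text;
}

# Works on simple markup: prints "bold text".
print strip_tags('<b>bold</b> text'), "\n";

# A '>' inside a quoted attribute value ends the match early,
# so part of the tag leaks into the output.
print strip_tags('<img alt="a > b" src="x.png">after'), "\n";
```

The second call leaves the tail of the `img` tag in the result instead of just `after`, because the regex stops at the first `>` it sees.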

Anyway, my point is that parsing arbitrary HTML is hard to do with regexes; however, on occasion a regex can be just the thing you need to rip the essential data out of some specific web page or HTML report. If you are only going to run the extractor once, then sometimes proper parsing is just too big a hammer to get out of the box. Accordingly, I'd prefer to see that line rephrased.


Re: Re: How to use Regular Expressions with HTML
by Ovid (Cardinal) on Aug 16, 2003 at 17:02 UTC

    Here are a few other generalizations:

    • Use strict.
    • Don't reinvent the wheel.
    • Don't use goto.
    • Don't optimize up front.
    • OO modules shouldn't export anything.

    Those are all great ideas and Perl programmers would be better off if they lived by them. That being said, I've broken every one of those rules and will happily do so in the future, if need be. The important thing is that I understand the reasoning behind those things and try to live by them.

    From what I can see from your post, you have the same opinion about HTML that I do, but you spent a lot of time qualifying it. I have that sort of attitude regarding my above list of generalizations, but I'd never get a single post finished if I was forced to make all of those qualifications. I toss out the generalizations first and then list exceptions only if needed.

    In short, I'm not arguing with you, but for most situations that I encounter, whipping out regular expressions for HTML is a bad idea and encouraging programmers to follow that practice would be an even worse idea.


    New address of my CGI Course.

      I finished reading Christopher Alexander's The Timeless Way of Building today. Alexander writes about architecture, but his ideas, such as patterns, have been adopted by software developers.

      Your list above (use strict, don't reinvent the wheel, etc.) is basically a list of Perl patterns, practices that should exist in well written programs.

      Although we learn good habits by following rules, we ultimately derive those rules from observing what we find good. Patterns, or best practice, summarise our experiences and allow us to share them with others.

      In his last chapter, Alexander notes that another place can be without the patterns which apply to it, and yet still be alive: we should follow the spirit of the rules we lay down, not the letter. So paradoxically you learn that you can only make a building live when you are free enough to reject even the very patterns which are helping you once you understand the patterns well.

      Tim Bray uses Perl's regular expressions to parse XML and you use regexps to parse HTML. I don't anticipate doing either any time soon, because the general problems I encounter fit the solution of using existing CPAN modules, and because I don't consider myself knowledgeable enough about such things to break the rules yet.

      One should only break the rules when one understands why the rule exists.

      Like Ovid, I have broken every one of those rules, and more. (Personally, I love playing with soft references in production code, but I'm masochistic.) But I will follow those rules in 99.9% of my code. The point is that most programmers shouldn't parse HTML with regexes most of the time. Heck, most shouldn't do it at all. And if you do it, it should be modularized, packaged, and then never messed with again. :-)

      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.
