Picking the best way....
by tinman (Curate) on May 05, 2001 at 01:59 UTC
Fellow monks, this is a meditation on something that I've noticed many people ask for, and do, when they have to solve a specific text processing task using Perl: they use regexes.
Now, many of you may be wondering, what is *his* problem? Regexes are great, right? Sure they are... but perhaps the statement I made above should be amended to read, "they use regexes without a thought for any other, easier-to-use options available to them"... and this is where this meditation (some would be justified in calling it a rant) starts.
Consider the mantra at Perlmonks: use CGI, use CGI, use CGI. Don't ever think about rolling your own code for parsing the input parameters. The reasons behind that advice apply equally to any number of other tasks; the one I wish to address is parsing/munging/extracting elements from HTML.
Just today, I saw someone ask for a regex to extract HREF blocks from an HTML file, and I wondered, why? Is it necessary to use a regex for something that can just as easily be handed off to a module built for the task? The answer is, of course, an emphatic NO!
Consider a recent node about why it's not acceptable to avoid the use of CGI.pm. Can't the same be said for this task? Of course it can. So, my new mantra for any/most who ask for a quickie regex is: use a module, use a module, use a module.
CPAN is packed with modules for parsing HTML, my favourite being HTML::TokeParser. Some others that are definitely worth looking at include HTML::Parser and, as mentioned here, HTML::Filter. Any or all of those modules can be used directly for token recognition and munging of HTML in general, and they *can* have significant advantages over a first-pass regex written by an average Perl user, i.e. they're pretty fast, they're less error prone, they catch the edge cases that most regex authors don't immediately think of handling, and for the most part these modules have been "eyeballed" by countless others, so your efforts have already been partially validated... not so with a regex.
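To make the HREF example concrete, here is a minimal sketch of how it might look with HTML::TokeParser. The helper name `extract_links` and the sample HTML are my own inventions for illustration, and the script assumes the module is installed from CPAN. Treat it as a starting point rather than the one true way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # from CPAN; not part of core Perl

# Hypothetical helper: returns a list of [href, link text] pairs
# found in the HTML string passed in.
sub extract_links {
    my ($html) = @_;

    # A reference to a scalar is treated as the literal document.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    my @links;

    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # an array ref: [tag name, \%attributes, \@attribute order, raw text].
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;    # e.g. <a name="..."> anchors

        # Collect the text up to the closing </a>, ignoring nested tags.
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

# Made-up sample with cases a quickie regex often trips over:
# uppercase tag and attribute names, single quotes, attributes
# before href, nested markup, and a commented-out link.
my $sample = <<'END_HTML';
<p>See <A HREF='one.html'>first</A> and
<a class="x" href="two.html">the <b>second</b> link</a>.</p>
<a name="top">no href here</a>
<!-- <a href="old.html">commented out</a> -->
END_HTML

printf "%-12s %s\n", @$_ for extract_links($sample);
```

The parser lowercases tag and attribute names and treats the comment as a comment, so none of those cases needs special handling in the loop.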
So, the next time you think of doing something complicated with HTML munging, head over to CPAN and take a look around... then (if you must) think about rolling up your sleeves and writing a regex. The time spent at CPAN is time well spent.