PerlMonks  

Picking the best way....

by tinman (Curate)
on May 05, 2001 at 01:59 UTC (#78138=perlmeditation)

Fellow monks, this is a meditation on something that I've noticed many people ask for, and do, when they have to solve a specific text processing task using Perl...
They use regexes!

Now, many of you may be wondering: what is *his* problem? Regexes are great, right? Sure they are... but perhaps the statement I made above should be amended to read, "they use regexes without a thought for any other, easier-to-use options available to them"... and this is where this meditation (some would be justified in calling it a rant) starts....

Consider the mantra at PerlMonks: use CGI, use CGI, use CGI... don't ever think about rolling your own code for parsing the input parameters. The reasons behind that advice apply equally to any number of other tasks... specifically, the one I wish to address is that of parsing/munging/extracting elements from HTML...

Just today, I saw someone ask for a regex to extract HREF blocks from an HTML file... and I wondered, why? Is it necessary to use a regex for something that can just as easily be handed off to a module built for the task? The answer is, of course, an emphatic NO!

Consider a recent node about why it's not acceptable to avoid the use of CGI.pm... can't the same be said for this task? Of course it can... so, my new mantra for any/most who ask for a quickie regex is: use a module, use a module, use a module.

CPAN is packed with modules for parsing HTML, my favourite being HTML::TokeParser... some others that are definitely worth looking at include HTML::Parser and, as mentioned here, HTML::Filter. Any or all of those modules can be used directly for token recognition and munging of HTML in general, and they *can* have significant advantages over a first-pass regex written by an average Perl user, i.e. they're pretty fast, they're less error prone, they catch the edge cases that most regex authors don't immediately think of handling, and for the most part these modules have been "eyeballed" by countless others, so your efforts have already been partially validated... not so with a regex...
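As a rough illustration of the point, here is a minimal sketch of pulling the HREF values out of a page with HTML::TokeParser. The sub name `extract_links` and the sample markup are made up for this example, and it assumes HTML::TokeParser (part of the HTML::Parser distribution on CPAN) is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the href value of every <a> tag in an HTML string.
sub extract_links {
    my ($html) = @_;

    # Passing a reference to a scalar makes the parser read from the string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create a parser";

    my @links;

    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # an array ref whose second element is a hash of the tag's attributes.
    while (my $tag = $parser->get_tag('a')) {
        my $attr = $tag->[1];
        push @links, $attr->{href} if defined $attr->{href};
    }
    return @links;
}

# Made-up sample with cases a quick regex tends to trip over:
# upper-case tag names, single quotes, a tag split across lines,
# a commented-out link, and an anchor with no href at all.
my $sample = <<'HTML';
<p><A HREF="one.html">one</A>
<a
   class='x' href='two.html'>two</a>
<!-- <a href="hidden.html">commented out</a> -->
<a name="anchor">no link here</a></p>
HTML

print "$_\n" for extract_links($sample);
```

The parser reports tag and attribute names in lower case and treats the comment as a comment rather than as markup, so none of those cases need special handling in the calling code.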

So, when next you think of doing something complicated with HTML munging, head over to CPAN and take a look around there... then (if you must) think about rolling up your sleeves and writing a regex.. The time spent at CPAN is time well spent..

feels much better after letting that off his chest.. thanks for reading..

Re: Picking the best way....
by Masem (Monsignor) on May 05, 2001 at 02:06 UTC
    I completely agree: code reuse is very, very, very good. The problem with a lot of new posters is that those who aren't running as root on the machine in question drop the idea as soon as "module" is mentioned, because "I can't install modules". I tried to start writing a Tutorial on how to install modules in a number of situations (root or non-root, unix/win/mac), but I don't have enough access to some of the cases that I think would be important to cover. I still think such a tutorial is very much needed to improve the use of modules.
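For the non-root case, one common approach is to install modules into a directory you own and point the script at it with the `lib` pragma. The sketch below covers only the second half, and the `~/perl5/lib/perl5` location is an assumption for illustration (it is where ExtUtils::MakeMaker's `INSTALL_BASE=~/perl5` would put pure-Perl modules); adjust it to wherever the modules actually ended up.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical private module directory under the user's home directory.
# The value must be computed in a BEGIN block because "use lib" runs at
# compile time, before ordinary assignments are executed.
my $private_lib;
BEGIN {
    my $home = defined $ENV{HOME} ? $ENV{HOME} : '.';
    $private_lib = File::Spec->catdir($home, 'perl5', 'lib', 'perl5');
}

# Put the private directory at the front of @INC, so modules installed
# there are found by later "use" and "require" statements.
use lib $private_lib;

print "Module search path now starts with: $INC[0]\n";
```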


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
(tye)Re: Picking the best way....
by tye (Cardinal) on May 05, 2001 at 02:55 UTC

    This has come up before. Not all modules are as well written nor as essential as CGI.pm.

    As for dealing with XML or HTML using regular expressions... I did that just the other day. I had a well-defined set of HTML to deal with and finding the right HTML-parsing module would have probably taken more time than rolling my own regexen did (and then I'd have to learn how to use that module and then apply that to the problem at hand).

    If you are going to end up dealing with not-previously known XML/HTML, then I strongly recommend a module. Unfortunately, the module landscape in that area is still a bit rocky and undermapped. Several modules to choose from, most of which have some problems at least in some situations.

    Also take a look at Why I like functional programming for another example of not using a module to parse HTML to excellent effect. It is one of many cases that remind me that we often deal with something that is nearly HTML or XML, which can make all the modules useless.

    I'm not disagreeing with your recommendation to try to use modules. That is an excellent idea. I'm just advocating moderation. (:

            - tye (but my friends call me "Tye")

      I am not disagreeing with your recommendation to use moderation. That is an excellent idea. I am just advocating _extreme_caution ;--)

      Especially when dealing with XML, which is a deceptively simple format.

      You can certainly use regexps to write a throw-away hack, which is going to be used only once, on very well known XML data, ideally generated by code you have also written yourself. That's about it! And it doesn't happen that often.

      Using regexps on anything else means that sooner or later you will come across something that's completely legal XML, but that completely breaks your code. And believe me, if it is legal XML (and most likely even if it is not) it is bound to pop up in your data. You can have a look at On XML Parsing for just a quick list of what can go wrong.
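To make that concrete, here is a small sketch of a quick regex meeting legal XML. The `item` element, the `id` attribute and the sub name `naive_item_ids` are invented for the example; the point is only that every line of the sample is well-formed XML.

```perl
use strict;
use warnings;

# Naive pattern: assumes double quotes, exactly one space, and that
# anything looking like a tag really is a tag.
sub naive_item_ids {
    my ($xml) = @_;
    return $xml =~ /<item id="(\d+)">/g;
}

my $xml = <<'XML';
<list>
  <item id="1">plain</item>
  <item id='2'>single quotes are legal</item>
  <item
      id = "3" >extra whitespace and newlines are legal</item>
  <!-- <item id="4">commented out, so not an element</item> -->
  <note><![CDATA[<item id="5">literal text, not markup</item>]]></note>
</list>
XML

my @ids = naive_item_ids($xml);
print "naive regex found: @ids\n";   # prints "naive regex found: 1 4 5"
```

The document contains three real `item` elements (ids 1, 2 and 3). The regex misses two of them and reports two that do not exist, one inside a comment and one inside a CDATA section.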

      A last word: if you are dealing with something that is nearly (...) XML, do yourself a favor and use 2 steps: first get from the nearly-thingie to the real stuff, and then use an XML module. It would be even better if you could refuse the data altogether because it is not valid!
