Regexes vs. Maintainability

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Some of the comments in a node about a regex problem got me thinking about the maintainability of regexes versus alternative solutions. The regex in question, after some patching (with heartfelt thanks to Dermot and others for mega-help), looks like the following:

        $data =~ s/
                    (                     # Capture to $1
                        <a\s              #     <a and a space character
                        (?:               #     Non-capturing parens
                            [^>](?!href)  #         All non > not followed by href
                        )*                #         zero or more of them
                        .?
                        href\s*           #     href followed by zero or more space characters
                    ) 
                    (                     # Capture to $2
                        &\#61;\s*         #     = plus zero or more spaces
                        (                 # Capture to $3
                            &[^;]+;       #     some HTML character code (probably " or ')
                        )?                #     which might not exist
                        (?:               #     Non-capturing parens
                            .(?!\3)       #     any character not followed by $3
                        )+                #     one or more of them
                        .?
                        (?:
                            \3            #     $3 
                        )?                #     (which may not exist)
                   )
                   (                      # Capture to $4
                        [^>]+             #     Everything up to final >
                        >                 #     Final >
                   )
                 /$1 . decode_entities($2) . $4/gsexi;

Note that the regex is complicated enough that I've even indented the comments to help some poor programmer behind me maintain it. As it turns out, it still has two very subtle problems (irrelevant to this discussion) that arise only under rare circumstances. How would you even find those problems? Heck, if I were really evil, I could put the regex on one line and make the task virtually impossible for the average programmer:

$data =~ s/(<a\s(?:[^>](?!href))*.?href\s*)(&\#61;\s*(&[^;]+;)?(?:.(?!\3))+.?(?:\3)?)([^>]+>)/$1.decode_entities($2).$4/gsei;
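One way to make a pattern like this easier to probe for subtle problems is to build it from small, named qr// pieces that can each be tried on their own. What follows is only a sketch, not the approach used above: the variable names and the `split_encoded_href` helper are hypothetical, the named capture `quote` stands in for the numbered backreference \3 (which assumes Perl 5.10 or later), and the substitution with decode_entities is left out so the sketch needs no modules.

```perl
use strict;
use warnings;

# Each piece is compiled and commented on its own.
my $tag_start = qr{
    <a\s                    # <a and a whitespace character
    (?: [^>](?!href) )*     # non-> characters not followed by href
    .?
    href\s*                 # href plus optional whitespace
}xsi;

my $encoded_value = qr{
    &\#61;\s*               # entity-encoded = plus optional whitespace
    (?<quote> &[^;]+; )?    # optional entity-encoded quote character
    (?: .(?!\k<quote>) )+   # characters not followed by that quote
    .?
    (?: \k<quote> )?        # closing quote, if there was an opening one
}xsi;

my $tag_rest = qr{ [^>]+ > }x;   # everything up to and including the final >

# Group numbers line up with the original: $1, $2, ($3 is the quote), $4.
my $encoded_href = qr{ ($tag_start) ($encoded_value) ($tag_rest) }x;

# Hypothetical helper: returns the three pieces the substitution would use,
# or an empty list when the pattern does not match.
sub split_encoded_href {
    my ($html) = @_;
    return $html =~ $encoded_href ? ($1, $2, $4) : ();
}

my @parts = split_encoded_href(
    '<a href&#61;&quot;http://example.com/&quot; class="x">'
);
print join(" | ", @parts), "\n";
```

With the pieces separated, a rare failing input can be run against `$tag_start` or `$encoded_value` alone, which narrows down where the pattern goes wrong.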

When I made the original post, tilly pointed out right away that he wouldn't use a regex to solve the problem (gasp!). That got me thinking: since I love regexes, I tend to use them a lot. They're fast (if properly written), but many programmers don't grok them. Heck, even some of my simpler regexes are complicated:

$number =~ /((?:[\d]{1,6}\.[\d]{0,5})|(?:[\d]{0,5}\.[\d]{1,6})|(?:[\d]{1,7}))/;

That one just guarantees that a user-entered number fits my format. Aack!
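For comparison, here is the same number pattern spread out under /x, with one comment per alternative. This is a minimal sketch that assumes the three alternatives are meant exactly as written; `extract_number` is a hypothetical helper name, not part of the original code. Like the one-liner, the pattern is unanchored, so it captures the first substring that fits the format.

```perl
use strict;
use warnings;

# Same three alternatives as the one-line pattern, spelled out with /x.
my $number_re = qr/
    (
        (?: \d{1,6} \. \d{0,5} )   # 1 to 6 digits, a dot, up to 5 decimals
      |
        (?: \d{0,5} \. \d{1,6} )   # up to 5 digits, a dot, 1 to 6 decimals
      |
        (?: \d{1,7} )              # 1 to 7 digits, no dot
    )
/x;

# Hypothetical helper: returns the captured number, or undef if nothing fits.
sub extract_number {
    my ($input) = @_;
    return $input =~ $number_re ? $1 : undef;
}

for my $input ('123.45', '.5', '42', 'abc') {
    my $found = extract_number($input);
    printf "%-8s => %s\n", $input, defined $found ? $found : '(no match)';
}
```

Keeping the pattern in a named variable also means a small table of sample inputs can sit next to it as a regression check.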

tilly's comment, however, got me thinking: how do Perlmonks create maintainable regexes, or do they avoid them in favor of more obvious solutions? I pride myself on writing clear, maintainable code with tons of comments. My beloved regexes, however, are the fly in my ointment of clarity. How do YOU deal with this?

Cheers,
Ovid

