http://www.perlmonks.org?node_id=34477

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Some of the comments in a node about a regex problem got me to thinking about the maintainability of regexes, versus alternate solutions. The regex in question, after some patching (with heartfelt thanks to Dermot and others for mega-help), looks like the following:
$data =~ s/
    (                       # Capture to $1
        <a\s                # <a and a space character
        (?:                 # Non-capturing parens
            [^>](?!href)    # All non > not followed by href
        )*                  # zero or more of them
        .?
        href\s*             # href followed by zero or more space characters
    )
    (                       # Capture to $2
        &\#61;\s*           # = plus zero or more spaces
        (                   # Capture to $3
            &[^;]+;         # some HTML character code (probably " or ')
        )?                  # which might not exist
        (?:                 # Non-grouping parens
            .(?!\3)         # any character not followed by $3
        )+                  # one or more of them
        .?
        (?:
            \3              # $3
        )?                  # (which may not exist)
    )
    (                       # Capture to $4
        [^>]+               # Everything up to final >
        >                   # Final >
    )
/$1 . decode_entities($2) . $4/gsexi;
Note that the regex is complicated enough that I've even indented the comments to help some poor programmer behind me maintain it. As it turns out, it still has two very subtle problems (which are irrelevant to this discussion) which arise only under rare circumstances. How would you even find those problems? Heck, if I were really evil, I could put the regex on one line and make the task virtually impossible for the average programmer:
$data =~ s/(<a\s(?:[^>](?!href))*.?href\s*)(&\#61;\s*(&[^;]+;)?(?:.(?!\3))+.?(?:\3)?)([^>]+>)/$1.decode_entities($2).$4/gsei;
When I made the original post, tilly pointed out right away that he wouldn't use a regex to solve the problem (gasp!). That got me to thinking: since I love regex, I tend to employ them a lot. They're fast (if properly written), but many programmers don't grok them. Heck, even some of my simpler regexes are complicated:
$number =~ /((?:[\d]{1,6}\.[\d]{0,5})|(?:[\d]{0,5}\.[\d]{1,6})|(?:[\d]{1,7}))/;
That one just guarantees that a user-entered number fits my format. Aack!

tilly's comment, however, got me to thinking: how do Perlmonks create maintainable regexes, or do they avoid them in favor of more obvious solutions? I pride myself on writing clear, maintainable code with tons of comments. My beloved regexes, however, are the fly in my ointment of clarity. How do YOU deal with this?

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

Replies are listed 'Best First'.
RE (tilly) 1: Regexes vs. Maintainability
by tilly (Archbishop) on Sep 29, 2000 at 15:05 UTC
    OK, here is a sample of how I have tackled problems like this in the past. As I said before, I have done this with small parse engines. I really should learn Parse::RecDescent, but to give you a flavour of what can be done, here is my solution to your original problem.

    Note the inclusion of closing out tags to create a balanced structure. That is impossible to do with a regex, but IMHO is very valuable.

    Also note how the configuration information winds up in a nice data structure. If someone was asked to allow another tag or new attributes, this would be very easy to modify.

    Plus I like making mistakes visible...

    The key to all of this? Regular expressions have their own control of logic flow with backtracking and all. If you want what it provides, they rock. But they don't scale to conceptually harder problems...

    use strict;
    use vars qw($raw @opened %ok_tag %unbal_tag);
    use HTML::Entities qw(encode_entities);

    %ok_tag = (
      p => {},
      br => {},
      a => {
        accesskey => 1,
        charset => 1,
        coords => 1,
        href => 1,
        hreflang => 1,
        name => 1,
        tabindex => 1,
        target => 1,
        type => 1,
      },
      font => {
        color => 1,
        face => 1,
        size => 1,
      },
      h1 => {},
      h2 => {},
      h3 => {},
      h4 => {},
      h5 => {},
      h6 => {},
    );

    %unbal_tag = map {($_, 1)} 'br', 'p';

    # Takes an input string and returns it.  It will leave alone
    # the tags in %ok_tags if they have only allowed attributes,
    # and will escape everything else.  It will also insert
    # needed closing tags for tags not in %unbal_tag.  I don't
    # have to do that, but I felt like it since regular
    # expressions cannot ever solve that problem.
    #
    # Oh, and this probably comes with bugs.  That is what
    # you get for free though! :-)
    {
      my $raw;
      my @opened;

      sub scrub_input {
        $raw = shift;
        my $scrubbed = '';
        @opened = ();
        # Grab a chunk of known OK data
        while ($raw =~ /([\s\w]*)/g) {
          $scrubbed .= $1;
          my $pos = pos($raw);
          # Search for a tag
          if ($raw =~ m=\G<(/?)([\w\d]+)=g) {
            my $is_close = $1;
            my $tag = lc($2);
            if (exists $ok_tag{$tag}) {
              if ($is_close) { # closing tag?
                $scrubbed .= _close_tag($tag);
              }
              else {
                $scrubbed .= _open_tag($tag);
              }
            }
            else {
              # This tag is not allowed
              pos($raw) = 0;
            }
          }
          # Escape if last /g match failed
          unless (pos($raw)) {
            if (length($raw) == $pos) {
              # EXIT HERE #
              return join '', $scrubbed, map "</$_>", reverse @opened;
            }
            else {
              my $char = substr($raw, $pos, 1);
              pos($raw) = $pos + 1;
              $scrubbed .= &encode_entities($char);
            }
          }
        }
      }

      sub _close_tag {
        my $tag = shift;
        # Check a couple of obvious conditions
        unless ($raw =~ /\G>/g) {
          # Oops!
          return '';
        }
        if (exists $unbal_tag{$tag}) {
          return "</$tag>"; # Not needed but...*shrug*
        }
        # OK then, time to figure out which need to be closed
        my @searched;
        while (@opened) {
          my $open_tag = pop(@opened);
          unshift @searched, $open_tag;
          if ($open_tag eq $tag) {
            # Close em!
            return join '', map "</$_>", reverse @searched;
          }
        }
        # Closing a tag that was not opened?  I don't think so!
        @opened = @searched;
        pos($raw) = 0;
        return '';
      }

      sub _open_tag {
        my $tag = shift;
        my $allowed = $ok_tag{$tag};
        my $text = "<$tag";
        while ($raw =~ /\G(?:
            # Attribute or close
            \s+([\w\d]+)=("[^"]*"|[^">\s]+)
            |
            \s*>
          )/gx ) {
          if ($1) {
            if ($allowed->{lc($1)}) {
              $text .= " $1=$2";
            }
            else {
              # Show the bad tag
              pos($raw) = 0;
              return '';
            }
          }
          else {
            push @opened, $tag;
            return "$text>";
          }
        }
        # If I get here, was not well-formed
        pos($raw) = 0;
        return '';
      }
    }
RE: Regexes vs. Maintainability
by japhy (Canon) on Sep 29, 2000 at 00:16 UTC
    Well, I'll say you're going a bit overboard with the whitespace before your regex there... but anyway.

    Break your regex into parts:
    $d5 = qr/\d{0,5}/;
    $d6 = qr/\d{1,6}/;
    $d7 = qr/\d{1,7}/;

    # The /x flag is needed so the spaces inside the pattern are ignored
    $number =~ m{( $d5\.$d6 | $d6\.$d5 | $d7 )}x;
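As a rough illustration of how the named pieces above could be combined for whole-string validation, here is a minimal sketch. The `\A`/`\z` anchors, the `my` declarations, and the `looks_valid` helper name are assumptions added for the example and are not part of the original post.

```perl
use strict;
use warnings;

# Named building blocks, as in the post above.
my $d5 = qr/\d{0,5}/;
my $d6 = qr/\d{1,6}/;
my $d7 = qr/\d{1,7}/;

# Hypothetical helper: returns 1 if the whole string fits the number format.
sub looks_valid {
    my ($number) = @_;
    return $number =~ m{
        \A (?:
            $d5 \. $d6    # up to 5 digits, a dot, 1 to 6 digits
          | $d6 \. $d5    # 1 to 6 digits, a dot, up to 5 digits
          | $d7           # 1 to 7 digits, no dot
        ) \z
    }x ? 1 : 0;
}

for my $candidate ('123.45', '.5', '1234567', '12345678', 'abc') {
    printf "%-10s %s\n", $candidate,
        looks_valid($candidate) ? 'ok' : 'rejected';
}
```

Because each piece has a name, a change to the allowed digit counts touches one `qr//` line rather than the whole pattern.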


    $_="goto+F.print+chop;\n=yhpaj";F1:eval
      Hmm... I had posted this in discussion, since seekers didn't seem appropriate. Ah well.

      I like your version of the second regex. Not sure about the comment regarding going overboard on the whitespace. I prefer the extra whitespace because I feel the indentation adds clarity (especially in the comments, oddly enough). Too many programmers just throw things together with little or no explanation. I err on the side of overcommenting, but my programs are, I think, much easier to understand.

      What don't you like about the whitespace?

      Cheers,
      Ovid


        You were indenting REALLY far in. It just made it seem more trouble than it was worth. If you use tabs in your actual code, you may want to consider changing to two or four spaces instead.

Re: Regexes vs. Maintainability
by 2501 (Pilgrim) on Sep 29, 2000 at 03:51 UTC
    I think I clearly fall into the category of "not wanting to grok it". It LOOKS painful to comprehend your intentions with some of those regex....BUT as a programmer, if I had to maintain your code, I wouldn't mind having the mother of all regex as long as you very specifically commented what the regex is doing. That way, if something broke, I could apply tests around the regex to determine if it is the cause of the problem. I honestly think you can pull off writing complex regex and neat maintainable code as long as you really do comment well.

    Speaking of clear code, I think I prefer the regex in its packed form vs. having the comments in between. It is easier on the mind to read, and since that would force the comments to be in one spot as well, it would make them more clear as well.
    All in all, very impressive!!
Re: Regexes vs. Maintainability
by hil (Sexton) on Sep 29, 2000 at 05:42 UTC

    This is purely anecdotal, of course, but regexes are just plain fun - to write, anyway. Supporting other people's can be a major source of annoyance. How many of them are really understandable to the average coder?

    Personally, I prefer a general comment and then comments on specific parts only if the expression is particularly intricate (as in the example above). In any case, I would be interested to see how many people really bother to comment their regexes beyond "parses $var" or some such thing. I'm sure there are plenty of horror stories floating around.

      I tend to comment regexps like this:

      $path=~/^\s*([\w-]+)    # elt          $1
              \s*\[\s*\@      # [@
              ([\w-]+)        # att          $2
              \s*=\s*(["'])   # = " (or ')   $3
              (.*?)           # value        $4
              \3\]\s*$/gx)    # "] (or ')

      It works very well for moderately complex regexps.
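To make that commenting style concrete, here is a small runnable sketch built around the same pattern. The `parse_step` helper name, the returned hash keys, and the sample inputs are assumptions made for illustration. The `/g` flag is left out because only a single match is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: splits a step such as  elt[@att="value"]  into its parts.
# Returns a hash reference on success and nothing on failure.
sub parse_step {
    my ($path) = @_;
    if ($path =~ /^\s*([\w-]+)      # element name (capture 1)
                   \s*\[\s*\@       # opening bracket and at-sign
                   ([\w-]+)         # attribute name (capture 2)
                   \s*=\s*(["'])    # equals sign, opening quote (capture 3)
                   (.*?)            # attribute value (capture 4)
                   \3\]\s*$         # the same quote again, closing bracket
                 /x) {
        return { elt => $1, att => $2, value => $4 };
    }
    return;
}

for my $step (q{item[@id="42"]}, q{title[@lang='en']}, q{item[@id="42']}) {
    my $parsed = parse_step($step);
    if ($parsed) {
        printf "%s: elt=%s att=%s value=%s\n",
            $step, $parsed->{elt}, $parsed->{att}, $parsed->{value};
    }
    else {
        print "$step: no match\n";
    }
}
```

The comments avoid `$1`-style variable names and slashes on purpose: Perl interpolates variables inside a regex before it strips `/x` comments, and a `/` inside a comment would end a `/.../` pattern early.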

RE: Regexes vs. Maintainability
by Jonathan (Curate) on Sep 29, 2000 at 17:29 UTC
    I'm not convinced that your regexes pose a maintenance problem. My background is very much the Unix/Shell/Awk/C world (still love them all). IMHO the one thing they all have in common is a basic philosophy of maximum power from minimum keystrokes. To my mind Perl belongs to this mindset. What you are doing is complicated. So what? If it's well commented, it's supportable. Trying to avoid your regexes creates its own problems: much more code, for a start. Anyway, debugging regexes isn't usually too terrible, as you can easily pull them out and run them in a test harness. Apologies for the speeling. Friday afternoon just back from the Pub. Hic.
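One way to pull a regex out and exercise it in a test harness is sketched below with the core Test::More module, using the number-format regex from the original question. The `\A`/`\z` anchors, the `$number_re` name, and the sample inputs are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Test::More;

# The number-format regex from the original question, anchored here
# (an assumption) so that the whole input has to match.
my $number_re = qr/\A(?:\d{1,6}\.\d{0,5}|\d{0,5}\.\d{1,6}|\d{1,7})\z/;

# Sample inputs, picked by hand to probe the edges of each alternative.
my @should_match = ('1', '1234567', '123456.', '.123456', '12.5');
my @should_fail  = ('', '.', '12345678', '1234567.1', '1.2.3', 'abc');

like(   $_, $number_re, "accepts '$_'" ) for @should_match;
unlike( $_, $number_re, "rejects '$_'" ) for @should_fail;

done_testing();
```

When a subtle bug turns up later, the failing input can be added to one of the two lists so it stays covered.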


    "We are all prompted by the same motives, all deceived by the same fallacies, all animated by hope, obstructed by danger, entangled by desire, and seduced by pleasure." - Samuel Johnson