http://www.perlmonks.org?node_id=89547

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Since some may not have access to Jeffery Friedl's "Mastering Regular Expressions" or can't afford to buy it (i.e., casualties of the dot com bust), could someone EXPLAIN, in plain English ("talk to me like I'm a five year old"), WHY one would use the pattern:
delimiter normal* (?:special normal*)* delimiter
when dealing with multiple level quantifiers in regular expressions. There are enough references to and examples of this technique on this site but no write-ups explaining how it effects the PERL regex engine. Thank you in advance.

Replies are listed 'Best First'.
Re: Unrolling the loop technique
by mirod (Canon) on Jun 19, 2001 at 13:23 UTC

    OK, I'll try my best at explaining the "unrolling the loop" mechanism (I don't have MRE at hand, so feel free to correct me if I am blatantly wrong!):

    I use this technique in 2 cases:

    • When I want to match a string between 2 1-char delimiters, but the end delimiter can be included in the string through an escape mechanism. For example a double-quoted string that can include the double-quote if backslashed. In this case the "naive" m/"[^"]*"/ would not work.
    • When I want to match a string between 2 delimiters but the end delimiter is several characters long, such as in C comments (/*comment*/). In this case /*(.*?)*/ would work but it might be slow (I think, I have not benchmarked it).

    In the first case here is how you want to match:

    • the start delimiter,
    • then an "easy" string (the normal* part): anything that does not include the escape character,
    • if you find the escape character then you want to skip the next character (the special part), whatever it is, and then match the following "easy" string (the second normal*), until you get to the next escape character or the end delimiter,
    • finally you match the end delimiter

    Now if you want to match a string with a multi-character end delimiter here is how to do it:

    • match the start delimiter,
    • match anything until you get to the first character of the end delimiter,
    • if that character is not followed by the rest of the end delimiter then you can keep on matching "regular" characters until the next time you find that character,
    • match the end delimiter

    A potential pitfall is that you want to make sure you don't consume the characters just after the first character of the end delimiter, or things like **/ (the first character of the end delimiter is there twice in a row, once as a regular character and once as the start of the end delimiter) would not be processed properly.

    I guess a couple of examples might be appropriate.

    First matching a double-quoted string, double quotes can be escaped using \":

    #!/bin/perl -w use strict; while( <DATA>) { next if(/^\s*#/); # skip comments in DATA chomp; # split data into string to match and expected result(s) my( $string, @expected)= split /\s*=>\s*/; while( $string=~ m{" # the start deli +miter ([^\\"]* # anything but t +he end of the string or the escape char (?:\\. # the escape + char preceeding an escaped char (any char) [^\\"]* # anything b +ut the end of the string or the escape char )*) # repeat "}gx) # the end delimi +ter { my $match= $1; my $expected= shift @expected; unless( $match eq $expected) { print "unexpected result line $.: found /$match/, expectin +g /$expected/\n"; } } } __DATA__ # string to match => expected results(s) toto "a string" tata => a string toto "a string" => a string toto "a string" tata => a string toto "a \" string" tata => a \" string toto "\" string" tata => \" string toto "a\"" tata => a\" toto "\"" tata => \" toto "\"\"" tata => \"\" toto "string 1" "string 2" tata => string 1 => string 2 toto "string 1 => toto "string 1" "string => string 1 toto "tata\\" tutu => tata\\ toto "tata\\\"" tutu => tata\\\"

    And now how to match C-like comments:

    #!/bin/perl -w use strict; while( <DATA>) { chomp; next if(^\s*#); # skip comments # split the data into the string to match and the expected result( +s) my( $string, @expected)= split /\s*=>\s*/; while( $string=~ m{/\* # the delimite +r ([^*]* # anything but + the beginning of the delimiter (?:\*(?!>/) # the beginn +ing of the delimiter, not preceeding the rest of the delimiter # (?!>/) +means "not before /, do not use the next char) [^/]* # anything b +ut the beginning of the delimiter )*) # repeat \*/}gx) # the end of t +he delimiter { my $match= $1; my $expected= shift @expected || ''; unless( $match eq $expected) { print "unexpected result line $.: found /$match/, expectin +g /$expected/\n"; } } } __DATA__ # string => result(s) toto => toto /*foo*/ tata => foo /*foo*/ tata => foo toto /*foo*/ => foo toto /*foo*bar*/ => foo*bar toto /*foo**/ => foo* toto /**/ => /***/ => * /*/*/ => /
      In this case /*(.*?)*/ would work but it might be slow (I think, I have not benchmarked it).

      I would strongly prefer m#/\*(.*?)\*/#s over the unrolled loop version if that is the whole regex (and I'd try to make that the whole regex precisely because I could then avoid using the unrolled regex).

      The real problem with this simple technique comes when you try to use it as part of a larger regex. For example, let's say you want to extract "comment blocks", that is, a C-style comment that starts at the beginning of a line and ends at the end of a line. Using m#^/\*(.*?)\*/$#msg sure seems an easy way, and it even works for a lot of cases. However, consider this unlikely sample input:

      /* This is correctly matched */ /* This: */ runcode(); /* gets included in the "comment" */
      which would return this list:
      ( " This is correctly matched ", ' This: */ runcode(); /* gets included in the "comment" ' )
      You see that .*? matches as little as possible but will prefer to match more if matching more will allow the entire regex to match (or to match earlier) when less causes the regex as a whole to fail (or to match later).

      If I find myself wanting to use the loop unrolling technique, then I usually try to rework the problem by parsing in smaller chunks. Though, if these chunks start getting really small (like my parser starts having to deal with single characters in lots of cases), then I may use some of the simplest examples of unrolled regex loops.

              - tye (but my friends call me "Tye")
Re: Unrolling the loop technique
by mugwumpjism (Hermit) on Jun 19, 2001 at 13:27 UTC

    Presumably, because they want to match the strings:

    delimiter norma delimiter delimiter normallll special norma delimiter delimiter normall special normalllspecial norma delimiter delimiter normallllll special normal delimiter delimiter normal special normal delimiter

    Without $1 being set to anything (that's what the "?:" after the opening bracket does).

    Seriously, give us one or two of the actual regular expressions that are confusing you. The way you have it written, it's exactly equivalent to:

    delimiter (?:special|normal)* delimiter

    Assuming that each word is supposed to be a regex atom.

    Update: Re-read the question :-} The reason why you'd want to do this would be if "special" is a pattern that matches the escaped delimiter and an escaped escape pattern, so that you can include the delimeters in the data.

    Update: Someone just suggested that you'd want to unroll it for speed's sake... I recommend doing some speed profiling and seeing for yourself the kind of difference it makes... then deciding whether obfuscating your regular expressions is worth that speed increase. Hint: regular expressions are first compiled to an internal "deterministic acceptor", so it makes very little difference.

      There may be much more than a little difference. Consider the output from the following program when run under perl5.005:
      #!perl -wl use strict; my $good = <<EOT; a quote " with some text after it and then another quote " EOT my $bad = <<EOT; a quote " with some text after it but without another quote EOT my $obvious = qr/"(?:\\.|[^"\\]+)*"/; my $unrolled = qr/"[^"\\]*(?:\\.[^"\\]*)*"/; $| = 1; print "\$good =~ \$unrolled"; print $good =~ $unrolled; print "\$good =~ \$obvious"; print $good =~ $obvious; print "\$bad =~ \$unrolled"; print $bad =~ $unrolled; print "\$bad =~ \$obvious"; print $bad =~ $obvious; print "done";
      The problem should become clear by the time it finishes. ;)

      Jeffrey Friedl goes into much more detail in Mastering Regular Expressions, which is where I grabbed the regexes I used above.

      However, if you run this program under perl5.6, you won't have as much time to figure out the issue, because there are improvements to the regex engine in that version which fix the problem!

Re: Unrolling the loop technique
by Anonymous Monk on Jun 20, 2001 at 00:15 UTC
    Thank you for your responses. They're very helpful, but I don't think I was clear enough in my initial post. My apologies. What I really would like is an explanations of the "mechanics" of the PERL regex engine in the "unrolling the loop" technique. Why is it more efficient (i.e., no or least amount of backtracking, no infinite or nearly infinite loops, etc.) in the problem situations that some of you have mentioned or in the situation that led Jeffery Friedl to invent it in the first place. Also, this would help me and possibly others understand when its less efficient as respondent Tye has indicated.

    Basically, I'm after an understanding of the ghostly depths of what going on at the engine level between the surface dazzle of completing expressions. The fundamentals.

    Thanks again.