Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
when dealing with multiple-level quantifiers in regular expressions. There are enough references to and examples of this technique on this site, but no write-ups explaining how it affects the Perl regex engine. Thank you in advance.

delimiter normal* (?:special normal*)* delimiter
Replies are listed 'Best First'.
Re: Unrolling the loop technique
by mirod (Canon) on Jun 19, 2001 at 13:23 UTC
OK, I'll try my best to explain the "unrolling the loop" mechanism (I don't have MRE at hand, so feel free to correct me if I am blatantly wrong!).

I use this technique in 2 cases:

In the first case, here is how you want to match:

Now, if you want to match a string with a multi-character end delimiter, here is how to do it:

A potential pitfall: you must make sure you don't consume the characters just after the first character of the end delimiter, or things like **/ (where the first character of the end delimiter appears twice in a row, once as a regular character and once as the start of the end delimiter) will not be processed properly.

I guess a couple of examples might be appropriate. First, matching a double-quoted string, where double quotes can be escaped using \":
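A minimal sketch of the double-quoted string case, filling the general pattern delimiter normal* (?:special normal*)* delimiter. The mapping of "normal" and "special" in the comments, the variable names, and the sample text are assumptions made for illustration, not the code from the original post.

```perl
use strict;
use warnings;

# Unrolled form of: delimiter normal* (?:special normal*)* delimiter
#   delimiter = "         (the double quote)
#   normal    = [^"\\]    (anything except a quote or a backslash)
#   special   = \\.       (a backslash followed by any one character)
# The /s flag lets an escaped newline count as "any one character".
my $dq_string = qr/"([^"\\]*(?:\\.[^"\\]*)*)"/s;

# Hypothetical sample text containing one string with escaped quotes.
my $text = 'say "hello \"world\"" and "bye"';

# In list context with /g, each match contributes its captured body.
my @found = $text =~ /$dq_string/g;
print "$_\n" for @found;
# prints:
#   hello \"world\"
#   bye
```

Because "normal" and "special" can never start with the same character, the engine has only one way to split the string body between them, which is what keeps backtracking cheap when the match fails.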
And now how to match C-like comments:
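A sketch of the C-like comment case, using the commonly cited unrolled form. The pattern and sample code below are assumptions for illustration and may differ from what the original post showed. Note how \*+ absorbs every run of stars, so that **/ is handled as described above.

```perl
use strict;
use warnings;

# /\*                  opening delimiter
# [^*]*                normal characters (anything but a star)
# \*+                  a run of stars: possibly the start of the closing */
# (?:[^/*][^*]*\*+)*   the run was not followed by /, so keep going
# /                    the slash that completes the closing delimiter
my $c_comment = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/};

# Hypothetical input: one comment ending in **/ and one spanning two lines.
my $code = "a = 1; /* one **/ b = 2; /* two\n * lines */ c = a*b;";

my @comments = $code =~ /($c_comment)/g;
print "[$_]\n" for @comments;
```

With this input the sketch should find exactly two comments: "/* one **/" and the two-line comment. The lone star in a*b is never treated as part of a comment.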
by tye (Sage) on Jun 19, 2001 at 23:49 UTC
In this case /*(.*?)*/ would work but it might be slow (I think, I have not benchmarked it).

I would strongly prefer m#/\*(.*?)\*/#s over the unrolled loop version if that is the whole regex (and I'd try to make that the whole regex precisely because I could then avoid using the unrolled regex).

The real problem with this simple technique comes when you try to use it as part of a larger regex. For example, let's say you want to extract "comment blocks", that is, C-style comments that start at the beginning of a line and end at the end of a line. Using m#^/\*(.*?)\*/$#msg sure seems an easy way, and it even works for a lot of cases. However, consider this unlikely sample input:

which would return this list:

You see that .*? matches as little as possible, but it will match more if matching more allows the entire regex to match (or to match earlier) when matching less causes the regex as a whole to fail (or to match later).

If I find myself wanting to use the loop unrolling technique, then I usually try to rework the problem by parsing in smaller chunks. Though, if these chunks start getting really small (like my parser starts having to deal with single characters in lots of cases), then I may use some of the simplest examples of unrolled regex loops.

- tye (but my friends call me "Tye")
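To illustrate the point about .*? inside a larger regex, here is a small sketch. The sample input is made up for this illustration (it is not the input from the post above), and the unrolled pattern is the commonly cited form, used here only for comparison.

```perl
use strict;
use warnings;

# Hypothetical input: line 1 has code after its comment, so only the
# comment on line 2 is a "comment block" filling a whole line.
my $src = "/* first */ int x;\n/* second */\n";

# Lazy version: after the first */ the $ fails, so .*? keeps growing
# until a later */ happens to sit at the end of a line.
my @lazy = $src =~ m#^(/\*.*?\*/)$#msg;

# Unrolled version: the comment body cannot run past the first */,
# so the attempt on line 1 simply fails.
my @unrolled = $src =~ m#^(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)$#mg;

print "lazy:     [$_]\n" for @lazy;
print "unrolled: [$_]\n" for @unrolled;
```

The lazy version returns one "block" that swallows the code between the two comments, while the unrolled version returns only "/* second */".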
Re: Unrolling the loop technique
by mugwumpjism (Hermit) on Jun 19, 2001 at 13:27 UTC
Presumably, because they want to match the strings:
Without $1 being set to anything (that's what the "?:" after the opening bracket does).

Seriously, give us one or two of the actual regular expressions that are confusing you. The way you have it written, it's exactly equivalent to:

delimiter (?:special|normal)* delimiter

Assuming that each word is supposed to be a regex atom.

Update: Re-read the question :-} The reason why you'd want to do this would be if "special" is a pattern that matches the escaped delimiter and an escaped escape pattern, so that you can include the delimiters in the data.

Update: Someone just suggested that you'd want to unroll it for speed's sake... I recommend doing some speed profiling and seeing for yourself the kind of difference it makes... then deciding whether obfuscating your regular expressions is worth that speed increase. Hint: regular expressions are first compiled to an internal "deterministic acceptor", so it makes very little difference.
by chipmunk (Parson) on Jun 19, 2001 at 18:40 UTC
The problem should become clear by the time it finishes. ;)

Jeffrey Friedl goes into much more detail in Mastering Regular Expressions, which is where I grabbed the regexes I used above.

However, if you run this program under perl5.6, you won't have as much time to figure out the issue, because there are improvements to the regex engine in that version which fix the problem!
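A rough sketch of the kind of comparison described here, not the program from the post. Both patterns and the test string are assumptions chosen for illustration: the first pattern nests a + inside a *, which gives a backtracking engine many ways to split the same run of characters, and the second is the unrolled equivalent. The subject is kept short so the script finishes quickly on any perl.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Nested quantifiers: (...+)* can carve up a run of ordinary characters
# in many different ways when the overall match is going to fail.
my $alternation = qr/"(?:\\.|[^"\\]+)*"/;

# Unrolled form: only one way to consume the same characters.
my $unrolled = qr/"[^"\\]*(?:\\.[^"\\]*)*"/;

# An opening quote with no closing quote, so every attempt must fail.
my $subject = '"' . ('a' x 20);

for my $case (['alternation', $alternation], ['unrolled', $unrolled]) {
    my ($name, $re) = @$case;
    my $start   = time;
    my $matched = $subject =~ $re ? 1 : 0;
    printf "%-12s matched=%d  %.6f s\n", $name, $matched, time - $start;
}
```

Neither pattern matches this subject. On a perl with the regex engine improvements mentioned above, both timings will likely be tiny; lengthening the run of a's is the way to look for a difference on an older perl.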
Re: Unrolling the loop technique
by Anonymous Monk on Jun 20, 2001 at 00:15 UTC
Basically, I'm after an understanding of the ghostly depths of what's going on at the engine level, beneath the surface dazzle of completing expressions. The fundamentals. Thanks again.