Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have two regular expressions (thanks to you guys explaining in great detail how this all works) that to my mind are logical equivalent (well excluding one case, but I think I want the extra case anyway). The expression I would like is:
/.*(\[*\w*\@*\-*\w*[$ #\%>~]\]|\\\[\\e\[0m\\\] \[0m)\s?/
What I think it does is:
    match as much as possible
    possibly match [
    possibly match some word characters
    possibly match some @
    possibly match some -
    possibly match some word characters
    match either $ ' ' # % > or ~
    possibly match ]
    or it matches weird escape sequence.

    and maybe a whitespace

Now I think that it should be logically equivalent to the expression below (except for the possibility of having [w@-w ] which was not allowed before, but I am ok with that due to serious clarity and mantainibility issues with this)
/.*([$ #\%>~]|\[*\w*\@*\-*\w*\%\]*|\[*\w*\@*\-*\w*#\]*|\[*\w*\@*\-*\w* +\$\]*|\[*\w*\@*\-*\w*>\]*|\\\[\\e\[0m\\\] \[0m)\s?/

Tomorrow I will probably convert most (if not all) of the * to ? because somewhere I heard greedyness is the root of all evil.
Thanks for your patients

Replies are listed 'Best First'.
Re: regex logical equivalence?
by davido (Cardinal) on Feb 04, 2004 at 09:17 UTC
    Before I get into demonstrating how to use YAPE::Regex::Explain I need to point out a few mistakes you're consistantly making.

    .* is rarely necessary at the beginning of a RE. You probably don't need it as the first token of your RE's unless you are later using $&, or unless you're wrapping it in parens and using a $1 (etc) capturing variable. See Death to Dot Star for additional reading on this subject.

    ? is not a non-greedy substitute for *. *? is the nongreedy zero-or-more quantifier.

    The * quantifier will allow empty strings to match. In other words, "\w*" will match one, two, hundreds, thousands of word characters, but it will also match no characters at all. Is this what you want? Maybe you want the + quantifier instead.

    Ok, here we go again with the deciphering. This time I'm not going to do it by hand, but rather will demonstrate effective use of a great module:

    use strict; use warnings; use YAPE::Regex::Explain; #my $exp = YAPE::Regex::Explain->new($REx)->explain; my $rex1 = qr/.*(\[*\w*\@*\-*\w*[$ #\%>~]\]|\\\[\\e\[0m\\\] \[0m)\s?/; my $rex2 = qr/.*([$ #\%>~]|\[*\w*\@*\-*\w*\%\]*|\[*\w*\@*\-*\w*#\]*|\[ +*\w*\@*\-*\w*\$\]*|\[*\w*\@*\-*\w*>\]*|\\\[\\e\[0m\\\] \[0m)\s?/; print YAPE::Regex::Explain->new($rex1)->explain; print YAPE::Regex::Explain->new($rex2)->explain; __OUTPUT__
    It doesn't look to me like they're completely equivilant.

    Also, please take an hour or so and read through perlrequick, perlretut and perlre. Until you've devoured those POD's you're going to be grasping at straws with regular expressions. If you really want to learn them inside and out, beg, buy, borrow, or steal (ok, don't steal) the Owls book, by Jeffrey Friedl, Mastering Regular Expressions. It's an O'Reilly book, and probably the best book ever written on regexps.

    Updated: Added link, suggested by broquaint.


Re: regex logical equivalence?
by Roger (Parson) on Feb 04, 2004 at 10:38 UTC
    The good thing about Perl regex is that you don't have to write the entire regex in one go, you can build it part by part programmatically. I often construct my regex in the maner as shown below to improve readability (at least to myself) and maintainability:
    use strict; use warnings; # allowed prompt characters my $prompt_regex = '[>$%@~# ]'; # allowed characters in the prompt my $allowed_regex = '[\w@\-\.]*'; # a list of allowed prompt patterns my @patterns = ( '\[?' . $allowed_regex . $prompt_regex . '\]?', quotemeta '\\[\\e[0m\\] [0m', ); # build my regex dynamically my $regex = '(' . join('|',@patterns) .')\s?'; # test the regex print "regex: /$regex/\n\n"; while (<DATA>) { chomp; /$regex/ && printf "%-30smatched: %s\n", $_, $1; } __DATA__ blah crap crap [roger@www#] blah blah foo [] crap crap \[\e[0m\] [0m foo bar

    And the output -
    regex: /(\[?[\w@\-\.]*[>$%@~# ]\]?|\\\[\\e\[0m\\\]\ \[0m)\s?/ blah matched: crap crap matched: [roger@www#] blah blah foo matched: [roger@www#] [] crap crap matched: [] \[\e[0m\] [0m foo bar matched: \[\e[0m\] [0m

Re: regex logical equivalence?
by MCS (Monk) on Feb 04, 2004 at 17:13 UTC

    Replacing all the * with ? is not necessarily what you want to do. For example \d* will match 1234567, 123, 1, etc... \d? will only match 1 in the previous example. It will also match nothing. (since the ? means optional)

    If you are talking about clarity and maintenance, I would go for the first one because of conciseness. However, I would encourage you to rewrite it. The second one has way too many options and will require a LOT of backtracking if it doesn't match the first time.

    While I don't know your exact setup, I can try and help guide you to rewriting this horrible mess of a regex. First, do you need to match multiple square brackets at the beginning? Odds are, you only need one... is it optional? If it's like my prompt, I only have one and it's always there so I could just put a \ infront of it (without the *) if it is optional, you could append it with "?". Now to \w*... can it be nothing? If not, it's probably better to use \w+ (one or more word characters) \@* can you have @@@@@? if not, I'd use \@? What might be better at the beginning would be something like /[\[\w\@\-]+[$\s#\%>~]\]/ Other things to think about are spaces and other characters that may be in the prompt. Also if you are not capturing the prompt, you can try replacing the ASCII escape sequences with // (nothing) before you parse it to help clean it up a bit.

    One final note, * isn't "evil" it's just used improperly *a lot*. If you find yourself using it, think... "is there another way I can do this?" Usually there is.

Re: regex logical equivalence?
by jarich (Curate) on Feb 05, 2004 at 06:14 UTC
    I think you might find it worthwhile to learn about /x at this point as these regular expressions could certainly do with some commenting. /x isn't hard or scary at all. All you have to do is rememeber to escape the whitespace you want and the #s. It makes regular expressions much easier to explain.

    Just to make things more confusing ;) I'm going to swap the order of these two expressions, so my first one will be the longer of the two and I'll work on the the shorter (my second - your first) as I think that's the one you wanted to focus on.

    To determine if your two regular expressions are suffiently equivalent we need to compare them.

    Now we need to consider what patterns will match one, but not the other... (I'm going to assume you are missing a * up there next to your ], if not, then these aren't very equivalent at all).
    • Any 1 space, $, #, %, > or ~ will be matched by both.
    • The escape sequence: \[\e[0m\\] [0m is allowed by both.
    • Each pattern: [w@-w$], [w@-w#], [w@-w%], [w@-w~] is allowed by both.
    • [w@-w ] is (as you shown) is allowed by the second but not the first (this is easy to fix)

    Like you, I can only spot this one significant difference between the two regular expressions (once you fix your typo).

    Note that this equivalence won't necessarily remain true if you change your quantifiers. In particular if you change all of your *s to ?s. If you want my opinion I suspect you're actually looking more for a regular expression like this:
    /.* # Stuff ( # START capturing to $1 \ # exactly 1 space | # OR \[? # 0 or 1 [ \w* # 0 or more word characters \@? # 0 or 1 @ [-\w.]* # 0 or more word chars, dots and hyphens eg +w-w.w-.w [$\#\%>~] # exactly 1 of $, #, %, > or ~ \]? # 0 or 1 ] | # OR \\\[\\e\[0m\\\]\ \[0m # the sequence: \[\e[0m\\] [0m ) # END of $1 \s? # 0 or 1 spaces /x

    But I may be wrong - you may not be interested in the dot at all. ;) I'm not 100% certain that you want the .* at the front though. Do you have some sample data for us?

    I hope you recognise that both expressions will match any string with a single space in it... which will be most strings....

    I hope this helps.


Re: regex logical equivalence?
by mrpeabody (Friar) on Feb 05, 2004 at 05:50 UTC
    Your biggest mistake is replacing a long regex that is nearly impossible to understand with an even longer one that is actually impossible to understand. Since you're rewriting the expression anyway, take the opportunity to expand and comment it using /x.

    You want your finished product to be good code, not "a mess like before, except worse".

Re: regex logical equivalence?
by Sol-Invictus (Scribe) on Feb 05, 2004 at 08:48 UTC
    From your other node you should appreciate the need for readability in regex. The guy that wrote the original regex didn't think much about readability at all.

    there are a couple of things you can do as well as using \X, like:

    different delimiters

    in a regex with a couple of escapes like this /\@\\\// you can change the /../delimiters to any character you want, as long as they are paired, like this !\@\\\/! or #\@\\\/#

    block quoting

    in a regex with lots of escaped characters use \Q...\E notation. This works in the same way that single quotes work for strings, meaning the characters are interpeted exactly as they are written - !\Q$@.\^[{\E! is more readable than !\$\@\.\\\^\[\{!

    <lecture mode = on>

    Finally as others have already pointed out, Perl's regex are very powerful and as such difficult to "just pick up", you have to read up on regex to even begin to understand how to use them - as a beginner you should probably read these first:

    perlrequick Perl regular expressions quick start perlretut Perl regular expressions tutorial perlfaq6 Regexes
    </lecture mode>

    ----Signature: You spend twenty years learning the spell that makes nude virgins appear in your bedroom, and then you're so poisoned by quicksilver fumes and half-blind from reading old grimoires that you can't remember what happens next.