Regular Expressions

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regular Expressions by imp (Priest) on Nov 27, 2006 at 14:08 UTC
YAPE::Regex::Explain is very helpful when debugging a regular expression. You can use it like this:

    use strict;
    use warnings;
    use YAPE::Regex::Explain;

    my $regexp = qr/^(.*?)((=<)|[<=>])(.*)/;
    my $exp = YAPE::Regex::Explain->new($regexp);
    print $exp->explain;

The output is as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      ^                        the beginning of the string
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        .*?                      any character except \n (0 or more times
                                 (matching the least amount possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      (                        group and capture to \2:
    ----------------------------------------------------------------------
        (                        group and capture to \3:
    ----------------------------------------------------------------------
          =<                       '=<'
    ----------------------------------------------------------------------
        )                        end of \3
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        [<=>]                    any character of: '<', '=', '>'
    ----------------------------------------------------------------------
      )                        end of \2
    ----------------------------------------------------------------------
      (                        group and capture to \4:
    ----------------------------------------------------------------------
        .*                       any character except \n (0 or more times
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
      )                        end of \4
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

Another useful tool is the 'x' modifier, which allows whitespace in the regex. I consider regex to be an extremely dense programming language, and without whitespace to organize your thoughts it is very easy to get lost in the noise. Here is your regex, using the 'x' modifier:

    my $re = qr{
        ^(.*?)
        (
            (=<)
            |
            [<=>]
        )
        (.*)
    }x;

When writing a large regex there is a tradeoff between accuracy and readability. It is sometimes tempting to keep it simple so the regex is maintainable. 'x' is useful for addressing this problem, as you can put comments in the regex. Here is a revised regex for you:

    my $re = qr{
        ^                   # Beginning of string
        \s*                 # Optional whitespace
        ([a-zA-Z0-9_]+)     # Capture(1) Alphanumeric LHS
        \s*                 # Optional whitespace
        (                   # Capture(2) either:
            [<>!]=          #   <=, >=, !=
            |
            [<>=]           #   <, >, =
        )
        \s*                 # Optional whitespace
        ([a-zA-Z0-9_]+)     # Capture(3) Alphanumeric RHS
    }x;

And if you would like to make it more readable you can separate some of the tokens into other variables, like this:

    my $operand = '[a-zA-Z0-9_]+';
    my $re = qr{
        \A                  # Beginning of string
        \s*                 # Optional whitespace
        ($operand)          # Capture(1) Alphanumeric LHS
        \s*                 # Optional whitespace
        (                   # Capture(2) either:
            [<>!]=          #   <=, >=, !=
            |
            [<>=]           #   <, >, =
        )
        \s*                 # Optional whitespace
        ($operand)          # Capture(3) Alphanumeric RHS
    }x;

I noticed that your example input allowed '=>' instead of '>='. Maybe that is allowed in your locale? Here is a functional test script for you. It matches the items documented in the regex, but does not match '=>' or '=<':

    use strict;
    use warnings;

    my $operand = '[a-zA-Z0-9_]+';
    my $re = qr{
        ^                   # Beginning of string
        \s*                 # Optional whitespace
        ($operand)          # Capture(1) Alphanumeric LHS
        \s*                 # Optional whitespace
        (                   # Capture(2) either:
            [<>!]=          #   <=, >=, !=
            |
            [<>=]           #   <, >, =
        )
        \s*                 # Optional whitespace
        ($operand)          # Capture(3) Alphanumeric RHS
    }x;

    while (my $line = <DATA>) {
        # A match in list context returns the captures, or an empty list on failure
        if (my ($lhs, $operator, $rhs) = $line =~ $re) {
            printf " (%4s) (%2s) (%4s)\n", $lhs, $operator, $rhs;
        }
    }

    __DATA__
    a=b
    a!=b
    a<b
    a>b
    a=>b
    a=<b
Re: Regular Expressions by johngg (Canon) on Nov 27, 2006 at 14:17 UTC
Given that in your examples the operands are in the character class `[a-z]` you could `split` on the boundary between operand and operator using look-behinds and look-aheads. You `split` either at a point preceded by the character class `[a-z]` and followed by the negated character class `[^a-z]` or vice versa, like this

    use strict;
    use warnings;

    print
        map {qq{$_->[0] -- $_->[1] -- $_->[2]\n}}
        map {
            [
                split m {(?x)
                    (?:
                        (?<=[a-z])(?=[^a-z])
                        |
                        (?<=[^a-z])(?=[a-z])
                    )
                }
            ]
        }
        map {chomp; $_}
        <DATA>;

    __END__
    a=b
    a!=b
    a<b
    a<=b
    a>b
    a>=b

which gives this output

    a -- = -- b
    a -- != -- b
    a -- < -- b
    a -- <= -- b
    a -- > -- b
    a -- >= -- b

Given a more complex set of operators you would probably be better off taking grinder's approach of setting up a regular expression that matches and captures any operator. As complexity increases a parser solution becomes more appropriate.

I hope this is of use.

Cheers, JohnGG
Re: Regular Expressions by grinder (Bishop) on Nov 27, 2006 at 14:03 UTC
=< doesn't look like any operator I've ever met, but still, assuming you want to match =, !=, <, >, <= and >=, note that the operators involving less than and greater than are different, in that they may be followed by an = (equals), accounting for two more operators for free. That gives us `[<>]=?`

That leaves = and !=. This is just equals, maybe preceded by an exclamation mark. This gives `!?=`

Now all the operators have been accounted for. Putting them together in a capturing group with an alternation gives: `([<>]=?|!?=)`

Dividing the atoms you want to match into different groups is usually the best way of coming up with an expression that matches all of them. Also, you want to consider patterns that share a similar beginning, since this way you'll end up with a regular expression that doesn't have to backtrack.

If you really meant to match =<, then with the above approach you should be able to come up with something that works. Look at all the operators that start with an =, and then the remaining operators that don't.

• another intruder with the mooring in the heart of the Perl
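
A quick way to check grinder's pattern is to run it against each operator in turn. The following is only a minimal sketch: the `$op` variable, the `is_operator` helper and the `\A ... \z` anchoring are assumptions added here for testing and are not part of the original reply.

```perl
use strict;
use warnings;

# grinder's alternation: < or > with an optional =, or = with an optional !
my $op = qr/([<>]=?|!?=)/;

# Hypothetical helper: true only if the whole string is a single operator.
sub is_operator {
    my ($text) = @_;
    return $text =~ /\A$op\z/ ? 1 : 0;
}

for my $candidate (qw( = != < > <= >= =< => )) {
    printf "%-2s %s\n", $candidate,
        is_operator($candidate) ? 'matches' : 'no match';
}
```

With the anchors in place, the six intended operators match and `=<` and `=>` do not, which is the starting point for the exercise grinder suggests.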
Re^2: Regular Expressions by ambrus (Abbot) on Nov 27, 2006 at 21:48 UTC
I've had the misfortune to meet the `=<` operator: Prolog uses it for less than or equal to. However, `=>` is not a relational operator I've ever met; I've only seen it used as an arrow.
Re: Regular Expressions by Locutus (Beadle) on Nov 27, 2006 at 14:39 UTC
You should find the second operand in `$4` after any successful match of your regular expression. If you omit the unnecessary brackets around `=<` it'll show up in `$3`, where I guess you were expecting it.

(BTW: You're probably not trying to parse Perl code, are you? Otherwise you want to look for "`a==b`" instead of "`a=b`", for "`a>=b`" instead of "`a=>b`", and for "`a<=b`" instead of "`a=<b`".)

Anyway, your current regular expression won't recognize the operators `!=` or `=>` at all. Unless your program is supposed to recognize and handle the input of undefined operators, `split /(\W+)/` (as suggested in another reply) might be a sufficient alternative. If you want to throw an error message on something like `a=[b` you can use `/^(.*?)(!=|=>|=<|[<=>])(.*)/` and react appropriately if there's no match.
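
The capture numbering Locutus describes can be seen by running both patterns side by side. This is a hedged sketch: the names `$nested`, `$flat` and `parse_comparison` are made up for illustration, and the patterns are the ones quoted in this thread.

```perl
use strict;
use warnings;

# Pattern from the question: the extra brackets around =< form group 3,
# which pushes the second operand out to group 4.
my $nested = qr/^(.*?)((=<)|[<=>])(.*)/;

# Locutus's suggestion: no inner group, with != and => added as alternatives.
my $flat = qr/^(.*?)(!=|=>|=<|[<=>])(.*)/;

# Hypothetical helper: returns (lhs, operator, rhs), or an empty list on no match.
sub parse_comparison {
    my ($text) = @_;
    return $text =~ $flat ? ($1, $2, $3) : ();
}

if ('a=<b' =~ $nested) {
    print "nested: operator in \$2 ($2), second operand in \$4 ($4)\n";
}

for my $input ('a=<b', 'a!=b', 'a=[b', 'ab') {
    my @parts = parse_comparison($input);
    print @parts ? "$input: @parts\n" : "$input: no operator found\n";
}
```

One thing the sketch makes visible: because the operands are `.*?` and `.*`, an input such as `a=[b` still matches, with `[b` as the second operand. Rejecting it would need stricter operand patterns, for example imp's `[a-zA-Z0-9_]+`.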
Re: Regular Expressions by Anonymous Monk on Nov 27, 2006 at 12:46 UTC
Why are you using `^`? Try `split /(\W+)/`.
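
For reference, here is roughly what that `split` returns. This is a small sketch; `split_comparison` is a hypothetical wrapper used only to make the behaviour easy to see.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of non-word characters. The capturing
# parentheses make split return the separators (the operators) as well.
sub split_comparison {
    my ($text) = @_;
    return split /(\W+)/, $text;
}

for my $input ('a=b', 'a!=b', 'a=<b', 'a = b') {
    my @parts = split_comparison($input);
    print join(' | ', map { "'$_'" } @parts), "\n";
}
```

Note that any run of non-word characters is treated as the operator, so surrounding whitespace ends up inside it (`' = '` for `a = b`) and nothing checks that the operator is a valid one.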
Re: Regular Expressions by Moron (Curate) on Nov 27, 2006 at 17:22 UTC
I second JohnGG's parser approach and would go further: there seems to be a questionable hobby among some people of trying to solve things with a single regexp, creating in the process a real Godzilla of a regexp that would be rather difficult to maintain in the future.

The advantage of a parser is that the code follows directly from the language rules. In any case, a parser normally needs a thrower to move past flexible whitespace and comments, and a lexer to get and identify language elements such as a quoted string, an operator or an identifier, since a combination of possibilities may be allowable at each step in the parser's run.

-M

Free your mind
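
To make the lexer idea concrete, here is a minimal sketch using `\G` with the `/gc` flags. Everything in it is an assumption made for illustration: the token names, the `#` comment syntax, the `tokenize` function and the rule order are not from any post in this thread, and a real language would need its own rules.

```perl
use strict;
use warnings;

# Hypothetical token rules, tried in order; the names are illustrative only.
my @rules = (
    [ STRING     => qr/"(?:[^"\\]|\\.)*"/ ],
    [ OPERATOR   => qr/[<>]=?|!?=/        ],   # grinder's operator pattern
    [ IDENTIFIER => qr/[A-Za-z_]\w*/      ],
    [ NUMBER     => qr/\d+/               ],
);

# Returns a list of [type, text] pairs, or dies on unrecognised input.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length($text)) {
        next TOKEN if $text =~ /\G\s+/gc;        # move past whitespace
        next TOKEN if $text =~ /\G#[^\n]*/gc;    # move past a comment (assumed # syntax)
        for my $rule (@rules) {
            my ($type, $pattern) = @$rule;
            if ($text =~ /\G($pattern)/gc) {
                push @tokens, [ $type, $1 ];
                next TOKEN;
            }
        }
        die sprintf "Unexpected character '%s' at offset %d\n",
            substr($text, pos($text), 1), pos($text);
    }
    return @tokens;
}

print join(' ', map { $_->[0] . '(' . $_->[1] . ')' }
    tokenize('count >= 10 # check')), "\n";
```

Each rule stays small and readable, and unrecognised input such as `a ? b` raises an error instead of being silently accepted. A parser would then consume the `[type, text]` pairs according to the grammar.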

