Re: RFC - Regular Expressions Tutorial, the Basics (for BEGINNERS) by demerphq (Chancellor)
on Jan 24, 2007 at 13:32 UTC
It's a little strange to me how few regexp tutorials start with the basics and move on from there. Maybe I'm too close to the trees, or maybe I'm too advanced to see what a beginner would need, but it strikes me that omitting the basics is a bad start.
There are five fundamental building blocks of a regular expression: "characters", "concatenation", "alternation", "grouping", and "Kleene closure".
Characters are literal characters that must be matched. A character is matched by finding its leftmost occurrence in the input string.
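A quick sketch of the "leftmost" rule, using a made-up input string. It relies on Perl's built-in @- array, which holds the start offset of the last successful match; the variable names are just for illustration.

```perl
use strict;
use warnings;

# "banana" contains "an" twice (offsets 1 and 3); the match is the leftmost one.
my $text  = "banana";
my $start = -1;                 # -1 means "no match" in this sketch
if ($text =~ /an/) {
    $start = $-[0];             # $-[0] is the offset where the match began
}
print "matched 'an' at offset $start\n";
```

This should report offset 1, not 3, because the engine settles for the first position where the pattern can match.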
Concatenation is the principle that two subpatterns written next to each other, with no operator between them, must match one after the other. Concatenation is implied in a pattern: there is no special operator for it. It has the lowest precedence of all operators except for alternation.
Alternation is the way to say "match this subpattern or that subpattern". It is denoted by putting a | symbol in between the two subpatterns. Alternation has the lowest precedence of all the operators.
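To see how the two precedences interact, here is a small sketch with an invented word list. Because concatenation binds tighter than alternation, cat|dog reads as "cat" or "dog", not as "ca", then "t" or "d", then "og".

```perl
use strict;
use warnings;

# Keep every word that contains either "cat" or "dog".
my @words = ("hotdog", "catalog", "bird");
my @hits  = grep { /cat|dog/ } @words;
print "matched: @hits\n";
```

"hotdog" and "catalog" are kept and "bird" is not. Neither would match if the | only applied to the two letters next to it.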
Grouping is a way to combine multiple components into a self-contained subpattern. Alternation is often placed inside a grouping construct. In Perl, grouping is denoted by putting the subpattern in parentheses.
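A minimal sketch of why alternation usually ends up inside a group. The test words and hash names are only for illustration.

```perl
use strict;
use warnings;

my (%grouped, %ungrouped);
for my $word ("gray", "grey", "key") {
    # With grouping: "gr", then "a" or "e", then "y".
    $grouped{$word}   = $word =~ /gr(a|e)y/ ? 1 : 0;
    # Without grouping: either "gra" or "ey" anywhere in the string.
    $ungrouped{$word} = $word =~ /gra|ey/   ? 1 : 0;
    print "$word: grouped=$grouped{$word} ungrouped=$ungrouped{$word}\n";
}
```

Both patterns accept "gray" and "grey", but only the ungrouped one accepts "key", since it merely asks for "ey" somewhere in the string.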
Kleene closure matches zero or more consecutive repetitions of a subpattern. It is denoted by a postfix * operator, or in less technical terms, by placing a * after the subpattern.
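Here is a rough illustration of "zero or more", with sample strings picked for the purpose. The * applies only to the thing right before it, which in ab*c is the b.

```perl
use strict;
use warnings;

# ab*c means: "a", then any number of "b" (including none), then "c".
my %star;
for my $str ("ac", "abc", "abbbc", "adc") {
    $star{$str} = $str =~ /ab*c/ ? 1 : 0;
    print "$str: $star{$str}\n";
}

# Zero repetitions is a valid match, so b* on its own matches any string.
my $empty_ok = "xyz" =~ /b*/ ? 1 : 0;
print "b* against 'xyz': $empty_ok\n";
```

The last line is a common beginner surprise: a pattern made only of starred pieces always succeeds, because it can match nothing at all.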
It turns out that many of the common tasks one would wish to perform with a regex are quite clumsy when restricted to such a sparse language. Therefore various extensions have been made which allow common constructs to be written more elegantly.
It's common to want to match one or more repetitions of a subpattern. While this can be expressed using Kleene closure alone, it is clumsy, so the postfix plus operator is provided. P+ is defined to match the same thing as PP*.
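As a sanity check on that definition, this sketch runs b+ and bb* against the same invented samples and compares the results.

```perl
use strict;
use warnings;

my @samples = ("ac", "abc", "abbbc", "xyz");
my @plus    = map { /ab+c/  ? 1 : 0 } @samples;   # one or more "b"
my @closure = map { /abb*c/ ? 1 : 0 } @samples;   # a "b", then zero or more "b"
print "b+ : @plus\n";
print "bb*: @closure\n";
```

Both lists should come out the same, and "ac" should fail in both because at least one b is required.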
It's common to want to match any one of several characters at a given point in a string. Therefore the bracketed "character class" construct is provided. [ABC] matches the same text that (A|B|C) matches. Note that this is restricted to single characters and not longer subpatterns.
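The same kind of comparison works for character classes. This is only a sketch with made-up words; the point is that the class stands for exactly one character at that position.

```perl
use strict;
use warnings;

my @samples = ("bat", "bit", "but", "bt", "boot");
my @class   = map { /b[aeiou]t/       ? 1 : 0 } @samples;   # character class
my @alt     = map { /b(a|e|i|o|u)t/   ? 1 : 0 } @samples;   # spelled out with alternation
print "class: @class\n";
print "alt:   @alt\n";
```

"bt" fails because there is no vowel at all, and "boot" fails because the class matches a single character, not two.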
The ability to optionally match something is a common requirement. Therefore the ? postfix operator is provided. P? matches the same thing as (P|) matches (P or nothing).
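And one more equivalence check, this time for the optional operator. The sample spellings are just for illustration; (u|) is an alternation whose second branch is empty.

```perl
use strict;
use warnings;

my @samples  = ("color", "colour", "colouur");
my @optional = map { /colou?r/    ? 1 : 0 } @samples;   # "u" is optional
my @either   = map { /colo(u|)r/  ? 1 : 0 } @samples;   # "u" or nothing
print "u?  : @optional\n";
print "(u|): @either\n";
```

Both forms accept "color" and "colour" and reject "colouur", since ? allows at most one occurrence.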
Anyway, just some thoughts for you. Obviously it all could use more polishing, but it's basic material that I think makes it easier to understand regexes.