Capturing parenthesis and grouping square brackets
by Eily (Hermit) on Jun 17, 2013 at 23:00 UTC
Some of you may have recognized the scent of Perl 6 in the title, but it's actually something I first thought of a while back, when I was still a complete Perl novice. It is, however, about regexes. To avoid misunderstandings, I make a distinction between regular expressions, the 'simple' pattern matching format, and regexes, the superset of regular expressions provided by Perl.
This meditation is about parentheses (round brackets) being capturing by default, and (?:*insert non captured group here*) being quite the mouthful. This is mostly addressed by Perl 6's fifth Apocalypse, where non-capturing grouping is revealed to be done with square brackets, but I still think it should be the opposite.
It's quite simple actually: parentheses are the obvious way to do grouping, because that's what they do pretty much everywhere in programming languages, and even in math (it may be the other way around, now that I think about it). Parentheses change which operations you read together and tokenize expressions, and that's pretty much it. Of course you wouldn't have to search very far to find another use for parentheses; as a matter of fact I'm already talking about Perl.
Since parentheses are the obvious way to do it, someone may, like I did when I first tried working with Perl, use them without checking that part of the documentation, and thus not know that capturing groups have been created. This probably isn't a performance issue, because if you can't be bothered to read the documentation well enough to know about that, there are probably other things you fail to optimize. It becomes an issue when it can break code, and the example I came across is split.
"If the PATTERN contains parentheses, additional array elements are created from each matching substring in the delimiter."
So if you don't know your Perl well, have a split on a fixed run of four 'na's, and decide that batmen should be allowed to be separated by more or fewer than four 'na's, then, knowing already what * and + do, you may wrap the 'na' in parentheses and put a quantifier after it. And there you end up with 'na's in your @batmen, what a pity!
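To make this concrete, here is a minimal sketch of the trap. The $song string and the array names are made up for illustration; only the behaviour of split is the point.

```perl
use strict;
use warnings;

# Hypothetical input: batmen separated by runs of 'na' of varying length.
my $song = 'batman' . ('na' x 4) . 'batman' . ('na' x 2) . 'batman';

# Parentheses used "just for grouping" also capture, so split adds the
# captured text (the last 'na' of each run) to the result.
my @captured = split /(na)+/, $song;
# ('batman', 'na', 'batman', 'na', 'batman')

# The non-capturing group only groups, so the separators are dropped.
my @grouped = split /(?:na)+/, $song;
# ('batman', 'batman', 'batman')

print "captured: @captured\n";
print "grouped:  @grouped\n";
```

The only difference between the two patterns is the ?: after the opening parenthesis, which is easy to miss if you do not know to look for it.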
On the other hand, if you don't know much Perl and read something like (Perl 5) /(?:bat|spider)man/ or (Perl 6) /super[tramp|man| time]/, you may think that something strange is happening, when you are just grouping.
This is an issue for people who know regular expressions but not Perl. They would write patterns with only that knowledge, use parentheses because that's how grouping is supposed to be done, and might come up with something unexpected in a split, or in some other way I haven't thought about. My previous example still stands in the other direction: those people wouldn't understand [ch?|b]ar or I wish I was the m(?:oo)+n, even though the unknown syntax doesn't mean that a Perl feature missing from regular expressions has been used. This paragraph should actually have been my main point.
So I was wondering: is there any reason for parentheses to have a capturing feature in regexes, except for the fact that this is how it has always been done?
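For the Perl 5 case, a short sketch can show that (?:...) really is nothing more than grouping. The $hero variable is hypothetical; the sketch relies on the fact that a match in list context returns its captures, or (1) when the pattern has no capturing group.

```perl
use strict;
use warnings;

my $hero = 'spiderman';

# Capturing group: the match succeeds and hands back what it captured.
my @with_capture = $hero =~ /(bat|spider)man/;
# ('spider')

# Non-capturing group: same alternation, same match, nothing captured,
# so a successful match in list context returns (1).
my @without_capture = $hero =~ /(?:bat|spider)man/;
# (1)

print "with capture:    @with_capture\n";
print "without capture: @without_capture\n";
```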
And I'm afraid the ugly truth behind all this is that I'm French, with an AZERTY keyboard, where [ and ] are harder to type than ( and ), and I don't want the extra effort for not using a feature, because I'm lazy :P . (Edit: this is supposed to be taken as a joke. I do realize now that it's only obvious if you've used an AZERTY keyboard, and know that typing [ isn't any trouble.)
Edit: I just found part of the answer on my own. Some other regular expression extensions also use capturing parentheses inside the pattern itself, so that you can have backreference tokens such as \1. I just forgot to look beyond the basics of regular expressions. I hope I didn't bore those who read all that too much ^^".
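As an illustration of that use, here is a small sketch with a backreference. The pattern and test strings are my own example, not taken from any particular regular expression flavour's documentation.

```perl
use strict;
use warnings;

# \1 refers back to whatever the first parenthesized group matched,
# so this pattern finds a word immediately repeated after one space.
my $doubled = qr/\b(\w+) \1\b/;

my $has_repeat = 'na na batman' =~ $doubled ? 1 : 0;   # 1
my $no_repeat  = 'na batman'    =~ $doubled ? 1 : 0;   # 0

print "has_repeat=$has_repeat no_repeat=$no_repeat\n";
```

Here the parentheses have to capture, because \1 would have nothing to refer to otherwise.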