http://www.perlmonks.org?node_id=1039484

Some of you may have recognized the scent of Perl6 in the title, but it's actually something I first thought of a while back, when I was still a complete Perl novice. It is, however, about regexes. And to avoid misunderstandings, I make a distinction between regular expressions, the 'simple' pattern matching format, and regexes, the regular expression superset provided by Perl.

This meditation is about parentheses (round brackets) being capturing by default, and (?:*insert non captured group here*) being quite the mouthful. This is mostly addressed by Perl 6's fifth apocalypse, where non-capturing grouping is revealed to be done with square brackets, but I still think it should be the opposite.

It's quite simple actually: parentheses are the obvious way to do grouping, because that's what they do pretty much everywhere in programming languages, and even in math (it may be the other way around, now that I think about it). Parentheses change which operations you read together, and tokenize expressions, and that's pretty much it. Of course you wouldn't have to search very far to find another use for parentheses; as a matter of fact I'm already talking about Perl.

Since parentheses are the obvious way to do it, someone may, like I did when I first tried working with Perl, use them without checking that part of the documentation and thus not know that capturing groups have been created. This probably isn't a performance issue, because if you can't be bothered to read the documentation well enough to know about that, there are probably other things you fail to optimize. It's an issue when it can break code; the example I came across is split.

"If the PATTERN contains parentheses, additional array elements are created from each matching substring in the delimiter."

So if you don't know your Perl well and have something like

    $song = "nanananaaabatmannanananaaabatmannanananaaabatman";
    @batmen = split /nananana+/, $song;

and decide that batmen should be allowed to be separated by more or less than four 'na's, knowing already what * and + do, you may write

    $song = "nanananaaabatmannanaaabatmannananananaaabatman";
    @batmen = split /(na)*na+/, $song;

And there you end up with 'na's in your @batmen, what a pity!
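
To make the difference concrete, here is a minimal sketch that runs the same split twice, once with the capturing group and once with (?:...). The variable names @with_capture and @without_capture are only illustrative. Note that both results start with an empty field, because the string begins with a delimiter.

```perl
use strict;
use warnings;

my $song = "nanananaaabatmannanaaabatmannananananaaabatman";

# Capturing group: split also returns what (na) last matched
# in each delimiter, interleaved with the real fields.
my @with_capture = split /(na)*na+/, $song;

# Non-capturing group: only the fields between delimiters come back.
my @without_capture = split /(?:na)*na+/, $song;

print join('|', @with_capture), "\n";     # |na|batman|na|batman|na|batman
print join('|', @without_capture), "\n";  # |batman|batman|batman
```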

On the other hand, if you don't know Perl much and read something like (Perl 5) /(?:bat|spider)man/ or (Perl 6) /super[tramp|man| time]/ you may think that something strange is happening, when you are just grouping.

This is an issue for people who know regular expressions: they would try to use patterns created with only that knowledge, would use parentheses because that's how it's supposed to be done, and might come up with something unexpected in a split or in some other way I haven't thought about. My previous example still stands: those people wouldn't understand [ch?|b]ar or I wish I was the m(?:oo)+n, even though the unknown syntax doesn't mean that a Perl feature absent from regular expressions has been used. This paragraph should actually have been my main point.

So I was wondering whether there is any reason for parentheses to have a capturing feature in regexes, other than the fact that this is how it has always been done.

And I'm afraid the ugly truth behind all this is that I'm French, with an 'azerty' keyboard, where [ and ] are harder to type than ( and ), and I don't want the extra effort for not using a feature; because I'm lazy :P . (Edit: this is supposed to be taken as a joke. I do realize now that it's only obvious if you've used an AZERTY keyboard, and know that typing [ isn't any trouble.)

Edit: I just found part of the answer on my own. Some other regular expression extensions use capturing parentheses as well inside of the pattern, so that you can have \x tokens (backreferences such as \1). I just forgot to look beyond the basics of regular expressions. I hope I didn't bore those who read all that too much ^^".
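
As a small illustration of those \x tokens, assuming they refer to backreferences like \1: the sketch below uses a capture inside the pattern itself to find an immediately repeated word. The sample sentence and the variable $doubled are made up for the example.

```perl
use strict;
use warnings;

# \1 refers back to whatever the first capturing group matched,
# so this pattern only matches a word followed by itself.
my $text = "I wish I was the the moon";
my ($doubled) = $text =~ /\b(\w+) \1\b/;

print "doubled word: $doubled\n";   # doubled word: the
```

A non-capturing group could not be used here, since \1 needs a numbered group to refer to.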

Re: Capturing parenthesis and grouping square brackets
by ambrus (Abbot) on Jun 18, 2013 at 09:09 UTC

    I think round parentheses for captures are a good idea, for two reasons.

    The first is that perl makes simple things easy and complicated things possible. In simple regular expressions, capturing parentheses occur more frequently than non-capturing ones. Once you write more complicated regexen with nested branches and the like, you have to learn the full syntax.

    The second is that in simple cases, you can use capturing parentheses for parts of a regex that you don't want to capture, and just not refer to them later. In most programs this won't cause much of a problem. Extra capture groups become a maintenance burden only when you use large regexen with lots of parentheses, in which case you probably want named captures anyway, or they could become a slight performance problem if you're micro-optimizing your script. In both of these cases you should learn the full syntax and other details about the regex engine.
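
For readers who have not met named captures, here is a minimal sketch (they need Perl 5.10 or later). The sample line and the hash %date are invented for the illustration; the point is only that results are read by name from %+ instead of by position.

```perl
use strict;
use warnings;

my $line = "2013-06-18 09:09 ambrus";

# Numbered captures: the meaning of $1..$3 depends on group order.
my ($year, $month, $day) = $line =~ /^(\d{4})-(\d\d)-(\d\d)/;

# Named captures: values are looked up by name in %+, so adding
# or removing other groups does not change how they are accessed.
my %date;
if ($line =~ /^(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    %date = %+;   # copy, since %+ is reset by the next successful match
}

print "$year/$month/$day\n";                          # 2013/06/18
print "$date{year}/$date{month}/$date{day}\n";        # 2013/06/18
```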

    The original reason for round parentheses is simply historical. Round parentheses were always capturing in ed and sed and awk and ex (they're spelt \( \) in ed and sed and ex though), and I think in ancient perl too, whereas non-capturing parentheses are a newer invention.

Re: Capturing parenthesis and grouping square brackets
by hdb (Monsignor) on Jun 17, 2013 at 23:59 UTC

    Why not go all the way with your proposal?

    On a German layout keyboard (QWERTZ), the following characters are inconvenient to type: []{}\|. So your issue not only applies to regexes but to the language as a whole. A proper i18n solution would be required, where the German version would utilize the more accessible characters äöüÄÖÜ and ß instead. I am sure the productivity of the German Perl community would go through the roof. The Pörl community, I mean.

    In the meantime, once the developers stop looking at Perl > 5.20 and Perl 6, and get around to implementing localized versions of Perl, whatever they are called, I would like to advertise the use of UK layout keyboards which have all required characters in convenient locations.

    (?: Just my two cents\.)

      The Pörl community actually made me laugh :) . That last comment about asking all of this because I'm lazy wasn't supposed to be taken seriously. If that was really a concern for me I'd write in Python, not Perl. And switching from AZERTY to QWERTY is done easily enough. I do see where the joke went wrong though: it's not obvious that the [ and ] characters are actually extremely simple to type, and just require holding one key while pressing the other. I should have thought this through.

      My point with this meditation was actually about the unexpected behaviour that parentheses may lead to in regexes. The only simple example I could think of was the use of split, but I wondered if it couldn't have other repercussions.

        It is quite telling that you had to resort to split to find an example where the capturing feature of parentheses might lead to unexpected behavior. And how many programs that use split really utilize a regex rather than a simple record separator consisting of one or two characters?

        What is obvious to people and what is not, is far from obvious. Yes, parentheses are used to group items that belong together in expressions, also to express desired precedence. Perl programmers also expect them to capture parts of matches when used in regexes. A Perl beginner will not have problems with that feature most of the time. And my guess is that the beginner will switch from using $3 to $4 rather than using (?:...) (or [...]).

        So from my humble point of view, it is a non-issue. The over-riding principle should be that change involves costs and should be avoided unless the benefits significantly outweigh these costs.
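
The renumbering mentioned above (the $3 that becomes $4) can be seen in a small sketch. The subject string and the array names are made up for the illustration, and the shift here is from $2 to $3 rather than $3 to $4, but the mechanism is the same.

```perl
use strict;
use warnings;

my $subject = "Re: spiderman by hdb";

# Two capturing groups: the hero is $1, the author is $2.
my @plain = $subject =~ /^Re: (\w+) by (\w+)$/;

# Grouping the alternation with capturing parentheses inserts a new
# group, so the author moves from $2 to $3.
my @shifted = $subject =~ /^Re: ((bat|spider)man) by (\w+)$/;

# With (?:...) the alternation is grouped and the numbering is unchanged.
my @stable = $subject =~ /^Re: ((?:bat|spider)man) by (\w+)$/;

print join(',', @plain), "\n";     # spiderman,hdb
print join(',', @shifted), "\n";   # spiderman,spider,hdb
print join(',', @stable), "\n";    # spiderman,hdb
```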

Re: Capturing parenthesis and grouping square brackets
by rjt (Curate) on Jun 18, 2013 at 11:00 UTC

    Sometimes, change is good. But for something as entrenched as fundamental Perl regular expression syntax, that change has to be really good.

    Even if we were inventing regular expressions in a vacuum with no existing code out there (which we most certainly are not. I wonder what point size it would take to stretch all of the capture brackets in current production code, end-to-end, to the moon...), I'm not convinced. I type (?:) a lot less often than I type (), so I believe the shorter code should support the more common case, as has been a long-standing design tenet of Perl.

Re: Capturing parenthesis and grouping square brackets
by BrowserUk (Patriarch) on Jun 17, 2013 at 23:11 UTC
    And I'm afraid the ugly truth behind all this, is that I'm french, with an 'azerty' keyboard, where [ and ] are harder to type than ( and ), and I don't want the extra effort for not using a feature; because I'm lazy :P .

    Do you type your regexes at 60 wpm?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Capturing parenthesis and grouping square brackets
by moritz (Cardinal) on Jun 18, 2013 at 16:49 UTC

    If one were to design a new programming language, from first principles and in the void Universe, I think your proposal has merit.

    However there are two points that strongly speak against adopting it in Perl 6:

    First, Perl 6 doesn't exist in a void Universe. Most programmers that try out Perl 6 have some experience with other programming languages, and we should try not to break too much for them. We use * for multiplication, not because it's the best thing to do, but because everybody does it. We do some stuff differently than the rest, but only where there is a very good reason.

    Second, it's too late. This may surprise some folks, but there is a considerable Perl 6 code base and user base out there that would need rewriting and retraining. While the Perl 6 design team is still open to new ideas, there haven't been any radical changes to syntax or semantics in the last few years; and if there will be, it'll just delay things even more.

      I wonder how come Perl6 looks like it was designed in the void Universe then. And yes, the claim that there exists a Perl6 (I'd really rather not drop the space, let's help Google distinguish Perl from this ... thing) user base does surprise me and I bet quite a few others. Any supporting evidence?

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        (Shrug...)   My take-away has always been that it is a project “in a void universe” ... just another group of Monty Pythons looking for the Holy Grail, and bootstrapping against the reputation of and some of the syntax of “Perl” in the doing of it.   But, all of it to no useful effect.   Quibbling, yes, even here, about how many milliseconds it takes to trigger your right index-finger muscle one more time (or less, as the case may be) in a quest to write things in a way suitable to The Knights Who Say “–Er.”   They have been quibbling among themselves for more than six years, give-or-take, and by now I conclude that they will be doing so forever.   Even an encounter with the Holy Larry has not brought any closure to their efforts.   The mythos of “disruptive change” is highly over-rated.

Re: Capturing parenthesis and grouping square brackets
by grondilu (Friar) on Jun 18, 2013 at 03:17 UTC

    I'm not sure I understand the implications your point could have on Perl 6. Do you suggest that it might make sense to do the non-capturing groupings with parentheses, and the capturing ones with brackets, instead of the other way around?

    Honestly I think it would make some sense, but it would nevertheless be confusing, for we all remember that () does capture, just because we know that's what they also do in Perl 5.

      That's pretty much what I meant, yes. There are major changes in Perl 6's rules anyway (like square brackets actually, which go from character class to grouping), so expecting metacharacters to have the same meaning might lead to some trouble. But as stated in ambrus's answer, that's not only what Perl users would expect, it's what a lot of people using other languages would expect. Still, Perl 6's rules would be quite the surprise, because one would expect Perl of all languages to be compatible with PCRE. But actually, one of the main focuses of apocalypse 5 is to mend whatever went wrong with Perl regexes, and PCRE.

Re: Capturing parenthesis and grouping square brackets
by Tanktalus (Canon) on Jun 20, 2013 at 16:20 UTC

    BrowserUk: I don't know about Eily, but I type my regexes at 30-40wpm, I think that's close enough. Not that I'm entirely sure what your point is.

    hdb: I've seen more than one less-than-entirely-joking proposal to localise perl-the-language. Without breaking existing code :)

    rjt: Actual change, yes, has to be really good to justify breakages. But it doesn't have to be that good to simply provoke a discussion. Who knows, it might lead to a change that is good enough while not actually causing breakages. But to do that, people must listen to the pain points that any proposal is intended to address, not just the solution being proposed.

    And, really, that's the point behind my response. I don't think Eily's suggestion will, or even should be, taken at face value. But the desire for laziness is valid. And comments pointing out how this is actually the lazy way for others are good counterpoints. But to focus on the proposal while ignoring the pain seems to me to be natural yet backwards. Looking to improve the language, even if that's solely out of newbishness (and I'm not sure that's the case here or not), is always deserving of a ++. Well, unless, I suppose, they aren't open to learning why their proposal may actually be counterproductive, if, indeed, that turns out to be the case (such as with this one, I think). Though I, too, use (?:...) a lot, I still think () as capturing is probably better than the reverse.

      Thanks for the consideration. And while I'm not a Perl newbie, I just set foot in the Perl community, and this was probably my first meaningful attempt at questioning a language I used. So no matter how foolish, I won't think it was a mistake, and I'd do it again in a heartbeat if I ended up going back in time. I just haven't found a Time::Machine module to do that yet.

      I do agree with BrowserUk on the point that the added keystroke for typing [ is not so much trouble though. Even if it significantly changed my typing time, it would not matter as much when compared to all the development time that does not involve any code-writing. I'm afraid this whole discussion comes from my failed attempt at poking fun at myself by pretending to propose a syntax change for my sole comfort, but I did get the answers I was expecting on my other points, so that's ok :)

      BrowserUk: I don't know about Eily, but I type my regexes at 30-40wpm, I think that's close enough.

      Sorry, but I do not believe you. Unless you make a frequent habit of copy-typing existing code.

      If the regex is so simple that you can mentally construct it at 30wpm(1), then it is so short that the sum of the sub-second differences between typing Alt-Gr shifted characters and non-shifted characters will be less than the time taken to find the position in the code to type it.

      And if the regex is long enough for those sub-second differences to add up to anything substantial, no human being can formulate the regex in their head fast enough to supply their fingers at 30wpm. Nobody!

      Indeed, the same can be said (and proven) for all programming. Touch typist programmers may type short bursts(2) of 30-40wpm; but then they have to stop and think.

      Their actual rate averages out to something like 10-20 lines per day.

      Most of a programmer's time is spent editing what exists, not producing new code; and in both cases, deciding what to type and where is such a huge proportion, 95-98%, of the total time that the speed at which the actual tokens are entered is totally inconsequential.

      (1: What constitutes a 'word' in a regex anyway?)

      (2: And in (informal) testing of 5 experienced programmers who were also touch typists, whilst they generated lines of code in short bursts of speed (mostly less than 20 tokens/second), they also spent far longer than non-touch-typists going back and deleting/re-typing chunks, lines and whole sections as they realised they had used a bad name for a variable; or used a for loop where a while loop was better; or typed themselves into an algorithmic dead end.)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Haskell vs Perl 6 (was Re: Capturing parenthesis and grouping square brackets)
by raiph (Deacon) on Jun 28, 2013 at 20:39 UTC
    I've moved this comment to top level from here so code formatting is sane.


    Jenda said:

    No, I don't think Perl5ers would find the {P6 example below} most approachable. ... I would find the Haskell version much easier to explain.

    I've explained the P6 code below. Would you (or someone else) be willing to explain the Haskell code to me or P5ers who don't know Haskell?


    Problem definition: Write a routine that will compare the leaves ("fringe") of two binary trees to determine whether they are the same list of leaves when visited left-to-right.

    Solutions:

    === haskell ===
    data Tree a = Leaf a | Node (Tree a) (Tree a)
        deriving (Show, Eq)

    fringe :: Tree a -> [a]
    fringe (Leaf x) = [x]
    fringe (Node n1 n2) = fringe n1 ++ fringe n2

    sameFringe :: (Eq a) => Tree a -> Tree a -> Bool
    sameFringe t1 t2 = fringe t1 == fringe t2

    === Perl 6 ===
    sub fringe ($tree) {
        multi sub fringey (Pair $node) { fringey $_ for $node.kv }
        multi sub fringey ( Any $leaf) { take $leaf }

        (gather fringey $tree), Cool;
    }

    sub samefringe ($a, $b) { all fringe($a) Z=== fringe($b) }


    And now the P6 code again with my explanatory comments:

    sub fringe ($tree) {                # $tree is aliased to $a passed
                                        # by the fringe($a) call below.
        multi sub fringey (Pair $node)  # multi sub means other sub defs
                                        # will use the same sub name.
                                        # Pair is a builtin type.
                                        # It's like a one element hash.
                                        # $node is "type constrained" -
                                        # it may only contain a Pair.
        { fringey $_ for $node.kv }     # .kv is a method that returns
                                        # (key, value) of a Pair.
                                        # fringey is called twice, with
                                        # key as arg then value as arg.
                                        # If arg is a Pair, call this
                                        # fringey recursively. If not,
                                        # call sub fringey(Any $leaf).
        multi sub fringey (Any $leaf)   # Each multi sub def must have a
                                        # different signature (params).
                                        # This def's param has Any type.
        { take $leaf }                  # take yields a value that is
                                        # added to a gather'd list.
        (gather fringey $tree), Cool;   # This gather lazily gathers a
                                        # list yielded by take $leaf's.
                                        # Calls to fringe return a list.
                                        # Cool is a flavor of undef.
    }
    sub samefringe ($a, $b)             # $a and $b are binary trees
                                        # built from Pairs eg:
                                        # $a = 1=>((2=>3)=>(4=>5));
                                        # $b = 1=>2=>(3=>4)=>5;
                                        # $c = 1=>2=>(4=>3)=>5;
                                        # samefringe($a,$b) # True
                                        # samefringe($a,$c) # False
    {
        all                             # Builtin "all" returns True if
                                        # all following items are True.
                                        # Parallelizes & short-circuits.
        fringe($a) Z===                 # === returns True if its LHS
                                        # and RHS are the same value.
                                        # Z is the zipwith metaop.
                                        # Z=== does === between each of
                                        # the items in the LHS list with
                                        # each of those in the RHS list.
        fringe($b)
    }

      Would you (or someone else) be willing to explain the Haskell code to me or P5ers who don't know Haskell?

      With license:

      === haskell ===
      --data type Tree is
      --either
      --    a Leaf containing a single item (of type a)
      -- or
      --    a Node containing two subtrees (that can have leaves of type a)
      data Tree a = Leaf a | Node (Tree a) (Tree a)
      -- and it inherits 'methods' / 'roles' from types Show and Eq,
          deriving (Show, Eq)

      -- Fringe is a function that extracts a list of type a from a tree containing type a
      fringe :: Tree a -> [a]
      -- If its argument is of type leaf
      -- it returns a single element list containing the element held in the leaf
      fringe (Leaf x) = [x]
      -- If its argument is a Node
      -- it returns the list returned by the left subtree
      -- concatenated to the list returned by the right subtree
      fringe (Node n1 n2) = fringe n1 ++ fringe n2

      -- sameFringe takes two Trees as its arguments
      -- and uses the == method inherited/included from the Eq role
      -- to compare the lists returned from the two trees for equality
      -- returning true if they are the same.
      sameFringe :: (Eq a) => Tree a -> Tree a -> Bool
      sameFringe t1 t2 = fringe t1 == fringe t2

      Personally, I find the Perl6 far easier to read. I'd be very happy to use Perl6 if it had threading and ran at perl5 speed.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I'd add a note saying something like "While the code looks like the two trees are flattened to lists first and then compared, Haskell is lazy, which means that behind the scenes it will only do as much work as it has to and stop flattening once it finds the first mismatch. Simply don't worry about it. It works."

        I've read both codes and both explanations (the fact that the one for Haskell is quite a bit shorter is telling. On a Perl forum.) and still think the Haskell code is much clearer and cleaner.

        A better explanation of the Perl6 code might help somewhat. "take yields a value that is added to a gather'd list." Humpf?!? "Cool is a flavor of undef." Whatafuck?

        Heck the all fringe($a) Z=== fringe($b) alone is crazy enough to require two pages of explanation.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

Re: Capturing parenthesis and grouping square brackets
by sundialsvc4 (Abbot) on Jun 18, 2013 at 12:29 UTC

    I admit to laughing, quite openly, when I read things (in the linked article) such as ... “there are things about regex culture that need breaking.”   That is truly spoken like someone who doesn’t have a million lines of in-service legacy code to maintain.   “Yee, hah, let’s go changing the fundamental meaning of the language to make it (according to me, and the rest of you are wrong) –er.”   If you think I’m going to approve a budget-dime for that, or lobby for it to be approved, you’re nuts.

    That’s why Perl-6 is still, stillborn.   Because a programming language, really, is quite a small thing compared to the vast amount of in-service application and CPAN-library code that is out there.   (The market value of which, I think, rests easily in the billions of dollars.)   There is no evolutionary transition-plan here; nor does one appear to be possible, let alone economical, let alone particularly beneficial.   It would be very nice (may-be) if it were otherwise, but it isn’t.   The syntax may be “ugly,” but there are by-now millions of examples of it in $$$ervice.

      Edited to clarify. Added another couple examples of support for transition.

      TL;DR Perl 6 accords Perl 5 special status, with many elements specifically designed to make P5/P6 interop straightforward. And, for those who want it, transition.

      There is no evolutionary transition-plan here; nor does one appear to be possible, let alone economical, let alone particularly beneficial.

      Imo the ways in which P6 design and implementation bridges with P5 demonstrate great care about interop and evolutionary transition at both a big picture and tiny detail level.

      First, sticking to just regex:

      It's not just regex either:

      Please visit the IRC channel #perl6 on freenode and help improve it by constructively discussing things you think will improve transition or interop between P5 and P6. Thanks go to Larry, jnthn, and all who try to help. Illegitimi non carborundum! :)

        "Perl 6 match objects"? OK, not that I think Perl6 actually matters, but still ... having experience with the insanely overcomplicated Match/Matches/etc./etc./etc./etc. objects of the .Net regular expressions ... this sounds scary. So I asked uncle Google, ended up in per6-regex-intro and ... I did not like what I saw. OK. So the matches are in the object named $/ (OK, so $/ meant something in Perl5, let's change that!) and you can access them as $/[0], $/[1], ... (hey, wasn't we supposed to use @ for arrays. Oh, right, this is an object, you just index it as if it was an array, but ...) ... and there are shortcuts in the form $0, $1, $2, ...

        WHAT?!? Yeah, the shortcuts start with $0! Yeah, what used to be $1 is now $0, what used to be $2 is now $1 etc. Just lovely! Imagine you got yourself into the situation when you have to write Perl6 code while still maintaining Perl5 code. I'd love to have a penny for every full hour lost because of the confusion caused by this change! Even with the number of people using Perl6 now, I'm sure it'd pay a few beers.

        The more I know about Perl6, the more I hope it's gonna disappear almost without a trace as a failed experiment. It uses the name, it looks deceptively similar, yet it is full of changes carefully designed to cause as much confusion as possible.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

        Hooray! Keep telling people how good it's gonna be when one of these projects finally gets finished!! Everybody is on the edge of their seats, except I was holding my breath but then I turned blue! So other than Perl-6 causing cyanosis it sounds really good!