http://www.perlmonks.org?node_id=11155507

Gro-Tsen has asked for the wisdom of the Perl Monks concerning the following question:

Greetings enlightened Monks,

Main part of the question: Perl thinks that aabadcaa matches the regexp /^((a*)b|a*b?d)*c\2$/ but aababdcaa doesn't. Is this normal?

GNU grep (whether with -E or -P syntax) thinks they both match; so does pcregrep; so does Python's regexp engine. Since Perl seems pretty much alone in thinking that aababdcaa doesn't match /^((a*)b|a*b?d)*c\2$/ (actually, JavaScript agrees, but for a completely different reason, and indeed JavaScript thinks aabadcaa doesn't match either), I wonder if Perl's choice in this matter is intentional, whether it's an undocumented quirk (as in “you shouldn't write this sort of regexps anyway”) or whether it's a bug of sorts.

More generally, is there documentation somewhere (other than the source itself) as to what backreferences in Perl regexps match and behave under backtracking, and how/why/where it differs from other regexp engines (GNU grep, Python, JavaScript and so on)?

Comment: The way I understand regexps, when reading either aabadcaa or aababdcaa, the first part aab is matched by the left (a*)b part of the disjunction, while what comes after, be it ad or abd can only be matched by the right a*b?d part of the same disjunction, so \2 can only have the value aa and both strings should match. I think this is the logic that grep, pcregrep and Python follow. Specifically, even if you start by matching the middle ab against (a*)b, the ensuing d forces you to backtrack, and it seems that backtracking should restore all matches to their previous states. Perl does not seem to do this, and my question is why (is this intentional or is it a quirk?) or if this is documented anywhere.

Extra: With grep, pcregrep or Python, a regexp such as /^((a*)b|a*b)*c\2$/ matches (IIUC) words of the form an₁ban₂b⋯ankbcam where m equals one of the ni; with Perl, it matches only when m equals the last (nk). Assuming I understand correctly, and assuming this is intentional, is there a way to ask Perl to behave in the same way as grep/pcregrep/Python in this matter? (I.e., do a full backtracking and match if there is some way to choose the branches in every disjunct so that the entire regexp matches?)

Replies are listed 'Best First'.
Re: Precise backreference semantics in Perl regular expressions
by choroba (Cardinal) on Nov 09, 2023 at 15:21 UTC
    Seems weird. And what really smells suspicious is that if you switch the order of the alternatives, the second string suddenly matches the regex.
    for my $regexp (qr/^((a*)b|a*b?d)*c\2$/, qr/^(a*b?d|(a*)b)*c\2$/) { for my $string (qw( aabadcaa aababdcaa )) { say $string =~ /$regexp/ ? 'Yes' : 'No'; } }
    Output:
    Yes No Yes Yes

    If you want to know what Perl's regex engine does, not just guess, run the script with

    use re 'debug';

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Precise backreference semantics in Perl regular expressions
by hv (Prior) on Nov 09, 2023 at 16:44 UTC

    The main question is whether \2 should have a value at all when that alternate failed to match last time through the alternation.

    See the recent bug report Perl #21563. There is definite inconsistency in how backreferences are preserved or discarded when backtracking, but it is unclear what consistent behaviour we should be aiming at.

    Based on the expectation I (hvds) described in that issue ("... that we would retain only results from the last successful match, so that ((a)|(b))+ would never return captures of both a and b") with which demerphq agreed, neither of your strings should match - after we match the /a*b?d/ alternate, \2 from the other alternate should be unset.

    I hope that at some point we will settle on the principle of how it should behave, and document that clearly; I don't think we're there yet. I would not, however, expect to see extensive comparisons with other regexp engines in Perl's documentation: there are too many moving targets involved. I think I remember that PCRE documentation had more of such comparisons though.

Re: Precise backreference semantics in Perl regular expressions
by tybalt89 (Monsignor) on Nov 09, 2023 at 17:11 UTC

    My perl disagrees.

    #!/usr/bin/perl use strict; #use warnings; for my $regexp ( qr/^((a*)b|a*b?d)*c\2$/, qr/^(a*b?d|(a*)b)*c\2$/, ) { for my $string (qw( aabadcaa aababdcaa )) { print "\n$string\n$regexp\n"; print $string =~ /$regexp/ ? 'Yes' : 'No', "\n"; } } print <<END; This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-li +nux-thread-multi END

    Outputs:

    aabadcaa (?^:^((a*)b|a*b?d)*c\2$) No aababdcaa (?^:^((a*)b|a*b?d)*c\2$) No aabadcaa (?^:^(a*b?d|(a*)b)*c\2$) Yes aababdcaa (?^:^(a*b?d|(a*)b)*c\2$) Yes This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-li +nux-thread-multi

      Opinion...

      #!/usr/bin/perl use strict; #use warnings; for my $regexp ( qr/^((a*)b|a*b?d)*c\2$/, qr/^(a*b?d|(a*)b)*c\2$/, qr/^((a*)b|a*b?d())*c\2$/, # NOTE these two fixed to work the way qr/^(?|a*b?d()|(a*)b)*c\1$/, # I think they should work ) { for my $string (qw( aabadcaa aababdcaa )) { print "\n$string\n$regexp\n"; print $string =~ /$regexp/ ? 'Yes' : 'No', "\n"; } } print <<END; This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-li +nux-thread-multi END

      Outputs:

      aabadcaa (?^:^((a*)b|a*b?d)*c\2$) No aababdcaa (?^:^((a*)b|a*b?d)*c\2$) No aabadcaa (?^:^(a*b?d|(a*)b)*c\2$) Yes aababdcaa (?^:^(a*b?d|(a*)b)*c\2$) Yes aabadcaa (?^:^((a*)b|a*b?d())*c\2$) No aababdcaa (?^:^((a*)b|a*b?d())*c\2$) No aabadcaa (?^:^(?|a*b?d()|(a*)b)*c\1$) No aababdcaa (?^:^(?|a*b?d()|(a*)b)*c\1$) No This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-li +nux-thread-multi
Re: Precise backreference semantics in Perl regular expressions
by ikegami (Patriarch) on Nov 09, 2023 at 16:21 UTC

    (...)* doesn't make much sense. And as extension of that, neither does (?:...(...)...)*.