![]() |
|
P is for Practical | |
PerlMonks |
Precise backreference semantics in Perl regular expressionsby Gro-Tsen (Initiate) |
on Nov 09, 2023 at 15:00 UTC ( [id://11155507]=perlquestion: print w/replies, xml ) | Need Help?? |
Gro-Tsen has asked for the wisdom of the Perl Monks concerning the following question: Greetings enlightened Monks, Main part of the question: Perl thinks that aabadcaa matches the regexp /^((a*)b|a*b?d)*c\2$/ but aababdcaa doesn't. Is this normal? GNU grep (whether with -E or -P syntax) thinks they both match; so does pcregrep; so does Python's regexp engine. Since Perl seems pretty much alone in thinking that aababdcaa doesn't match /^((a*)b|a*b?d)*c\2$/ (actually, JavaScript agrees, but for a completely different reason, and indeed JavaScript thinks aabadcaa doesn't match either), I wonder if Perl's choice in this matter is intentional, whether it's an undocumented quirk (as in “you shouldn't write this sort of regexps anyway”) or whether it's a bug of sorts. More generally, is there documentation somewhere (other than the source itself) as to what backreferences in Perl regexps match and behave under backtracking, and how/why/where it differs from other regexp engines (GNU grep, Python, JavaScript and so on)? Comment: The way I understand regexps, when reading either aabadcaa or aababdcaa, the first part aab is matched by the left (a*)b part of the disjunction, while what comes after, be it ad or abd can only be matched by the right a*b?d part of the same disjunction, so \2 can only have the value aa and both strings should match. I think this is the logic that grep, pcregrep and Python follow. Specifically, even if you start by matching the middle ab against (a*)b, the ensuing d forces you to backtrack, and it seems that backtracking should restore all matches to their previous states. Perl does not seem to do this, and my question is why (is this intentional or is it a quirk?) or if this is documented anywhere. Extra: With grep, pcregrep or Python, a regexp such as /^((a*)b|a*b)*c\2$/ matches (IIUC) words of the form an₁ban₂b⋯ankbcam where m equals one of the ni; with Perl, it matches only when m equals the last (nk). Assuming I understand correctly, and assuming this is intentional, is there a way to ask Perl to behave in the same way as grep/pcregrep/Python in this matter? (I.e., do a full backtracking and match if there is some way to choose the branches in every disjunct so that the entire regexp matches?)
Back to
Seekers of Perl Wisdom
|
|