Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Greetings enlightened Monks,

Main part of the question: Perl thinks that aabadcaa matches the regexp /^((a*)b|a*b?d)*c\2$/ but aababdcaa doesn't. Is this normal?

GNU grep (whether with -E or -P syntax) thinks they both match; so does pcregrep; so does Python's regexp engine. Since Perl seems pretty much alone in thinking that aababdcaa doesn't match /^((a*)b|a*b?d)*c\2$/ (actually, JavaScript agrees, but for a completely different reason, and indeed JavaScript thinks aabadcaa doesn't match either), I wonder if Perl's choice in this matter is intentional, whether it's an undocumented quirk (as in “you shouldn't write this sort of regexps anyway”) or whether it's a bug of sorts.

More generally, is there documentation somewhere (other than the source itself) as to what backreferences in Perl regexps match and behave under backtracking, and how/why/where it differs from other regexp engines (GNU grep, Python, JavaScript and so on)?

Comment: The way I understand regexps, when reading either aabadcaa or aababdcaa, the first part aab is matched by the left (a*)b part of the disjunction, while what comes after, be it ad or abd can only be matched by the right a*b?d part of the same disjunction, so \2 can only have the value aa and both strings should match. I think this is the logic that grep, pcregrep and Python follow. Specifically, even if you start by matching the middle ab against (a*)b, the ensuing d forces you to backtrack, and it seems that backtracking should restore all matches to their previous states. Perl does not seem to do this, and my question is why (is this intentional or is it a quirk?) or if this is documented anywhere.

Extra: With grep, pcregrep or Python, a regexp such as /^((a*)b|a*b)*c\2$/ matches (IIUC) words of the form an₁ban₂b⋯ankbcam where m equals one of the ni; with Perl, it matches only when m equals the last (nk). Assuming I understand correctly, and assuming this is intentional, is there a way to ask Perl to behave in the same way as grep/pcregrep/Python in this matter? (I.e., do a full backtracking and match if there is some way to choose the branches in every disjunct so that the entire regexp matches?)


In reply to Precise backreference semantics in Perl regular expressions by Gro-Tsen

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (3)
As of 2024-05-29 04:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found