Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re^4: Regex::Reverse tricky test cases

by Roy Johnson (Monsignor)
on May 17, 2005 at 14:03 UTC ( [id://457810]=note: print w/replies, xml ) Need Help??


in reply to Re^3: Regex::Reverse tricky test cases
in thread Regex::Reverse tricky test cases

Is that supposed to be a variable-length lookbehind (it lacks an =)? Perhaps ironically, I haven't put any handling for lookahead/behind in my program.

What I would need to do is to convert lookbehinds to lookaheads, and vice-versa. Other than that, the normal reversing would work. Ok, I've done that. Program says:

Before: (?-imsx:(?<=\d+)(A)\1) After: (?-imsx:(A)\1(?=\d+))

Caution: Contents may have been coded under pressure.

Replies are listed 'Best First'.
Re^5: Regex::Reverse tricky test cases
by dragonchild (Archbishop) on May 17, 2005 at 14:10 UTC
    You're right, I should have been more careful. The regex I was trying to write (if I'd actually tested it) should have been: (?-imsx: ((?<=\d+))(A)\1) - I forgot that (?<=) is non-capturing.

    Your reversal fails, in large part because I didn't give you any test input.

    my $string = "123A1234"; my $regex = qr/((?<=\d+))(A)\1(.*)/; my ($num, $letter, $leftover) = $string =~ $regex; ok( $num eq '123' ); ok( $letter eq 'A' ); ok( $leftover eq '4' ); ---------------------------------------------------------- my $rev_string = "4321A321"; my $rev_regex = qr/.../; # <-- what goes here?? my ($leftover, $letter, $num) = $rev_string =~ $rev_regex; ok( $num eq '123' ); ok( $letter eq 'A' ); ok( $leftover eq '4' );

    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
      Let me start with a description of the general backreference problem. Before that, let me inform you of the three classes of regex:
      1. reversable
      2. reversable with side-effects
      3. unreversable

      Backreference Basics

      A backreference is related to a capturing group: for every capturing group, there may be zero or more instances of its backreference; likewise for every backreference, there is only one capturing group it refers to. The regex (a(b))\1((c)d)\1\3\2 matches the text "ababcdabcdb". When this string is reversed, we have "bdcbadcbaba". We need a systematic reversal of the regex that matches the string in the same manner. (We don't need to worry about the subset/superset problem described here, but that will eventually be an issue to deal with.)

      Backreference Reversal

      Single Capturing Groups

      The regex (a)B+\1 reverses to (a)B+\1 -- simple enough. Once you add multiple instances of the backreference, though, you needn't worry about any but the last: (a)B+\1C+\1 reverses to (a)C+\1B+\1.

      Multiple Capturing Groups

      When you have multiple capturing groups, the biggest issue is the order in which the backreferences appear, and whether they are interspersed with capturing groups. Here are several regexes to compare and contrast:
      1. (a)(b)\1\2 (matching "abab")
      2. (a)(b)\2\1 (matching "abba")
      3. (a)\1(b)\2 (matching "aabb")
      4. (a)\1(b)\1 (matching "aaba")
      Regex 1 reverses to (b)(a)\1\2. Regex 2 reverses to (a)(b)\2\1 (a "perfect" reversal). Regex 3 reverses to (b)\1(a)\2. Regex 4 reverses to (a)(b)\1\1. The important distinction to make is: when does the original capturing group N change to reversed capturing group M? It's a mechanical process.

      Introduction of Quantifiers

      Quantifiers on a Capturing Group

      In the event that a capturing group has a quantifier outside it, special care must be taken. The regex ([abc])+X\1 matches strings such as "abcXc", "caXa", "bbabXb", etc. The regex might be confusing in that form, so before reversing it, it might help to visualize it as [abc]*([abc])X\1. This reverses to ([abc])X\1[abc]*.

      The regex ([abc])X\1+ reverses to ([abc])\1*X\1. This is a trivial case because \1+ is no different in purpose from \1*\1.

      Zero-Possible Quantifiers on Capturing Groups

      Snags are encountered with zero-possible quantifiers (meaning *, ?, or {0,n}). The first example is a quantifier on the capturing group that includes a possibility of zero repetitions. If a capturing group is "null" (meaning it never matched, not it matched no characters), its backreference causes the regex to fail immediately.

      The regex ([abc])*X\1 will match strings like "abcXc" and "cbXb", but will fail on the string "X". Therefore, if a regex has a zero-or-more-quantified capturing group and a backreference to that group, the "zero" part of the quantifier can become a "one" without damaging the sense of the regex. If, however, no backreference is found or that capturing group, no change is to be made! How does one make this change? Here is an example: ([abc])*X([def])*Y\2 becomes ([def])[def]*Y\1X(?:([abc])[abc]*)?.

      The main point of interest is the reversal of ([abc])* to (?: ([abc]) [abc]* )? (whitespace added for clarity). We can't require this to be matched (since it can match zero times and be successful), so the entire thing is wrapped in a non-capturing block with a zero-or-one quantifier. The inside is the standard reversal of the (P)+ pattern ((P)P*).

      Zero-Possible Quantifiers on Backreferences

      The other snag is a backreference that might never be matched, as in the regex ([ab])X\1?Y, which matches "aXY", "aXaY", "bXY", and "bXbY", but something like "aXbY" or "aXrY". In this case, the reversal of the regex requires the injection of a logical assertion. Here is the naive reversal using the logical assertion: Y([ab])?X(?(1)\1|[ab]). The regex reads as "match a Y, then optionally capture 'a' or 'b' to \1, then match an X, and then: if \1 exists (if subgroup 1 matched successfully, that is), match it, otherwise match an 'a' or 'b'."

      This works well, except for one important problem: when the string "YXa" is matched by this reversed regex... backreference 1 is null! This is certainly not the case with "aXY" being matched by the forward-working regex, in which backreference 1 holds "a". We have encountered a Class II regex, because in order to get the reversal to work properly, we must produce a side-effect inside our logical assertion: the creation of another backreference! If the capturing group failed to match, we should then capture the 'a' or 'b' that matches afterwards: Y([ab])?X(?(1)\1|([ab])). This is slightly different from Y([ab])?X((?(1)\1|[ab])): the first one captures to \2 if and only if capturing group 1 is not matched, whereas the second one captures to \2 no matter what.

      Logical Assertions in Regexes

      If a regex has a logical assertion in it, it is a non-trivial case. Consider (a)?X(?(1)b|c), which matches "aXb" or "Xc". To reverse this, and match "bXa" or "Xc", we could try bX(a)|cX, but that seems too much like cheating. It is, because if X were something complex (like something involving capturing groups), we could not merely repeat it twice in the regex like we have without incurring side-effects. We could opt, therefore, to cause a side-effect on our own terms: another capture group being created. (?:(b)|c)X(?(1)(a)) solves the problem, creating one additional capture group.

      Dealing with Side Effects

      If the side effects of the reversal are managed properly, they can then be handled internally by Perl, so that the end user does not notice them at all. In the case of reversing ([ab])X\1?Y, we end up with Y([ab])?X(?(1)\1|([ab])). Before the results of this regex are returned to the user, the reversal engine should be able to "smash" backreferences \1 and \2 together. This is not necessarily trivial.
      I'll write more later, but I have a lunch date. Stay tuned.

      Oh, and here's my answer to your regex reversal example. Assuming you meant your regex to be (?<=(\d+))(A)\1, it would reverse to (\d+)(A)(?=\1). If you end up having a (.*) at the end (as I saw in some of the nodes), I'd reverse it as (.*?)(\d+)(A)(?=\2). The reason for the non-greedy .* is because of the subset/superset rule which is described in the link mentioned previously in this node.


      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      This is a good example of what makes it difficult. In addition to showing me that I need to renumber my backreferences properly, I still get no match with the generated regex, qr/(.*)((?=\d+))(A)\2/, because you can't swap a lookaround for a real match.

      So lookarounds in conjunction with backreferences are what really make it difficult. (Incidentally, your suggested pattern is illegal, because the lookbehind is variable-length.)


      Caution: Contents may have been coded under pressure.
        In addition to showing me that I need to renumber my backreferences properly, I still get no match with the generated regex, qr/(.*)((?=\d+))(A)\2/, because you can't swap a lookaround for a real match.

        First, I think you want (.*?), not (.*).

        Second, the biggest problem with reversing against the string is the need to have the parsing be ( 4, 321, 'A' ), not ('', 4321, 'A' ) as will have to happen with the \d+. The issue is you need to know what the stuff in \2 is before you match ((?=\d+)).

        Also, positive-lookahead is not the direct reversal of negative-lookbehind. For one thing, it's a zero-width assertion that will fail as written because you can't have numbers and letters in the same position. I think it's better reversed as qr/(.*?)\2(A)((?=\d+))/, which makes no sense because \2 isn't populated when it needs to be.

        (Incidentally, your suggested pattern is illegal, because the lookbehind is variable-length.)

        Yes, it is. But, that's one of the main considerations you gave for wanting to have a regex reverser ...


        • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
        • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://457810]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (4)
As of 2024-03-28 18:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found