Variable-Width Lookbehind (hacked via recursion)

Warning: Since this uses recursion it is horribly inefficient and may easily blow up on longer strings. If you think you need this for variable-width lookbehind, then first think about how you might solve this with other techniques like lookahead, which is variable-width out of the box, or simply with multiple regular expressions. /Warning The following is presented as a curiosity as the result of the discussion here - thank you LanX and QM for providing the inspiration :-)

Zero-width Lookaround Assertions are incredibly useful, but unfortunately the lookbehind assertions (?<=pattern) and (?<!pattern) are restricted to fixed length lookbehinds, and sometimes you just really want to be able to say something like e.g. (?<=ab+.*)c. With the following technique, you can emulate these kinds of variable-width lookbehind assertions.

The basic principle is to use repeated, recursive lookbehind operations, each of which is fixed-width (either one or two characters, depending on the situation), thus "looping" backwards through the string character by character, and using zero-width lookahead assertions at each position, checking whether a match occurs or not.

The examples are a bit contrived, and are just intended to demonstrate this idea. One thing to notice is that, although you can match multiple targets in a single string via /g, there are extra capture groups that you need to ignore (I'd suggest using the long forms and named capture groups to avoid confusion). Another thing to note is that the following examples scan backwards all the way to the beginning of the string (or until a match is found), without anchoring the "recursive lookbehind" at the current match position. <update> Also, I just tested this across different versions of Perl, and it (currently) only works correctly on Perl v5.20 and above. </update>

This matches the pattern x\d+, but only when it is preceded by the pattern ab\d+ with the same \d+ part (hence the \d+(?!\d)).

my $re1 = qr{
    (?<target> x (?<digits> \d+ ) (?!\d) )
    (?= (?<lookback>
        (?<=
            (?! (?<match> ab \g{digits} (?!\d) ) ) .
            (?=(?&lookback)) .
        |
            (?=(?&match)) . .
        )
    ) )
}msx;
my $re1_short = qr
    / (x(\d+)(?!\d)) (?=((?<=(?!( ab\2(?!\d) )).(?=(?-2)).
        |(?=(?-1))..)))/sx;
[download]

This matches any duplicated characters, so the inverse of the original problem here.

my $re2 = qr{
    (?<char> . )
    (?= (?<lookback>
        (?<=
            (?! \g{char} ) .
            (?=(?&lookback)) .
        |
            (?= \g{char} ) . .
        )
    ) )
}msx;
my $re2_short = qr
    /(.)(?=((?<=(?!\1).(?=(?-1)).|(?=\1)..)))/sx;
[download]

If the string you're looking for doesn't depend on the match, you can write it in the following order; the following matches any x\d+ that is preceded by ab+c.

my $re3 = qr{
    (?= (?<lookback>
        (?<=
            (?! (?<match> ab+c ) )
            (?=(?&lookback)) .
        |
            (?=(?&match)) .
        )
    ) )
    (?<target> x \d+ )
}msx;
my $re3_short = qr
    /(?=((?<=(?!( ab+c ))(?=(?-2)).|(?=(?-1)).))) (x\d+) /sx;
[download]

Things get a little bit shorter for single characters; this matches any \d. that is preceded by an a (anywhere in the string before it, as I said above - otherwise you could just use (?<=a)).

my $re4 = qr{
    (?= (?<lookback>
        (?<=
            a
        |
            (?=(?&lookback)) .
        )
    ) )
    (?<target> \d . )
}msx;
my $re4_short = qr
    /(?=((?<= a |(?=(?-1)).))) (\d.) /sx;
[download]

Probably someone can find a way to improve on this even more :-)

Here's the code I used to test the above regexen:

#!/usr/bin/env perl
use warnings;
use strict;
use Test::More;

# regexen here...

for my $regex ($re1,$re1_short) {
    unlike "foo",            $regex;
    unlike "x5",             $regex;
    unlike "ab5",            $regex;
    unlike "ab5 x4",         $regex;
    unlike "x5 ab5",         $regex;
      like "ab5 x5",         $regex;
      like "ab5x5",          $regex;
      like "ab51 x51",       $regex;
      like "ab51 ab4 x5 x4", $regex;
    my @results;
    while ("ab1 ab32 x2 ab42 x3 ab3 ab4x4ab5x1x42x45" =~ /$regex/g)
        { push @results, $+{target} // $1 }
    is_deeply \@results, ["x4","x1","x42"];
}
for my $regex ($re2,$re2_short) {
    unlike "fo",       $regex;
    unlike "x5",       $regex;
    unlike "ab5",      $regex;
    unlike "ab5 x4",   $regex;
      like "foo",      $regex;
      like "x5 ab5",   $regex;
      like "ab5 x5",   $regex;
      like "ab55",     $regex;
      like "ab51 x51", $regex;
    my @results;
    while ("abcdefbdfbb" =~ /$regex/g)
        { push @results, $+{char} // $1 }
    is_deeply \@results, ["b","d","f","b","b"];
}
for my $regex ($re3,$re3_short) {
    unlike "foo",            $regex;
    unlike "x5",             $regex;
    unlike "ab5",            $regex;
    unlike "x5 abc5",        $regex;
    unlike "ab x4",          $regex;
      like "abc x4",         $regex;
      like "abc x5",         $regex;
      like "abbbcx5",        $regex;
      like "abbc51 x51",     $regex;
      like "abc51 ab x5 x4", $regex;
    my @results;
    while ("x2 abbbc x4 abc5 x1 x42" =~ /$regex/g)
        { push @results, $+{target} // $3 }
    is_deeply \@results, ["x4","x1","x42"];
}
for my $regex ($re4,$re4_short) {
    unlike "fo",        $regex;
    unlike "x5",        $regex;
    unlike "5ab",       $regex;
    unlike "x5 ab5",    $regex;
      like "ab5 x4",    $regex;
      like "x5 ab5 x2", $regex;
    my @results;
    while ("x2 a4 x3a55aaa1" =~ /$regex/g)
        { push @results, $+{target} // $2 }
    is_deeply \@results, ["4 ","3a","55"];
}

done_testing;
[download]

Minor edits for clarification.

Back to Meditations