Variable-Width Lookbehind (hacked via recursion)by haukex (Abbot)
|on Oct 24, 2017 at 17:43 UTC||Need Help??|
Warning: Since this uses recursion it is horribly inefficient and may easily blow up on longer strings. If you think you need this for variable-width lookbehind, then first think about how you might solve this with other techniques like lookahead, which is variable-width out of the box, or simply with multiple regular expressions. /Warning The following is presented as a curiosity as the result of the discussion here - thank you LanX and QM for providing the inspiration :-)
Zero-width Lookaround Assertions are incredibly useful, but unfortunately the lookbehind assertions (?<=pattern) and (?<!pattern) are restricted to fixed length lookbehinds, and sometimes you just really want to be able to say something like e.g. (?<=ab+.*)c. With the following technique, you can emulate these kinds of variable-width lookbehind assertions.
The basic principle is to use repeated, recursive lookbehind operations, each of which is fixed-width (either one or two characters, depending on the situation), thus "looping" backwards through the string character by character, and using zero-width lookahead assertions at each position, checking whether a match occurs or not.
The examples are a bit contrived, and are just intended to demonstrate this idea. One thing to notice is that, although you can match multiple targets in a single string via /g, there are extra capture groups that you need to ignore (I'd suggest using the long forms and named capture groups to avoid confusion). Another thing to note is that the following examples scan backwards all the way to the beginning of the string (or until a match is found), without anchoring the "recursive lookbehind" at the current match position. <update> Also, I just tested this across different versions of Perl, and it (currently) only works correctly on Perl v5.20 and above. </update>
This matches the pattern x\d+, but only when it is preceded by the pattern ab\d+ with the same \d+ part (hence the \d+(?!\d)).
This matches any duplicated characters, so the inverse of the original problem here.
If the string you're looking for doesn't depend on the match, you can write it in the following order; the following matches any x\d+ that is preceded by ab+c.
Things get a little bit shorter for single characters; this matches any \d. that is preceded by an a (anywhere in the string before it, as I said above - otherwise you could just use (?<=a)).
Probably someone can find a way to improve on this even more :-)
Here's the code I used to test the above regexen:
Minor edits for clarification.