Contributed by Roy Johnson
on Dec 02, 2004 at 15:25 UTC
Description: I want to use a regex to capture 3-6 characters not containing a run of three of the same character. So, given AAABCDE, it would match AABCDE; and given ABCDDD, it would match ABCDD.
The most natural solution is to use lookbehind, starting with the third character, to check that the last three characters are not all the same:
/..(?:(.)(?<!\1\1\1)){1,4}/
The problem with that is that Perl's regex engine assumes that any backreference is variable-length, and variable-length lookbehinds are not allowed.

Answer: How can I use backrefs in a lookbehind?
contributed by Roy Johnson

Use lookbehind to count back as many chars as you want, and at the front of it, put a lookahead to check your pattern:
/..               # match first two chars
 (?:(.)           # capture next char, then
   (?<=           # looking behind,
     (?!\1\1\1)   # don't allow a run of three
     ...)         # starting three chars back
 ){1,4}/x
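As a quick check, here is a minimal sketch that runs the pattern above against the two strings from the description. The helper name first_match and the variable name $no_triple are mine, not part of the original answer; the helper uses @- and @+ to pull out the matched text so the group numbering inside the pattern stays untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the answer above, unchanged apart from being stored in qr//.
my $no_triple = qr/
    ..                  # match first two chars
    (?:(.)              # capture next char, then
      (?<=              # looking behind,
        (?!\1\1\1)      # don't allow a run of three
        ...)            # starting three chars back
    ){1,4}
/x;

# Hypothetical helper (not from the original post):
# return the text of the first match, or undef if there is none.
sub first_match {
    my ($re, $str) = @_;
    return $str =~ $re ? substr($str, $-[0], $+[0] - $-[0]) : undef;
}

for my $str (qw/AAABCDE ABCDDD/) {
    my $m = first_match($no_triple, $str);
    print "$str => ", (defined $m ? $m : '(no match)'), "\n";
}
```

Tracing it by hand, this should report AABCDE for AAABCDE and ABCDD for ABCDDD, which is what the description asks for.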
This technique can also overcome some other variable-length lookbehind situations. For example, if you want to match "bar" that is preceded by "foo" somewhere in the preceding six characters:
/(?<=              # looking behind,
   (?=.{0,3}foo)   # look for a foo preceded by up to three chars
   .{6})           # starting six chars back
 bar/x             # then match bar
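To see what the {0,3} bound buys you, the sketch below compares the pattern above with a variant whose lookahead is left unbounded. The variant, the variable names, and the sample strings are my own additions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the text: "bar" with "foo" inside the six chars before it.
my $bounded = qr/(?<=(?=.{0,3}foo).{6})bar/;

# Contrast case (not from the original post): the lookahead is not limited,
# so it may find a "foo" that lies beyond the six-char window.
my $unbounded = qr/(?<=(?=.*foo).{6})bar/;

my @samples = ('foo123bar', '123foobar', 'foo1234bar', '123456barfoo', 'foobar');

for my $str (@samples) {
    printf "%-14s bounded: %-3s  unbounded: %s\n", $str,
        ($str =~ $bounded   ? 'yes' : 'no'),
        ($str =~ $unbounded ? 'yes' : 'no');
}
```

By my reading, the two patterns disagree only on 123456barfoo, where the unbounded lookahead picks up the foo that follows bar. Note also that foobar matches neither: the .{6} needs six characters to exist before bar, so a foo closer than that to the start of the string is not enough.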
The thing to remember is that the lookahead can see farther than the end of the lookbehind, so you need to explicitly limit it. You could use that feature to get a slightly different solution to the first problem:
/..               # match first two chars
 (?:
   (?<=           # looking behind,
     (?!(.)\1\1)  # don't allow a run of three
     ..)          # starting only two chars back
   .              # then match the next char
 ){1,4}/x
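A small sketch to confirm that the two formulations agree, at least on a few sample strings. The variable names, the helper matched_text, and the third sample string are assumptions of mine rather than part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First solution: capture the char, then look back three chars.
my $look_back_three = qr/..(?:(.)(?<=(?!\1\1\1)...)){1,4}/;

# Second solution: look back only two chars and let the lookahead
# reach forward over the char that is about to be matched.
my $look_back_two = qr/..(?:(?<=(?!(.)\1\1)..).){1,4}/;

# Hypothetical helper (not from the original post):
# return the text of the first match, or undef if there is none.
sub matched_text {
    my ($re, $str) = @_;
    return $str =~ $re ? substr($str, $-[0], $+[0] - $-[0]) : undef;
}

for my $str (qw/AAABCDE ABCDDD AABBBCCD/) {
    my $three = matched_text($look_back_three, $str);
    my $two   = matched_text($look_back_two,   $str);
    printf "%-9s three-back: %-7s two-back: %s\n", $str,
        (defined $three ? $three : '(none)'),
        (defined $two   ? $two   : '(none)');
}
```

On these inputs both patterns should give AABCDE, ABCDD, and AABB respectively. The second version checks the same three-character window; it simply tests the window before consuming its last character instead of after.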
Answer: How can I use backrefs in a lookbehind?
contributed by Ieronim

Only a small remark:
The idea of variable-length lookbehind is very good, but the given problem can be solved even without using lookbehinds at all:
#!/usr/bin/perl
use warnings;
use strict;
my $pat = qr{
    (               # 1: capture the whole substring
      (?:
        (.)         # 2: a character
        (?!\2\2)    # that does not start a run of three
      ){1,4}        # one to four such 'good' characters
      ..            # then any two more characters; 4 + 2 = 6
    )
}x;

foreach (qw/AABBCCDD AABBBCCD AAABBCCD AABBCCCD AAABCDE/) {
    print "$1\n" if /$pat/;
}
outputs
AABBCC
AABB
AABBCC
AABBCC
AABCDE