Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Using Look-ahead and Look-behind

by Roy Johnson (Monsignor)
on Dec 21, 2005 at 21:57 UTC ( [id://518444]=perltutorial: print w/replies, xml ) Need Help??

If you are familiar with Perl's regular expressions, you are probably already familiar with zero-width assertions: the ^ indicating the beginning of string and the \b indicating a word boundary are examples. They do not match any characters, but "look around" to see what comes before and/or after the current position.

With the look-ahead and look-behind constructs documented in perlre.html#Extended-Patterns, you can "roll your own" zero-width assertions to fit your needs. You can look forward or backward in the string being processed, and you can require that a pattern match succeed (positive assertion) or fail (negative assertion) there.

Syntax

Every extended pattern is written as a parenthetical group with a question mark as the first character. The notation for the look-arounds is fairly mnemonic, but there are some other, experimental patterns that are similar, so it is important to get all the characters in the right order.
(?=pattern)
is a positive look-ahead assertion
(?!pattern)
is a negative look-ahead assertion
(?<=pattern)
is a positive look-behind assertion
(?<!pattern)
is a negative look-behind assertion
Notice that the = or ! is always last. The directional indicator is only present in the look-behind, and comes before the positive-or-negative indicator.

Common tasks

Finding the last occurrence

There are actually a number of ways to get the last occurrence that don't involve look-around, but if you think of "the last foo" as "foo that isn't followed by a string containing foo", you can express that notion like this:
/foo(?!.*foo)/
The regular expression engine will do its best to match .*foo, starting at the end of the string "foo". If it is able to match that, then the negative look-ahead will fail, which will force the engine to progress through the string to try the next foo.

Substituting before, after, or between characters

Many substitutions match a chunk of text and then replace part or all of it. You can often avoid that by using look-arounds. For example, if you want to put a comma after every foo:
s/(?<=foo)/,/g; # Without lookbehind: s/foo/foo,/g or s/(foo)/$1,/g
or to put the hyphen in look-ahead:
s/(?<=look)(?=ahead)/-/g;
This kind of thing is likely to be the bulk of what you use look-arounds for. It is important to remember that look-behind expressions cannot be of variable length. That means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.

Matching a pattern that doesn't include another pattern

You might want to capture everything between foo and bar that doesn't include baz. The technique is to have the regex engine look-ahead at every character to ensure that it isn't the beginning of the undesired pattern:
/foo # Match starting at foo ( # Capture (?: # Complex expression: (?!baz) # make sure we're not at the beginning of baz . # accept any character )* # any number of times ) # End capture bar # and ending at bar /x;

Nesting

You can put look-arounds inside of other look-arounds. This has been known to induce a flight response in certain readers (me, for example, the first time I saw it), but it's really not such a hard concept. A look-around sub-expression inherits a starting position from the enclosing expression, and can walk all around relative to that position without affecting the position of the enclosing expression. They all have independent (though initially inherited) bookkeeping for where they are in the string.

The concept is pretty simple, but the notation becomes hairy very quickly, so commented regular expressions are recommended. Let's look at the real example of Regex to add space after punctuation sign. The poster wants to put a space after any comma (punctuation, actually, but for simplicity, let's say comma) that is not nestled between two digits. Building up the s/// expression:

s/(?<=, # after a comma, (?! # but not matching (?<=\d,) # digit-comma before, AND (?=\d) # digit afterward ) )/ /gx; # substitute a space
Note that multiple lookarounds can be used to enforce multiple conditions at the same place, like an AND condition that complements the alternation (vertical bar)'s OR. In fact, you can use Boolean algebra ( NOT (a AND b) === (NOT a OR NOT b) ) to convert the expression to use OR:
s/(?<=, # after a comma, but either (?: (?<!\d,) # not matching digit-comma before | # OR (?!\d) # not matching digit afterward ) )/ /gx; # substitute a space

Capturing

It is sometimes useful to use capturing parentheses within a look-around. You might think that you wouldn't be able to do that, since you're just browsing, but you can. But remember: the capturing parentheses must be within the look-around expression; from the enclosing expression's point of view, no actual matching was done by the zero-width look-around.

This is most useful for finding overlapping matches in a global pattern match. You can capture substrings without consuming them, so they are available for further matching later. Probably the simplest example is to get all right-substrings of a string:

print "$1\n" while /(?=(.*))/g;
Note that the pattern technically consumes no characters at all, but Perl knows to advance a character on an empty match, to prevent infinite looping.

Replies are listed 'Best First'.
Re: Using Look-ahead and Look-behind
by jds17 (Pilgrim) on May 07, 2009 at 16:13 UTC
    Thank you for your very nice article, I certainly learned some new tricks!

    Just one little comment: The code in the last paragraph did not work because by default regular expressions are greedy. (Did this change with the Perl versions in between?) The only right-substring that comes out is the full string:

    $_ = "Hello"; print "$1\n" while /(?=(.*))/g;
    Output:
    Hello
    Making the "(.*)" part non-greedy fixes it (in Perl 5.10):
    $_ = "Hello"; print "$1\n" while /(?=(.*)?)/g;
    Output:
    Hello ello llo lo o
      I think you have found a bug in 5.10's regex handling. The lookahead's greediness or non-greediness should not matter, because it does not consume any characters. When used in a global match, patterns that do not consume characters should advance one character on each match. At least that's how I read the documentation.

      The really interesting thing about your version is that you didn't make the capture non-greedy, you made it optional. You probably meant (.*?), which (in pre-5.10) will output empty strings every time. I haven't installed 5.10 myself, so I can't play with it right now.


      Caution: Contents may have been coded under pressure.
        ...which (in pre-5.10) will output empty strings every time.

        With 5.10.0, /(?=(.*?))/g; outputs one empty string.  And I can confirm the behavior reported by jds17 with /(?=(.*))/g.

        You are right, my change did not affect greediness. The bad thing is: now I don't understand why my proposed solution worked at all. Maybe someone can explain? I don't think the question is too important, but I like to use regular expressions and it bugs me a little if I cannot understand one (especially such a tiny one).

        I have read the documentation you have cited and it helped, so I played around some more and tried out the following, which only exchanges "+" for "*" in your original expression, really works as one would think and therefore would be my preferred solution, at least for Perl 5.10:

        $_ = "Hello"; print "$1\n" while /(?=(.+))/g;
        Output:
        Hello ello llo lo o
Re: Using Look-ahead and Look-behind
by Anonymous Monk on Apr 11, 2007 at 07:25 UTC
    Great! This is exactly what I was looking for. Thank you very much!
Re: Using Look-ahead and Look-behind
by Anonymous Monk on Jun 25, 2011 at 07:49 UTC
    The following is just not working. Basically, i want to match a value that has "equity",but NOT "private equity". The result must be items 1, 2, 4, 5. Please check this out:
    my %hash = ( 1 => 'equity, private equity', 2 => 'equity', 3 => 'private equity', 4 => 'private equity,equity', 5 => 'private equity, equity', 6 => 'equity,private equity', 7 => 'private equity', 8 => 'mutual funds', 9 => 'cds' ); while (my ($k, $v) = each %hash) { next unless $v =~ m/(?!private\s+)equity/; printf("%d -> %s\n", $k, $v); }

      Hi, new questions go in Seekers Of Perl Wisdom because

      Roy Johnson, whom you asked a question, hasn't been here in 6 weeks.

      You used code tags and put your code in between, that is awesome :)

      Welcome, see How do I post a question effectively?, Where should I post X?

      The regex which is not working for you, contains A zero-width negative look-ahead assertion, and like perlre#(?!pattern) says

      A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.

      If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. Use look-behind instead (see below).

      So, use a look-behind

      But, that probably won't work either, because you can't have variable length length lookbehind , so you need to use a fixed width lookbehind.

      #!/usr/bin/perl -- use strict; use warnings; use Test::More qw' no_plan '; Main(@ARGV); exit(0); sub Main { my @yesWant = ( 'equity, private equity', 'equity', 'private equity,equity', 'private equity, equity', 'equity,private equity', ); my @notWant = ( 'private equity', 'private equity', 'mutual funds', 'cds', ); for my $not ( @notWant ){ ok( (not TestEquity($not)), "not '$not'" ); } for my $yes ( @yesWant ){ ok( TestEquity($yes), "yes '$yes'" ); } } sub TestEquity { return 1 if $_[0] =~ m/(?<!private\s)equity/; return 0; } __END__ $ prove -v pm911357.lookbehind.pl pm911357.lookbehind.pl .. ok 1 - not 'private equity' ok 2 - not 'private equity' ok 3 - not 'mutual funds' ok 4 - not 'cds' ok 5 - yes 'equity, private equity' ok 6 - yes 'equity' ok 7 - yes 'private equity,equity' ok 8 - yes 'private equity, equity' ok 9 - yes 'equity,private equity' 1..9 ok All tests successful. Files=1, Tests=9, 0 wallclock secs ( 0.06 usr + 0.01 sys = 0.08 CPU +) Result: PASS

      If fixed width lookbehind doesn't work for you, simply do TWO tests

        Here's a solution that exactly matches the phrases specified in AnonyMonk's Re: Using Look-ahead and Look-behind post (which the code of Re^2: Using Look-ahead and Look-behind does not quite do), and also shows how to use the newfangled backtracking control verbs of 5.10 to emulate variable-width negative look-behind. Variable-width positive look-behind is emulated by 5.10's  \K assertion.

        Explanation:

        • Any 'equity' that is preceded by
          • either a character that is not a comma or whitespace, or
          • by the 'private' phrase
          FAILS and is skipped over (this test has first precedence);
        • Otherwise, any 'equity' that is not followed by a comma that is then followed by any non-whitespace SUCCEEDS.

        >perl -wMstrict -le "use Test::More 'no_plan'; ;; for my $ar_vector ( [ YES => 'equity, private equity', ], [ YES => 'equity', ], [ no => 'private equity', ], [ YES => 'private equity,equity', ], [ YES => 'private equity, equity', ], [ no => 'equity,private equity', ], [ no => 'private equity', ], [ no => 'mutual funds', ], [ no => 'cds' ], ) { my ($expected, $string) = @$ar_vector; is match($string), $expected, qq{'$string'}; } ;; sub match { my ($string) = @_; ;; my $char_not_comma_or_space = qr{ [^,\s] }xms; my $private = qr{ private \s+ }xms; return 'YES' if $string =~ m{ (?: $char_not_comma_or_space | $private) equity (*SKIP)(*FAIL) | equity (?! , \S) }xms; return 'no', } " ok 1 - 'equity, private equity' ok 2 - 'equity' ok 3 - 'private equity' ok 4 - 'private equity,equity' ok 5 - 'private equity, equity' ok 6 - 'equity,private equity' ok 7 - 'private equity' ok 8 - 'mutual funds' ok 9 - 'cds' 1..9
        Nice. Very nice! You nailed. It's working. Thanks a bunch!

        I changed the sub TestEquity to allow for any text between Private and Equity, but I can't get it to work. What have I done wrong?

        sub TestEquity { return 1 if $_[0] =~ m/(?<!private).*equity/; return 0; }
Re: Using Look-ahead and Look-behind
by narainhere (Monk) on Oct 17, 2007 at 12:51 UTC
    Thanks a lot DUDEEEEEEE........ Solved my problems ++ Roy Johnson

    The world is so big for any individual to conquer

Re: Using Look-ahead and Look-behind
by joewong (Initiate) on Nov 12, 2007 at 03:34 UTC
    Referring to the paragraph "Matching a pattern that doesn't include another pattern", I wonder why ?: is necessary. It seems to be working for me even without ?:. Please explain. thanks.
      Generally, the decision about using ?: is not about whether it's necessary, but that capturing whatever you're grouping isn't necessary. The ?: modifier makes parentheses not capture, which is somewhat more efficient and might make the task of counting left parentheses less onerous.

      By the way, it's not a lookaround feature.


      Caution: Contents may have been coded under pressure.
Re: Using Look-ahead and Look-behind
by aaaone (Initiate) on Jul 18, 2008 at 13:12 UTC
    Great article :) Thank you!
Re: Using Look-ahead and Look-behind
by greengaroo (Hermit) on Feb 05, 2013 at 15:08 UTC

    Thank you!

    Testing never proves the absence of faults, it only shows their presence.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perltutorial [id://518444]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2024-03-19 08:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found