http://www.perlmonks.org?node_id=225948

steves has asked for the wisdom of the Perl Monks concerning the following question:

Okay, just when I feel like a Perl-meister, I get stumped by this seemingly innocent regex match that uses a lookahead assertion:

use strict;
my $test = '<p> ONE <p> TWO features: <ul> THREE';
my ($first, $second, $third) = ($test =~ /(<p> )(.+(?!<p>))(features\: +<ul>)/i);
print "1: $first\n";
print "2: $second\n";
print "3: $third\n";

Output is:

1: <p>
2: ONE <p> TWO
3: features: <ul>

I was expecting only TWO for the second match since there's an intervening <P>. Some monk please humble me here.

Re: Lookahead assertion confusion
by Enlil (Parson) on Jan 10, 2003 at 21:08 UTC
    Let's break this down and see if you can't find the problem:
    /(<p>\ )               # #1
     (.+(?!<p>))           # #2
     (features\:\ <ul>)    # #3
    /ix

    1. the pattern must start with <p> followed by a space
    2. after which the pattern must grab as much as it can, provided that #3 is satisfied and that whatever .+ matches is not immediately followed by <p>.
    3. the pattern has to end with features: <ul>
    All of this is case-insensitive.

    So if you look at it once more, you realize that rule #2 is never broken: at the point where .+ stops, the next thing is "features", not <p>. What you are looking for is more along these lines:
    my ($first, $second, $third) = ($test =~ /(<p> ).+?<p>(.+)(features: <ul>)/i);
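As a quick check of the suggested pattern, here is a minimal sketch that runs it against the test string from the question. The square brackets in the output are only there to make leading and trailing spaces visible.

```perl
use strict;
use warnings;

my $test = '<p> ONE <p> TWO features: <ul> THREE';

# Skip lazily past the first paragraph, then capture what follows the second <p>.
my ($first, $second, $third) =
    $test =~ /(<p> ).+?<p>(.+)(features: <ul>)/i;

print "1: [$first]\n";
print "2: [$second]\n";
print "3: [$third]\n";
```

This prints `[<p> ]`, `[ TWO ]` and `[features: <ul>]`. The second capture keeps a leading space because the pattern consumes the second `<p>` without the space after it.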

    Update: being human, I associate significance with big round numbers in a, more often than not, strange superstitious manner, and thus I mark post 100.

    -enlil

Re: Lookahead assertion confusion
by sauoq (Abbot) on Jan 10, 2003 at 21:55 UTC

    In your case, you have a negative look-ahead assertion and you actually specify what must follow. It's kind of like saying "match 'foo' as long as 'foo' is followed by 'bar' and not followed by 'baz'." Well, if 'foo' is followed by 'bar' then it can't be followed by 'baz' so the assertion is useless. That's why your expression works exactly the same as it would without the assertion.

    A negative look-ahead assertion asserts that your expression isn't followed by a pattern. It does not prevent the pattern from being matched within the expression.
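One way to convince yourself of this is to run the original pattern with and without the assertion and compare the captures. This is only a sketch against the single test string from the question; the array names are made up for the illustration.

```perl
use strict;
use warnings;

my $test = '<p> ONE <p> TWO features: <ul> THREE';

# Same pattern twice: once with the negative lookahead, once without it.
my @with    = $test =~ /(<p> )(.+(?!<p>))(features: <ul>)/i;
my @without = $test =~ /(<p> )(.+)(features: <ul>)/i;

print "with:    [$_]\n" for @with;
print "without: [$_]\n" for @without;
print "same captures\n" if "@with" eq "@without";
```

Both runs capture `<p> `, `ONE <p> TWO ` and `features: <ul>`, so the assertion changes nothing here.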

    Consider this example of how not to use it:

    $_ = 'foobar'; /(.*)(?!bar)/; print "$1\n";
    That prints "foobar" because there is no "bar" following the string "foobar". Here is an example of how you might use it effectively:
    $_ = 'foobar'; /(.*o)(?!bar)/; print "$1\n";
    Notice the literal "o" I added. Now the expression only matches "fo" because the string "bar" does follow "foo". Perl first makes the match, then determines whether the match is good by looking at what immediately follows it. If it can assert that "bar" does not immediately follow, the match is good; otherwise it has to backtrack.
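To see what the assertion was actually looking at in each case, a small sketch like the following can help. It uses the built-in `@+` array to find where capture group 1 ended and prints the text that comes right after it; `$str` and `@report` are names invented for this example.

```perl
use strict;
use warnings;

my $str = 'foobar';
my @report;

for my $re (qr/(.*)(?!bar)/, qr/(.*o)(?!bar)/) {
    next unless $str =~ $re;
    # $+[1] is the offset just past the end of capture group 1.
    my $rest = substr($str, $+[1]);
    push @report, "captured [$1], followed by [$rest]";
}

print "$_\n" for @report;
```

The first pattern captures `foobar` with nothing after it, so the assertion trivially holds. The second captures `fo`, which is followed by `obar` rather than `bar`; the longer candidate `foo` was rejected because `bar` came right after it.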

    Edits: Minor typos fixed. Slight rewording.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Lookahead assertion confusion
by jdporter (Paladin) on Jan 10, 2003 at 21:07 UTC
    Well, since there isn't anything between the second paren group and the third in the regex, the lookahead is tested at exactly the spot where "features" has to begin, and "<p>" can never start there. So the assertion never rejects anything, and the second group will get everything -- including any "<p>", if present -- between the first "<p>" and the "features".

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: Lookahead assertion confusion
by thelenm (Vicar) on Jan 10, 2003 at 22:38 UTC
    The lookahead assertion (?!<p>) means that your string must not match <p> beginning at the point of the lookahead assertion. Basically, the lookahead assertion is having no effect, since it's impossible for a string to match both <p> and features beginning at the same point.

    -- Mike

    --
    just,my${.02}

Re: Lookahead assertion confusion
by steves (Curate) on Jan 10, 2003 at 22:13 UTC

    Thanks. I see the error of my ways. What I wanted was to capture the pieces as long as there were no intervening <p> tags in between. Stupid me was thinking the lookahead assertion checked if the expression was anywhere in between, whereas I had specified immediately following. Expanding the lookahead expression to a more comprehensive match was the key. I ended up using this regex to do the match:

    ($test =~ /(<p> )(?!.*<p>.*)(.+)(features\: <ul>)/i)
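For reference, a minimal sketch of what this regex captures on the original test string. The built-in `@-` array is used to show where in the string the first group matched; `$offset` is a name made up for the example.

```perl
use strict;
use warnings;

my $test = '<p> ONE <p> TWO features: <ul> THREE';

my ($first, $second, $third) =
    $test =~ /(<p> )(?!.*<p>.*)(.+)(features\: <ul>)/i;
my $offset = $-[1];   # where capture group 1 starts in $test

print "1: [$first] at offset $offset\n";
print "2: [$second]\n";
print "3: [$third]\n";
```

The second capture is now `TWO `, and the first group matches at offset 8, which is the second `<p>` in the string rather than the first.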
      Without having more test cases it is hard to know whether your regex will work as you intended. Sure, the one stated here works for this case, but it works just as well if you replace (?!.*<p>.*) with (?!.*<p>).

      But I think you are missing the point of the assertion, as $1 now captures the second <p> in your string, and not the first, as the original had intended. That might not be important in this case, but it might be for your understanding of how better to approach regex problems, in which case:

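      /<p>.*?(<p> )(.+)(features\: <ul>)/i
      or
      /[^(?:<p>)]+?(<p> )(.+)(features\: <ul>)/i
      works just as well. (Beware that [^(?:<p>)] is a character class excluding the single characters ( ? : < p > ), not the string <p>; it happens to be enough for this input.) Just because TMTOWTDI doesn't mean that you have to use every tool in the toolshed to get to the result when a simple screwdriver would have sufficed (pardon the expression).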

      -enlil

      Unfortunately that fails for '<p> ONE <p> TWO features: <ul> THREE <p>'.

      You can fix this by making the engine go one char at a time:
      /(<p> )((?:(?!<p>).)*)(features: <ul>)/i
      Hope I've helped,
      ihb
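A short sketch comparing the two approaches on the string with the trailing <p>. The array names are invented for the example; the two patterns are the ones quoted in this thread.

```perl
use strict;
use warnings;

my $test = '<p> ONE <p> TWO features: <ul> THREE <p>';

# Lookahead checked once, right after group 1: any later <p> kills the match.
my @whole = $test =~ /(<p> )(?!.*<p>.*)(.+)(features\: <ul>)/i;

# Lookahead re-checked before every character that group 2 consumes.
my @stepwise = $test =~ /(<p> )((?:(?!<p>).)*)(features: <ul>)/i;

print "whole-rest lookahead: ", (@whole    ? "[$whole[1]]"    : "no match"), "\n";
print "per-char lookahead:   ", (@stepwise ? "[$stepwise[1]]" : "no match"), "\n";
```

The first pattern reports no match, because every `<p> ` in the string has another `<p>` somewhere after it. The second captures `TWO `, since group 2 is only forbidden from running across a `<p>`, not from being followed by one later on.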

      Update:
      Just for fun:
      /(<p> )(?>(.*?)((<p>)|features: <ul>))(?(4)(?!))/i
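Since that one is dense, here is the same pattern spread out under /x with comments, as a sketch of how it appears to work. Literal spaces are written as `[ ]` because /x would otherwise ignore them; the `@got` array is a name made up for the example.

```perl
use strict;
use warnings;

my $test = '<p> ONE <p> TWO features: <ul> THREE <p>';

my @got = $test =~ m{
    (<p>[ ])                  # group 1: opening <p> plus a space
    (?>                       # atomic group: no backtracking into it afterwards
        (.*?)                 # group 2: as little text as possible ...
        (                     # group 3: ... up to whichever comes first:
            (<p>)             #   group 4: another <p>, the unwanted case
          | features:[ ]<ul>  #   or the terminator we actually want
        )
    )
    (?(4)(?!))                # if group 4 took part, (?!) forces a failure
}xi;

print "2: [$got[1]]\n" if @got;
```

On this string it prints `2: [TWO ]`. Starting from the first `<p> `, the lazy scan hits the second `<p>` before it reaches `features:`, group 4 is set, and the conditional fails the attempt; the atomic group stops the engine from retrying with a longer scan, so it moves on to the next starting position.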
Re: Lookahead assertion confusion
by dragonchild (Archbishop) on Jan 10, 2003 at 21:06 UTC
    You have all your parentheses as capturing ones. Try using (?:...) instead of (...) when you just want to group. :-)
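A minimal sketch of the difference, assuming only the middle piece is of interest; the `@captures` name is invented for the example.

```perl
use strict;
use warnings;

my $test = '<p> ONE <p> TWO features: <ul> THREE';

# Only the middle group captures; (?:...) groups without capturing.
my @captures = $test =~ /(?:<p> )(.+)(?:features: <ul>)/i;

print scalar(@captures), " capture: [$captures[0]]\n";
```

This prints `1 capture: [ONE <p> TWO ]`. The grouping alone does not change what is matched, only how many values come back.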

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.