Using negative lookahead

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I want to create a regex that will identify a string surrounded by quotes, and remove the quotes. If the quote symbol appears within the string, the match should fail. The quotes can be either ' or ". Eventually they might be multi-character strings (e.g. ''). I'm not concerned at this point about recognizing escaped embedded quotes. This is slightly contrived .. I mostly want to understand why a negative lookahead isn't working the way I thought it would.

I sure would appreciate being shown what I'm misunderstanding.

#!/usr/bin/env perl
use warnings;
use strict;

my @cases = (
    q{'abc"def'},
    q{'abc'},
    q{"abc"},
    q{''},
    q{'abc'def'},               # Want this to fail matching
    q{'This shouldn't match'},  # Want this to fail matching
    q{"This isn't a problem"},
    q{"abc},
    q{abc"},
    q{abc},
    q{'abc"},
    q{'ab''},  # Want this to fail matching
);

strip_quotes($_) for @cases;

# If we can remove a matching pair of single or double quotes from 
# a string, without the quote symbol also appearing within the string,
# do so. Otherwise don't change the string.

sub strip_quotes {
    my $line = shift;
    print "\n$line\n";

    # NO NEGATIVE LOOKAHEAD
    # This works except it allows an embedded delimiter
    if ( $line =~ m{^         # anchor
                    (         # capture delimiter in pos 1
                        ["']  # delim is single or double quote
                    )
                    (.*)      # anything
                    \g1$}x    # finally, the delim
                ) {
        print " 1- Got a match: delimiter was {$1}, body was {$2}\n";
    }
    else {
        print " 1- No match.\n";
    }

    # ATTEMPTING NEGATIVE LOOKAHEAD
    # This should fail if the delimiter is found in non-terminal pos.
    if ( $line =~ m{^         # anchor
                    (         # capture delimiter in pos 1
                        ["']  # delim is single or double quote
                    )
                    (.*(?!\g1))  # neg lookahead for delim
                    \g1$}x    # finally, the delim
                ) {
        print " 2- Got a match: delimiter was {$1}, body was {$2}\n";
    }
    else {
        print " 2- No match.\n";
    }

}

Result:

'abc"def'
 1- Got a match: delimiter was {'}, body was {abc"def}
 2- No match.

'abc'
 1- Got a match: delimiter was {'}, body was {abc}
 2- No match.

"abc"
 1- Got a match: delimiter was {"}, body was {abc}
 2- No match.

''
 1- Got a match: delimiter was {'}, body was {}
 2- No match.

'abc'def'
 1- Got a match: delimiter was {'}, body was {abc'def}
 2- No match.

'This shouldn't match'
 1- Got a match: delimiter was {'}, body was {This shouldn't match}
 2- No match.

"This isn't a problem"
 1- Got a match: delimiter was {"}, body was {This isn't a problem}
 2- No match.

"abc
 1- No match.
 2- No match.

abc"
 1- No match.
 2- No match.

abc
 1- No match.
 2- No match.

'abc"
 1- No match.
 2- No match.

'ab''
 1- Got a match: delimiter was {'}, body was {ab'}
 2- No match.

Replies are listed 'Best First'.
Re: Using negative lookahead by tybalt89 (Monsignor) on Oct 19, 2017 at 01:44 UTC
#!/usr/bin/perl

# http://perlmonks.org/?node_id=1201640

use strict;
use warnings;

my @cases = (
    q{'abc"def'},
    q{'abc'},
    q{"abc"},
    q{''},
    q{'abc'def'},               # Want this to fail matching
    q{'This shouldn't match'},  # Want this to fail matching
    q{"This isn't a problem"},
    q{"abc},
    q{abc"},
    q{abc},
    q{'abc"},
    q{'ab''},  # Want this to fail matching
);

strip_quotes($_) for @cases;

# If we can remove a matching pair of single or double quotes from
# a string, without the quote symbol also appearing within the string,
# do so. Otherwise don't change the string.

sub strip_quotes {
    my $line = shift;
    print "\n$line\n";

    # NO NEGATIVE LOOKAHEAD
    # This works except it allows an embedded delimiter
    if ( $line =~ m{^         # anchor
                    (         # capture delimiter in pos 1
                        ["']  # delim is single or double quote
                    )
                    (.*)      # anything
                    \g1$}x    # finally, the delim
                ) {
        print " 1- Got a match: delimiter was {$1}, body was {$2}\n";
    }
    else {
        print " 1- No match.\n";
    }

    # ATTEMPTING NEGATIVE LOOKAHEAD
    # This should fail if the delimiter is found in non-terminal pos.
    if ( $line =~ m{^         # anchor
                    (         # capture delimiter in pos 1
                        ["']  # delim is single or double quote
                    )
                    #(.*(?!\g1))  # neg lookahead for delim
                    ((?!.*\g1.).*)  # neg lookahead for delim
                    \g1$}x    # finally, the delim
                ) {
        print " 2- Got a match: delimiter was {$1}, body was {$2}\n";
    }
    else {
        print " 2- No match.\n";
    }

}
Re^2: Using negative lookahead by ibm1620 (Hermit) on Oct 19, 2017 at 22:57 UTC
I'm still having trouble grasping this. Let me try to restate what your solution is doing:

    ^
    (               # capture delimiter in pos 1
        ["']        # delim is single or double quote
    )
    ((?!.*\g1.).*)  # neg lookahead for delim
    \g1$            # finally, the delim

Pick up the delimiter character from pos 1, if there is one (otherwise fail).

Capture this in $2:
-- As many characters as possible (could be none)
-- that are NOT followed by the delimiter character and another character (which is the case of an embedded delimiter, which should be a fail)
-- and are followed by zero or more characters.

Beyond this should be the delimiter, at the end of the string.

I'm confused by the '.*' that's the last part of capture group $2. Why is it needed at all? Hasn't the preceding already consumed the payload that I want? I guess I'm not understanding the precise role that the negative lookahead is playing. Is it simply saying what the string picked up by .* must look like? Is there any significance to (?!.*\g1.) appearing before, rather than after, .* ?
Re^3: Using negative lookahead by tybalt89 (Monsignor) on Oct 20, 2017 at 00:17 UTC
Zero-width negative lookaheads do not advance the match point -> "zero-width".
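
A minimal sketch of what "zero-width" means in practice, assuming a hypothetical helper `pos_after` (not part of the thread) that reports where the match position ends up. It suggests why a lookahead on its own captures nothing, and why a consuming `.*` is still needed after it.

```perl
use strict;
use warnings;

# Hypothetical helper: anchor $re at the start of $str and return the
# offset where the match ended, or undef if it did not match.
sub pos_after {
    my ($str, $re) = @_;
    pos($str) = 0;
    return $str =~ /\G$re/gc ? pos($str) : undef;
}

my $s = q{'abc'};

# The lookahead only inspects the next character; the match ends right
# after the opening quote.
print pos_after($s, qr/'(?!')/), "\n";      # prints 1

# Only the .* that follows the lookahead consumes the rest of the string.
print pos_after($s, qr/'(?!').*/), "\n";    # prints 5
```

So a group such as `((?!...))` on its own would always capture the empty string; the trailing `.*` is what picks up the body once the lookahead has approved the position.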
Re^4: Using negative lookahead by ibm1620 (Hermit) on Oct 20, 2017 at 14:19 UTC
Re: Using negative lookahead by haukex (Archbishop) on Oct 19, 2017 at 08:27 UTC
The others have already given you some excellent answers. I just wanted to point out Regexp::Common::delimited as well as Text::Balanced, which contain functions that implement something like what you are doing. Also, this task sounds like something that one encounters when parsing formats for which parsers might already exist, a common one being Text::CSV. Also, personally I find it easier to write my regex tests as I showed in Re: How to ask better questions using Test::More and sample data.
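
As a rough illustration of the Test::More suggestion, the cases from the question can be written as a table of inputs and expected bodies. This is only a sketch: the pattern in `$re` is an assumption (the per-character lookahead form that appears elsewhere in this thread), and the table layout is hypothetical rather than taken from the linked node.

```perl
use strict;
use warnings;
use Test::More;

# Assumed pattern under test: each body character must not be the delimiter.
my $re = qr{^(["'])((?:(?!\g1).)*)\g1$};

# Each case: input string, then the expected body (undef = should not match).
my @cases = (
    [ q{'abc"def'},               q{abc"def} ],
    [ q{''},                      q{} ],
    [ q{'abc'def'},               undef ],
    [ q{'This shouldn't match'},  undef ],
    [ q{"This isn't a problem"},  q{This isn't a problem} ],
    [ q{'abc"},                   undef ],
);

for my $case (@cases) {
    my ($input, $want) = @$case;
    my $got = $input =~ $re ? $2 : undef;
    is $got, $want, "input: $input";
}

done_testing;
```

Adding a new edge case then means adding one row, and a failing row reports both the expected and the actual body.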
Re^2: Using negative lookahead by ibm1620 (Hermit) on Oct 19, 2017 at 23:13 UTC
Thank you for the pointer to Test::More. It's very timely since just today there's been a discussion at work of formalizing QA test cases. I have used Regexp::Common::delimited and Text::CSV. But in addition to simply wanting to understand negative lookahead, the problem I'm working on involves making a best-effort attempt to tokenize strings that don't conform to any single set of rules. (The strings are metadata declarations of the parameter lists that 100 or so different programs read and parse with their own idiosyncratic logic.) I'll be making guesses (programmatically) about what I'm encountering, and hoping to break the strings into meaningful units, with no expectation of 100% correctness. Also, the machine this is to run on is pretty locked down and the sysadmins are reluctant to install CPAN modules. You pick your battles.... :-(
Re: Using negative lookahead by LanX (Saint) on Oct 19, 2017 at 01:29 UTC
> I mostly want to understand why a negative lookahead isn't working the way I thought it would.

The negative lookahead `(.*(?!\g1))` says "Fail if I can match the delimiter". And since all your examples either end with a delimiter or never start with one, you don't get a single match.

I think ~~`(.*(?!\g1).)`~~ `((?!.*\g1.))` should fix it: "Fail if I can match the delimiter which is still followed by any other character".

edit: personally I would advise using a negated character class. (Update: see here)

update: that's what you want? (update: nope, it's still wrong, am too tired and tybalt89 already got it right :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
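
The reasoning above can be checked with a short sketch. In the original attempt the lookahead and the closing `\g1` both test the same position, one demanding "no delimiter here" and the other "delimiter here", so no input can satisfy both. The second pattern is assumed to be the form tybalt89 posted, with the lookahead moved in front of the body; the variable names are illustrative.

```perl
use strict;
use warnings;

# Original attempt: (?!\g1) and \g1 contradict each other at the same spot.
my $broken = qr{^(["'])(.*(?!\g1))\g1$};

# Lookahead first: reject if a delimiter occurs with any character after it,
# then let .* consume the body.
my $fixed = qr{^(["'])((?!.*\g1.).*)\g1$};

for my $case (q{'abc'}, q{''}, q{'abc'def'}, q{'ab''}, q{"This isn't a problem"}) {
    printf "%-24s broken=%-8s fixed=%s\n",
        $case,
        ($case =~ $broken ? 'match' : 'no match'),
        ($case =~ $fixed  ? 'match' : 'no match');
}
```

With these inputs the first pattern should never match, while the second should accept `'abc'`, `''` and `"This isn't a problem"` and reject the two strings with an embedded delimiter.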
Re^2: Using negative lookahead by ibm1620 (Hermit) on Oct 19, 2017 at 22:14 UTC
Thank you for "Fail if I can match the delimiter which is still followed by any other character". I would have used a negated character class except I wanted to maybe support multi-character quotes (e.g. two apostrophes, ''). But mostly I just wanted to understand the negative lookahead!
Re: Using negative lookahead by LanX (Saint) on Oct 19, 2017 at 16:44 UTC
Another take on it:

Unfortunately regexes don't support backreferences in character classes - like `[^\g1]` - to forbid the delimiter inside the string (at least I couldn't find it).

But it's possible to have the same effect with negative lookaheads:

    DB<34> $_='abbbbbbbbbba'
    DB<35> x m/^(.) ( (?!\1) . )* \1$/x
    0  'a'
    1  'b'

NB: a lookahead doesn't move the position, that's why it has to be moved with a `.`

And this approach seems to work in your code.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
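
A minimal sketch of how this per-character lookahead might be applied to the original task. The helper name `strip_delims` and the `''` alternative for a two-character quote are assumptions for illustration, not code from the thread. Note that in the debugger session above the repeated group only keeps its last character (`'b'`), so the sketch wraps a non-capturing group in an outer capture to get the whole body.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body if the string is wrapped in a matching
# pair of delimiters and the delimiter does not occur inside; otherwise undef.
sub strip_delims {
    my ($line) = @_;
    return $line =~ m{
        ^
        ( '' | ["'] )          # delimiter: two apostrophes tried first (assumed)
        ( (?: (?!\g1) . )* )   # body: no position may begin a delimiter
        \g1
        $
    }x ? $2 : undef;
}

for my $case (q{'abc"def'}, q{'abc'def'}, q{''abc''}, q{'ab''}) {
    my $body = strip_delims($case);
    print "$case => ", (defined $body ? "{$body}" : 'no match'), "\n";
}
```

Because the lookahead tests the whole of `\g1` at each position, the same body pattern works whether the captured delimiter is one character or several, which a negated character class could not do.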
Re^2: Using negative lookahead by ibm1620 (Hermit) on Oct 20, 2017 at 13:51 UTC
"Unfortunately regexes don't support backreferences in character classes - like `[^\g1]` - to forbid the delimiter inside the string. (at least I couldn't find it.)" I wouldn't think that would work in the general case, where `\g1` refers to more than one character, would it?
Re^3: Using negative lookahead by LanX (Saint) on Oct 20, 2017 at 14:01 UTC
Depends what you mean by the general case. Do you mean ...

... `[^\g1]` with \g1 more than one letter? This is hypothetical, since even one letter doesn't work.

... `( (?! \g1) .)` This would disallow a multi-character sequence if the match holds a word, i.e. like the word "not" being forbidden to follow.

... `( (?! \g1) (?! \g2) .)` Here chained lookaheads work like AND conditions. For single characters, this would be the equivalent of `[^\g1\g2]` (if it were possible).

You might be interested in this excellent tutorial: Using Look-ahead and Look-behind

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
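
A small sketch of the chained-lookahead idea, under an invented input format: two forbidden words are captured up front as `\g1` and `\g2`, and the rest of the string may contain neither. The `word,word:text` layout is purely an assumption to make the example self-contained.

```perl
use strict;
use warnings;

# Illustrative pattern: both lookaheads must pass at every position, so they
# act like an AND of two "does not start here" conditions.
my $re = qr{
    ^ (\w+) , (\w+) :           # the two forbidden words
    (?: (?!\g1) (?!\g2) . )*    # rest of the string, one character at a time
    $
}x;

for my $case ('not,nor:this is fine', 'not,nor:this is not ok', 'not,nor:head north') {
    print "$case => ", ($case =~ $re ? 'accepted' : 'rejected'), "\n";
}
```

The first string should be accepted; the second contains "not" and the third contains "nor" (inside "north"), so both should be rejected.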
Re^2: Using negative lookahead by ibm1620 (Hermit) on Oct 19, 2017 at 23:17 UTC
Actually this looks simpler than the earlier solution. And I get the bit about moving the position with '.'.
Re: Using negative lookahead by Anonymous Monk on Oct 20, 2017 at 15:46 UTC
Perhaps you could use a full-on parser, such as Parse::RecDescent ...

