
Using negative lookahead

by ibm1620 (Hermit)
on Oct 19, 2017 at 01:14 UTC ( [id://1201640]=perlquestion )

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I want to create a regex that will identify a string surrounded by quotes, and remove the quotes. If the quote symbol appears within the string, the match should fail. The quotes can be either ' or ". Eventually they might be multi-character strings (e.g. ''). I'm not concerned at this point about recognizing escaped embedded quotes. This is slightly contrived .. I mostly want to understand why a negative lookahead isn't working the way I thought it would.

I sure would appreciate being shown what I'm misunderstanding.

#!/usr/bin/env perl
use warnings;
use strict;

my @cases = (
    q{'abc"def'},
    q{'abc'},
    q{"abc"},
    q{''},
    q{'abc'def'},               # Want this to fail matching
    q{'This shouldn't match'},  # Want this to fail matching
    q{"This isn't a problem"},
    q{"abc},
    q{abc"},
    q{abc},
    q{'abc"},
    q{'ab''},                   # Want this to fail matching
);

strip_quotes($_) for @cases;

# If we can remove a matching pair of single or double quotes from
# a string, without the quote symbol also appearing within the string,
# do so. Otherwise don't change the string.
sub strip_quotes {
    my $line = shift;
    print "\n$line\n";

    # NO NEGATIVE LOOKAHEAD
    # This works except it allows an embedded delimiter
    if ( $line =~ m{^           # anchor
                    (           # capture delimiter in pos 1
                    ["']        # delim is single or double quote
                    )
                    (.*)        # anything
                    \g1$}x      # finally, the delim
       ) {
        print "  1- Got a match: delimiter was {$1}, body was {$2}\n";
    }
    else {
        print "  1- No match.\n";
    }

    # ATTEMPTING NEGATIVE LOOKAHEAD
    # This should fail if the delimiter is found in non-terminal pos.
    if ( $line =~ m{^             # anchor
                    (             # capture delimiter in pos 1
                    ["']          # delim is single or double quote
                    )
                    (.*(?!\g1))   # neg lookahead for delim
                    \g1$}x        # finally, the delim
       ) {
        print "  2- Got a match: delimiter was {$1}, body was {$2}\n";
    }
    else {
        print "  2- No match.\n";
    }
}
Result:
'abc"def'
  1- Got a match: delimiter was {'}, body was {abc"def}
  2- No match.

'abc'
  1- Got a match: delimiter was {'}, body was {abc}
  2- No match.

"abc"
  1- Got a match: delimiter was {"}, body was {abc}
  2- No match.

''
  1- Got a match: delimiter was {'}, body was {}
  2- No match.

'abc'def'
  1- Got a match: delimiter was {'}, body was {abc'def}
  2- No match.

'This shouldn't match'
  1- Got a match: delimiter was {'}, body was {This shouldn't match}
  2- No match.

"This isn't a problem"
  1- Got a match: delimiter was {"}, body was {This isn't a problem}
  2- No match.

"abc
  1- No match.
  2- No match.

abc"
  1- No match.
  2- No match.

abc
  1- No match.
  2- No match.

'abc"
  1- No match.
  2- No match.

'ab''
  1- Got a match: delimiter was {'}, body was {ab'}
  2- No match.

Replies are listed 'Best First'.
Re: Using negative lookahead
by tybalt89 (Monsignor) on Oct 19, 2017 at 01:44 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1201640
    use strict;
    use warnings;

    my @cases = (
        q{'abc"def'},
        q{'abc'},
        q{"abc"},
        q{''},
        q{'abc'def'},               # Want this to fail matching
        q{'This shouldn't match'},  # Want this to fail matching
        q{"This isn't a problem"},
        q{"abc},
        q{abc"},
        q{abc},
        q{'abc"},
        q{'ab''},                   # Want this to fail matching
    );

    strip_quotes($_) for @cases;

    # If we can remove a matching pair of single or double quotes from
    # a string, without the quote symbol also appearing within the string,
    # do so. Otherwise don't change the string.
    sub strip_quotes {
        my $line = shift;
        print "\n$line\n";

        # NO NEGATIVE LOOKAHEAD
        # This works except it allows an embedded delimiter
        if ( $line =~ m{^           # anchor
                        (           # capture delimiter in pos 1
                        ["']        # delim is single or double quote
                        )
                        (.*)        # anything
                        \g1$}x      # finally, the delim
           ) {
            print "  1- Got a match: delimiter was {$1}, body was {$2}\n";
        }
        else {
            print "  1- No match.\n";
        }

        # ATTEMPTING NEGATIVE LOOKAHEAD
        # This should fail if the delimiter is found in non-terminal pos.
        if ( $line =~ m{^               # anchor
                        (               # capture delimiter in pos 1
                        ["']            # delim is single or double quote
                        )
                        #(.*(?!\g1))    # neg lookahead for delim
                        ((?!.*\g1.).*)  # neg lookahead for delim
                        \g1$}x          # finally, the delim
           ) {
            print "  2- Got a match: delimiter was {$1}, body was {$2}\n";
        }
        else {
            print "  2- No match.\n";
        }
    }
      I'm still having trouble grasping this. Let me try and restate what your solution is doing:
      ^
      (               # capture delimiter in pos 1
      ["']            # delim is single or double quote
      )
      ((?!.*\g1.).*)  # neg lookahead for delim
      \g1$            # finally, the delim
      Pick up the delimiter character from pos 1, if there is one (otherwise fail)

      Capture this in $2:

      -- As many characters as possible (could be none)

      -- that are NOT followed by the delimiter character and another character (which is the case of an embedded delimiter, which should be a fail)

      -- and are followed by zero or more characters.

      Beyond this should be the delimiter, at the end of the string.

      I'm confused by the '.*' that's the last part of capture group $2. Why is it needed at all? Hasn't the preceding already consumed the payload that I want? I guess I'm not understanding the precise role that the negative lookahead is playing. Is it simply saying what the string picked up by .* must look like? Is there any significance to (?!.*\g1.) appearing *before*, rather than after, .* ?

        Zero-width negative lookaheads do not advance the match point -> "zero-width".
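To make the "zero-width" point concrete, here is a minimal sketch built around the pattern from the reply above. The helper names `consumed_by_lookahead` and `body_if_clean` are made up for illustration and are not part of the original code. The lookahead only inspects the rest of the string from the current position and consumes nothing, which is why the trailing `.*` is still needed to pick up the body.

```perl
use strict;
use warnings;

# Hypothetical helper: number of characters a successful bare
# lookahead consumes (it asserts, it does not advance).
sub consumed_by_lookahead {
    my ($string) = @_;
    return $string =~ /^(?!x)/ ? length($&) : undef;
}

# Hypothetical helper: returns the body between matching quotes, or
# undef when the string is not quoted or the delimiter recurs inside.
sub body_if_clean {
    my ($line) = @_;
    return $line =~ m{^ (["'])            # capture the delimiter
                        ( (?! .* \g1 . )  # assert: no delimiter with a character after it
                          .* )            # only now consume the body
                        \g1 $}x ? $2 : undef;
}

print "a bare lookahead consumed ", consumed_by_lookahead('abc'), " characters\n";

for my $case (q{'abc'}, q{'abc'def'}, q{"This isn't a problem"}) {
    my $body = body_if_clean($case);
    printf "%-24s => %s\n", $case, defined $body ? "{$body}" : 'no match';
}
```

Because the assertion sits before `.*`, it is checked once, at the start of the body, against everything that follows. Placed after `.*`, it would only look at whatever comes after the text `.*` has already consumed.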

Re: Using negative lookahead
by haukex (Archbishop) on Oct 19, 2017 at 08:27 UTC
      Thank you for the pointer to Test::More. It's very timely since just today there's been a discussion at work of formalizing QA test cases.

      I have used Regexp::Common::delimited and Text::CSV. But in addition to simply wanting to understand negative lookahead, the problem I'm working on involves making a best-effort attempt to tokenize strings that don't conform to any single set of rules. (The strings are metadata declarations of the parameter lists that 100 or so different programs read and parse with their own idiosyncratic logic.) I'll be making guesses (programmatically) about what I'm encountering, and hoping to break the strings into meaningful units, with no expectation of 100% correctness.

      Also, the machine this is to run on is pretty locked down and the sysadmins are reluctant to install CPAN modules. You pick your battles.... :-(

Re: Using negative lookahead
by LanX (Saint) on Oct 19, 2017 at 01:29 UTC
    > I mostly want to understand why a negative lookahead isn't working the way I thought it would.

    The negative lookahead (.*(?!\g1)) says

    • "Fail if I can match the delimiter".

    And since all your examples either end with a delimiter or never start with one, you don't get a single match.
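A small sketch of why the original pattern can never match: after `.*` stops at some position, `(?!\g1)` demands that the delimiter is not there, and the very next token `\g1` demands that it is. The helper `matches` and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $string matches $regex, else 0.
sub matches {
    my ($regex, $string) = @_;
    return $string =~ $regex ? 1 : 0;
}

# The pattern from the question: "no delimiter here" immediately
# followed by "delimiter here" cannot both hold at one position.
my $contradictory = qr{^(["'])(.*(?!\g1))\g1$};

# The same pattern with the lookahead removed.
my $plain = qr{^(["'])(.*)\g1$};

for my $case (q{'abc'}, q{"abc"}, q{'abc'def'}) {
    printf "%-12s contradictory=%d plain=%d\n",
        $case, matches($contradictory, $case), matches($plain, $case);
}
```

Backtracking does not help here: wherever `.*` gives characters back, the lookahead and the backreference are still tested at the same position.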

    I think this (.*(?!\g1).) ((?!.*\g1.)) should fix it

    • "Fail if I can match the delimiter which is still followed by any other character"
    edit

    Personally I would advise using a negated character class. (Update: see here)

    update

    that's what you want?

    (update: nope, it's still wrong, am too tired and tybalt89 already got it right :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Thank you for "Fail if I can match the delimiter which is still followed by any other character".

      I would have used a negated character class except I wanted to maybe support multi-character quotes (e.g. two apostrophes, ''). But mostly I just wanted to understand the negative lookahead!

Re: Using negative lookahead
by LanX (Saint) on Oct 19, 2017 at 16:44 UTC
    Another take on it:

    Unfortunately regexes don't support backreferences in character classes - like [^\g1] - to forbid the delimiter inside the string. (at least I couldn't find it.)

    But it's possible to have the same effect with negative lookaheads

    DB<34> $_='abbbbbbbbbba'
    DB<35> x m/^(.) ( (?!\1) . )* \1$/x
    0  'a'
    1  'b'

    NB: a lookahead doesn't move the position, which is why the position has to be advanced with a '.'

    And this approach seems to work in your code:
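A minimal sketch of how that per-character lookahead might slot into `strip_quotes`, not necessarily the exact code that was posted. It assumes a variant of `strip_quotes` that returns the result instead of printing it, and it wraps the repeated group in an outer capture, because `( (?!\1) . )*` on its own only keeps the last character it matched (the 'b' in the debugger session above).

```perl
use strict;
use warnings;

# Assumed variant of strip_quotes: returns the body when the string is
# wrapped in matching quotes and the delimiter does not recur inside;
# otherwise returns the string unchanged.
sub strip_quotes {
    my ($line) = @_;
    if ( $line =~ m{^ (["'])                # capture the delimiter
                      ( (?: (?!\g1) . )* )  # body: characters that are not the delimiter
                      \g1 $}x               # closing delimiter at end of string
       ) {
        return $2;
    }
    return $line;
}

for my $case (q{'abc"def'}, q{'abc'}, q{'abc'def'}, q{'ab''}, q{"abc}) {
    printf "%-12s => %s\n", $case, strip_quotes($case);
}
```

The lookahead is tested before every single character of the body, so the body can never run over a delimiter; the only delimiter left to match is the closing one.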

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      "Unfortunately regexes don't support backreferences in character classes - like [^\g1] - to forbid the delimiter inside the string. (at least I couldn't find it.)"

      I wouldn't think that would work in the general case, where \g1 refers to more than one character, would it?

        Depends on what you mean by the general case.

        Do you mean ...

        • ... [^\g1] with \g1 more than one letter?
        This is hypothetical, since even one letter doesn't work.
        • ... ( (?! \g1) .)*
        This would disallow a multi-character sequence if the captured group holds a word.

        E.g. if the group holds the word "not", that word is forbidden from appearing in what follows.

        • ... ( (?! \g1) (?! \g2) .)*
        Here chaining look-aheads work like AND conditions.

        For single characters, this would be the equivalent of [^\g1\g2] (if that were possible)
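As a rough sketch of the multi-character case, the same `( (?!\g1) . )*` idea carries over unchanged when the delimiter group may capture two apostrophes. The helper name `body_between` and the choice of `''` as the extra delimiter are assumptions for illustration (the `''` example comes from the original question).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the body between matching delimiters,
# where the delimiter may be '' (two apostrophes), ' or ".
# Returns undef when there is no clean match.
sub body_between {
    my ($line) = @_;
    return $line =~ m{^ (''|["'])             # try the two-character delimiter first
                        ( (?: (?!\g1) . )* )  # body: no position may start the delimiter
                        \g1 $}x ? $2 : undef;
}

for my $case (q{''abc''}, q{''a'b''}, q{''a''b''}, q{'abc'}) {
    my $body = body_between($case);
    printf "%-12s => %s\n", $case, defined $body ? "{$body}" : 'no match';
}
```

With `''` as the delimiter, a lone apostrophe inside the body is fine, since the lookahead only rejects positions where the whole two-character sequence begins.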

        You might be interested in this excellent tutorial:

        Using Look-ahead and Look-behind

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

      Actually this looks simpler than the earlier solution. And I get the bit about moving the position with '.'.
Re: Using negative lookahead
by Anonymous Monk on Oct 20, 2017 at 15:46 UTC

Node Type: perlquestion [id://1201640]
Approved by Athanasius
Front-paged by Corion