http://www.perlmonks.org?node_id=24661


in reply to Death to Dot Star!

Aha! One of the classic mistakes was made on this code:
$myvar =~
  /"              # First quote
   (              # Capture text to $1
     (?:          # Non-backreferencing parentheses
       [^?"]      # Anything that's not a question mark or quote
       |          # or
       \?[^"]     # A question mark not followed by a quote (to allow embedded question marks)
     )*           # Zero or more
   )              # End capture
   \?"/x;         # Followed by a question mark and quote
Try this with
$myvar = q{ abc"def??"ghi?"jkl };
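And you'll see that it captures ghi, not the def? you wanted out of "def??". The problem is that the "question mark NOT followed by a quote" branch can sometimes eat up the question mark that you need to begin your closing delimiter. Here \?[^"] swallows both question marks as a pair, and when the engine backtracks, the first question mark is followed by another question mark rather than a quote, so no match can start at the first quote and the engine moves on to the next one.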
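
To watch the failure happen, here is a minimal sketch that runs the same pattern (written without /x) against the test string. The helper name first_capture is made up for this illustration and is not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the original (broken) pattern captures,
# or undef if the pattern does not match at all.
sub first_capture {
    my ($text) = @_;
    return $text =~ /"((?:[^?"]|\?[^"])*)\?"/ ? $1 : undef;
}

my $myvar = q{ abc"def??"ghi?"jkl };
my $got   = first_capture($myvar);
print "matched <$got>\n";    # prints: matched <ghi>
```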

The proper way to tackle this is to "inch-along"...

$myvar = q{ abc"def??"ghi?"jkl };
print "matched <$1>" if $myvar =~
  /"              # First quote
   (              # Capture text to $1
     (?:          # Non-backreferencing parentheses
       (?!\?")    # not question quote?
       .          # ok to inch along
     )*           # Zero or more
   )              # End capture
   \?"/sx;        # Followed by a question mark and quote
which properly prints:
matched <def?>

I tackled this kind of thing a lot back in the early, pre-Ilya regex days, when people kept posing the "how do I match a C comment?" question. I got pretty good at breaking just about any regex that claimed to match a comment, by undoing whatever assumption it made.
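
As a rough sketch of how the same inch-along idiom carries over to the C comment question, something like the following could work. The helper name c_comment_bodies is hypothetical, and the sketch assumes the input has no string literals that contain comment delimiters, which is exactly the sort of assumption that gets such regexes broken.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of every /* ... */ comment in $src.
# Assumption: $src has no string literals containing comment delimiters.
sub c_comment_bodies {
    my ($src) = @_;
    my @bodies = $src =~ m{
        /\*             # opening delimiter
        (               # capture the comment body
          (?:
            (?!\*/)     # not at the closing delimiter?
            .           # ok to inch along
          )*
        )
        \*/             # closing delimiter
    }sxg;
    return @bodies;
}

my @found = c_comment_bodies("a /* one **/ b /* two\nlines */ c");
print "<$_>\n" for @found;
```

The /s flag lets the dot cross newlines, so multi-line comments are handled the same way as single-line ones.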

-- Randal L. Schwartz, Perl hacker

WARNING merlyn wrote BAD CODE
by Ovid (Cardinal) on Jul 27, 2000 at 22:55 UTC
    Okay, the title is kind of a joke. It's just a good-natured tweak at merlyn for the brouhaha over his WARNING t0mas wrote BAD CODE node that generated so much flak. No offense intended :)

    merlyn's code was bugging me, but I couldn't quite put my finger on why. My problem was that the dot metacharacter is so indiscriminate that it will match anything. However, I simply assumed that if merlyn posted the code, it must work. His code is great if you're checking for C-style comments that begin and end with something like /* comment here */ or "? comment here ?". But if you read my post, that's not what we were checking for:

      What happens if you were trying to extract questions in quotes without the trailing question mark?
    I mentioned embedded question marks (my idea was that we might have more than one question in a quote), but I never mentioned embedded quotes. I just wanted one set of quotes and my original post bears that out. Here's merlyn's code and my correction:
    #!/usr/bin/perl -w
    $myvar = q{ abc"def"g"hi?"jkl };

    # This regex is from merlyn
    print "matched <$1>\n" if $myvar =~
      /"              # First quote
       (              # Capture text to $1
         (?:          # Non-backreferencing parentheses
           (?!\?")    # not question quote?
           .          # ok to inch along
         )*           # Zero or more
       )              # End capture
       \?"/sx;        # Followed by a question mark and quote

    # This regex is from Ovid
    print "matched <$1>\n" if $myvar =~
      /"              # First quote
       (              # Capture text to $1
         (?:          # Non-backreferencing parentheses
           [^?"]      # Not a question mark or quote
           |          # or
           \?(?!")    # A question mark not followed by a quote
         )*           # Zero or more
       )              # End capture
       \?"/sx;        # Followed by a question mark and quote
    The first regex will print matched <def"g"hi>. The second will print matched <hi>.

    No disrespect is intended towards Randal as he was right in pointing out that my first regex was broken.

    Cheers,
    Ovid

(Ovid) RE(2): Death to Dot Star!
by Ovid (Cardinal) on Jul 27, 2000 at 19:37 UTC
    Drat, drat, drat! And I was on a roll :) Nice work.

    Rather than simply testing for a question mark followed by a character that is not a quote (\?[^"]), I should have tested for a question mark with a negative look-ahead (\?(?!")) for a quote. This appears to work:

    $myvar =~ /"((?:[^?"]|\?(?!"))*)\?"/;
    Unfortunately, Benchmark shows that it's not quite as fast as merlyn's version.
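
The post doesn't show the benchmark itself, so here is only a sketch of how such a comparison might be set up with the core Benchmark module's cmpthese. The sample string and the labels inch_along and alternation are my assumptions, and the timings will vary with the input and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input is an assumption; timings depend heavily on the data.
my $myvar = q{ abc"def??"ghi?"jkl };

# Each candidate returns its capture so the two can be checked for agreement.
my %candidates = (
    inch_along  => sub { ($myvar =~ /"((?:(?!\?").)*)\?"/s)[0] },
    alternation => sub { ($myvar =~ /"((?:[^?"]|\?(?!"))*)\?"/s)[0] },
);

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese(-1, \%candidates);
```

Both candidates capture the same text on this particular input; that is worth confirming before comparing speed, since the two patterns are not equivalent in general.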

    For those unfamiliar with lookaheads, they allow you to test for text without "bumping along" the regex. In other words, \?[^"] will check for a question mark followed by a non-quote character, but further matching of the regex continues after the non-quote character. \?(?!") allows you to check for a question mark not followed by a quote, but continues matching after the question mark.
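
A small sketch makes the difference in consumed text visible. The sample string and variable names are just for illustration.

```perl
use strict;
use warnings;

my $text = 'a?b';

# Character class: the character after the question mark is consumed too.
my ($with_class) = $text =~ /(\?[^"])/;

# Negative lookahead: only the question mark itself is consumed.
my ($with_lookahead) = $text =~ /(\?(?!"))/;

print "class:     <$with_class>\n";        # prints: class:     <?b>
print "lookahead: <$with_lookahead>\n";    # prints: lookahead: <?>
```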

    Note: There is a subtle difference between the negated character class and the negative lookahead. The negated character class generally requires a character after the question mark (in the above example), while the negative lookahead just makes sure that a quote doesn't follow the question mark and doesn't require a character.
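
The case where the question mark is the very last character shows this most clearly. Again, the sample string is only an assumption for the sake of the example.

```perl
use strict;
use warnings;

my $text = 'why?';    # the question mark is the last character

# The negated class needs some character after the question mark; there is none.
my $class_matches = $text =~ /\?[^"]/ ? 1 : 0;

# The lookahead only insists that a quote does not follow; end of string is fine.
my $lookahead_matches = $text =~ /\?(?!")/ ? 1 : 0;

print "class: $class_matches, lookahead: $lookahead_matches\n";
# prints: class: 0, lookahead: 1
```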

    Cheers,
    Ovid