Incorrect Pattern Matching Behavior

by T.G. Cornholio (Scribe)
on Mar 22, 2006 at 00:11 UTC ( [id://538355]=perlquestion )

T.G. Cornholio has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
Why does this string match the regexp when I use case-insensitive matching? IMHO it should not match in either case, and yet it does when I add case insensitivity. Here is the code:
$_ = ' onn (bbcreccsnnl_output) !OUTPUT';
if ($_ =~ /^S\s+[-]*\d+[\.\d+]*\s+[-]*\d+[\.\d+]*\s*\(\s*IOPUT|OUTPUT\s*\)/i) {
    print "It matches with case insensitive...\n";
} else {
    print "It does NOT match with case insensitive...\n";
}
if ($_ =~ /^S\s+[-]*\d+[\.\d+]*\s+[-]*\d+[\.\d+]*\s*\(\s*IOPUT|OUTPUT\s*\)/) {
    print "It matches without case insensitive...\n";
} else {
    print "It does NOT match without case insensitive...\n";
}
And here is the output (on a linux box):
It matches with case insensitive...
It does NOT match without case insensitive...
Your insights are greatly appreciated. This has me baffled (not that this is difficult to achieve or anything).

Update: it seems to have the same problem even with the much simpler regexp:
($_ =~ /\(\s*IOPUT|OUTPUT\s*\)/)

Replies are listed 'Best First'.
Re: Incorrect Pattern Matching Behavior
by jonadab (Parson) on Mar 22, 2006 at 02:05 UTC

    Allow me to rewrite your regular expression so that you can see the problem more easily:

    /^ThisPartDoesNotMatchAndIsInFactIrrelevantToYourProblem|OUTPUT\s*\)/

    Try it. With the case insensitivity, it will still match. Do you see the problem now? It can be corrected by using grouping (e.g., non-escaped parens) to specify more precisely what you really meant (update: as hv shows very nicely).
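A minimal sketch for trying this out, using the string from the original question. The variable names are made up for illustration; only the pattern and the test string come from the thread.

```perl
use strict;
use warnings;

my $line = ' onn (bbcreccsnnl_output) !OUTPUT';

# Without grouping, | splits the whole pattern into two alternatives:
#   1. ^ThisPart...  (anchored to the start, cannot match this string)
#   2. OUTPUT\s*\)   (unanchored, free to match anywhere in the string)
my $re_i = qr/^ThisPartDoesNotMatchAndIsInFactIrrelevantToYourProblem|OUTPUT\s*\)/i;
my $re   = qr/^ThisPartDoesNotMatchAndIsInFactIrrelevantToYourProblem|OUTPUT\s*\)/;

# With /i, the lower-case "output)" satisfies alternative 2.
my $matched_i = ($line =~ $re_i) ? 1 : 0;

# Without /i, only the upper-case OUTPUT qualifies, and it is not followed by ")".
my $matched = ($line =~ $re) ? 1 : 0;

print "with /i: $matched_i, without /i: $matched\n";
```

The anchor ^ belongs only to the first alternative, which is why the missing leading "S" never gets a chance to reject the line.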


    Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
Re: Incorrect Pattern Matching Behavior
by McDarren (Abbot) on Mar 22, 2006 at 00:34 UTC
    Well, it's because the lower-case "output" in your string is being matched (rather than the upper-case which is what I suspect you are expecting). You can confirm that like so:
    if ($_ =~ /^S\s+[-]*\d+[\.\d+]*\s+[-]*\d+[\.\d+]*\s*\(\s*IOPUT|OUTPUT\s*\)/i) {
        print "PRE=$`\nMATCH=$&\nPOST=$'\nIt matches with case insensitive...\n";
    }
    Which outputs:
    PRE= onn (bbcreccsnnl_
    MATCH=output)
    POST= !OUTPUT
    It matches with case insensitive...
    That's the easy bit :)
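As an aside, the same three pieces can be recovered from the match offsets in @- and @+ instead of $`, $& and $', which perlvar warns can slow down other regexes in the program on older perls. This is only a sketch, run against the simpler pattern from the question's update; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = ' onn (bbcreccsnnl_output) !OUTPUT';
my ($pre, $match, $post) = ('', '', '');

if ($line =~ /\(\s*IOPUT|OUTPUT\s*\)/i) {
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past where it ends.
    $pre   = substr($line, 0, $-[0]);
    $match = substr($line, $-[0], $+[0] - $-[0]);
    $post  = substr($line, $+[0]);
}

print "PRE=$pre\nMATCH=$match\nPOST=$post\n";
```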

    Getting your expression to actually do what you want it to may be a little trickier. Perhaps if you could list the "rules" for the match, I (or others) may be able to help you craft an appropriate expression.

    Cheers,
    Darren :)

      Thanks for the quick reply. I actually don't want it to match this line. The line I'm looking for should look something like:

      S 0.0 0.0 (OUTPUT)

      How is this matching at all when the input doesn't have a first character of "S"? Shouldn't the ^S be enough to say that this line does not match?

      Also, it uses \(\s*INOUT|OUTPUT\s*\). How can this match when there are non-whitespace characters between the opening ( and the string OUTPUT?

      My rules, I guess, would be:

      Starts with S
      Two floats (could be negative or integer portion only)
      Opening parenthesis possibly followed by whitespace
      Either INOUT or OUTPUT (case insensitive) possibly followed by whitespace
      Closing parenthesis

      I may be very confused here, so your help is much appreciated.

        The problem is in the implementation of this clause:

        Either INOUT or OUTPUT

        Your regular expression is getting parsed as:

        /
          # either this
          ^ S \s+ [-]* \d+ [\.\d+]* \s+ [-]* \d+ [\.\d+]* \s* \( \s* IOPUT
        |
          # or this
          OUTPUT \s* \)
        /ix

        To achieve your aims, you need to tell perl where your list of alternates starts and ends, with capturing (...) or non-capturing (?:...) parens:

        /
          # all of this
          ^ S \s+ [-]* \d+ [\.\d+]* \s+ [-]* \d+ [\.\d+]* \s* \( \s*
          # and one of these two
          (?: IOPUT | OUTPUT )
          # and all of this
          \s* \)
        /ix

        I'd also recommend using the extended layout permitted by the //x flag for long expressions like this.
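To check the grouped version, here is a small sketch that runs it against the offending string and against the line the questioner actually wants. The pattern is the one above, kept as written (including IOPUT); the array names are just for illustration.

```perl
use strict;
use warnings;

my $re = qr/
    ^ S \s+
    [-]* \d+ [\.\d+]* \s+     # first number, as written in the question
    [-]* \d+ [\.\d+]* \s*     # second number
    \( \s*
    (?: IOPUT | OUTPUT )      # the alternation is now confined to these two words
    \s* \)
/ix;

my @lines = (
    ' onn (bbcreccsnnl_output) !OUTPUT',
    'S 0.0 0.0 (OUTPUT)',
);

my @results;
for my $line (@lines) {
    push @results, ($line =~ $re) ? 1 : 0;
}

print "$lines[$_] => $results[$_]\n" for 0 .. $#lines;
```

With the group in place, ^ S applies to the whole pattern, so the first line is rejected even under /i.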

        Hope this helps,

        Hugo

        Shouldn't the ^S be enough to say that this line does not match?
        Yes, I believe it should - and I'm afraid that part of it also has me stumped. Perhaps some other monk can explain why that is so.

        Update: ahh, of course - as others have pointed out below - it's because you haven't used parentheses to define the boundaries of your alternation :)

        Anyhow, getting back to your requirements, here is how I would do it:

        Update: oops, I just realised that I posted the wrong pattern. It can be simplified somewhat by grouping the part that matches the floats and using the {2} quantifier. I've updated it (output remains the same)

        use strict;
        use warnings;

        while (<DATA>) {
            if (/^S\s+(?:\-?\d+(?:\.\d+)?\s+){2}\(\s?(?:(INOUT|OUTPUT))\s?\)/) {
                print "Matched:$_";
            }
            else {
                print "Did NOT match:$_";
            }
        }
        __DATA__
         onn (bbcreccsnnl_output) !OUTPUT
        S 0.0 0.0 (OUTPUT)
        A 0.0 0.0 (OUTPUT)
        S 1 4 5 (OUTPUT)
        S 35 -27 ( INOUT )
        S -26.95 32.73 (OUTPUT )
        The above outputs:
        Did NOT match: onn (bbcreccsnnl_output) !OUTPUT
        Matched:S 0.0 0.0 (OUTPUT)
        Did NOT match:A 0.0 0.0 (OUTPUT)
        Did NOT match:S 1 4 5 (OUTPUT)
        Matched:S 35 -27 ( INOUT )
        Matched:S -26.95 32.73 (OUTPUT )
        Which I believe meets your requirements, yes?
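One gap worth noting: the stated rules ask for INOUT or OUTPUT to be case insensitive and allow any amount of whitespace inside the parentheses, while the pattern above is case sensitive and allows at most one whitespace character on each side. A possible variant, as a sketch only: is_output_line is a hypothetical helper name, and keeping the leading S case sensitive is an assumption, since the rules do not say either way.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the line satisfies the stated rules, else 0.
sub is_output_line {
    my ($line) = @_;
    return $line =~ m/
        ^ S \s+
        (?: -? \d+ (?: \. \d+ )? \s+ ){2}   # two numbers, optionally negative, optionally fractional
        \( \s*
        (?i: INOUT | OUTPUT )               # only the keyword is case insensitive
        \s* \)
    /x ? 1 : 0;
}

for my $line (
    'S 0.0 0.0 (output)',
    'S 35 -27 (   InOut   )',
    's 0.0 0.0 (OUTPUT)',
    'S 1 4 5 (OUTPUT)',
    ' onn (bbcreccsnnl_output) !OUTPUT',
) {
    printf "%-40s %s\n", $line, is_output_line($line) ? 'matched' : 'did NOT match';
}
```

The inline (?i: ... ) group limits case insensitivity to the keyword. If the leading S should also be case insensitive, a plain /i on the whole pattern would do instead.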
Re: Incorrect Pattern Matching Behavior
by T.G. Cornholio (Scribe) on Mar 22, 2006 at 02:14 UTC
    Thanks to everyone. Both hv and jonadab pointed out the solution. It's working perfectly now.
