
Puzzled by regex

by syphilis (Chancellor)
on Apr 10, 2013 at 03:35 UTC (#1027882=perlquestion)
syphilis has asked for the wisdom of the Perl Monks concerning the following question:

In's sub read_DATA there's a regex seeking to match /__\S+?__\n/
I haven't yet seen how the ? can have an effect on the result.

That is, for what strings will /__\S+?__\n/ and /__\S+__\n/ return a different result ?

I ran this script (perl-5.16.0) to check for some difference - but it doesn't detect any:
    use warnings;

    @str = ("____\n", "__ __\n", "__X __\n", "__Z__\n", "__\n__\n", "__^__\n");

    for (@str) {
        if ($_ =~ /__\S+?__\n/) { print "1 " } else { print "0 " }
        if ($_ =~ /__\S+__\n/)  { print "1\n" } else { print "0\n" }
    }

    __END__
    Outputs:
    0 0
    0 0
    0 0
    1 1
    0 0
    1 1

Replies are listed 'Best First'.
Re: Puzzled by regex
by davido (Archbishop) on Apr 10, 2013 at 06:16 UTC

    There is a difference, but it's not necessarily in what is getting matched, but rather, in how it's matching. Example:

        use strict;
        use warnings;

        my @strings = (
            "____\n", "__ __\n", "__X __\n", "__Z__\n",
            "__\n__\n", "__^__\n", "_____\n", "________\n",
        );

        foreach my $string ( @strings ) {
            print "<<$string>>\n";
            if( $string =~ m/__(\S+?)__/ ) {
                print "\tNon-Greedy -- Match: (($1)).\n";
            }
            else {
                print "\tNon-Greedy -- No Match.\n";
            }
            if( $string =~ m/__(\S+)__/ ) {
                print "\tGreedy -- Match: [[$1]].\n";
            }
            else {
                print "\tGreedy -- No Match.\n";
            }
        }

    Most of that is going to be pretty boring, until you get to the last item in the list, where you'll get the following output:

        <<________
        >>
                Non-Greedy -- Match: ((_)).
                Greedy -- Match: [[____]].

    I have no idea whether non-greedy matching is going to have any practical effect in the type of strings you're matching with the regex though.


      Yes, but you forgot to put the trailing \n into the two regexes :-)
      If I put it in, the last string gives the same capture under both regexes as well:
          Non-Greedy -- Match: ((____)).
          Greedy -- Match: [[____]].
      Thanks for the replies, guys.
      I'm about to mess with that code, but I was loath to do that while I couldn't see why the ? had been included in the regex. I still don't see why it's there, but at least now I'm starting to feel a little confident that it serves no purpose. (I'll still probably leave it there ... because I'm feeling even more confident that it doesn't do any harm :-)


        but at least now I'm starting to feel a little confident that it serves no purpose. (I'll still probably leave it there ... because I'm feeling even more confident that it doesn't do any harm :-)

        It's probably a reflex :) I know that when I write regexes I make more mistakes from greediness than from non-greediness, so I tend to write +? and *? to be on the safe side.

        I know I'm not alone in getting bitten by it; it comes up frequently as a cause (and a solution) in newbies' questions.

Re: Puzzled by regex
by Athanasius (Chancellor) on Apr 10, 2013 at 04:26 UTC

    When I saw the regex \S+?, my first thought was that this is equivalent to \S*. But it isn’t, as a little experimentation shows.

    Consulting the Camel Book (4th Edition, page 214), I found that + means “1 or more times maximally” and +? means “1 or more times minimally.”

    So, the difference between the two forms is not whether they match: if one matches, both must match. The difference lies only in what is matched, and this is relevant only if this part is captured (or, just possibly, if efficiency is an issue).
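To make the "maximally" versus "minimally" distinction concrete, here is a minimal sketch. The sample string is made up for illustration and is not taken from the code under discussion. Both patterns match, and both start at the same position; only the captured text differs.

```perl
use strict;
use warnings;

my $text = "foo_bar_baz";   # hypothetical sample string

# \S+ grabs as much as it can, then backs off until a "_" can follow.
my ($greedy) = $text =~ /(\S+)_/;

# \S+? grabs as little as it can, then extends until a "_" can follow.
my ($lazy) = $text =~ /(\S+?)_/;

print "greedy: $greedy\n";   # greedy: foo_bar
print "lazy:   $lazy\n";     # lazy:   foo
```

Note that the trailing context here (a single "_") can be satisfied in more than one place, which is exactly what lets the two quantifiers disagree.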

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      The difference lies only in what is matched, and this is relevant only if this part is captured

      Well ... the regex does capture that part but afaics, when both regexes match they both match the same thing.
      Do you have an example that demonstrates this difference ?

      Just to be clear - I can see that /\S+?/ and /\S+/ could conceivably match differently, but I don't see how /__\S+?__\n/ and /__\S+__\n/ can match differently.
      (And it's important to me that I do understand how they match differently if, indeed, they can.)

      In case I'm guilty of not presenting the full picture, the regex (it's a split) as it appears in is actually:
      @{$DATA{$pkg}} = split /(?m)(__\S+?__\n)/, $data;
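For what it's worth, the split itself can be tried with both forms of the pattern. This is only a sketch: the $data string below is a made-up stand-in for a DATA section with two markers, not the module's real input. Because the pattern is in capturing parentheses, split returns the markers as fields too, and because the first marker sits at the very start of the string, the first field is empty. The (?m) flag only changes how ^ and $ behave, so it has no effect on this particular pattern.

```perl
use strict;
use warnings;

# Hypothetical DATA-style payload with two section markers.
my $data = "__C__\nint add(int a, int b);\n"
         . "__Python__\ndef add(a, b): pass\n";

my @lazy   = split /(?m)(__\S+?__\n)/, $data;
my @greedy = split /(?m)(__\S+__\n)/,  $data;

# Compare the two field lists element by element via a NUL-joined string.
my $same = join("\0", @lazy) eq join("\0", @greedy);

print "number of fields: ", scalar(@lazy), "\n";
print "lazy and greedy splits are ", ($same ? "identical" : "different"), "\n";
```

On this sample both splits give the same five fields: an empty string, "__C__\n", the C line, "__Python__\n", and the Python line.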
Re: Puzzled by regex
by Don Coyote (Pilgrim) on Apr 10, 2013 at 08:18 UTC

    Random thoughts on match operating efficiency

    The difference between the maximally matching (greedy) quantifier (.+) and the minimally matching (non-greedy) quantifier (.+?), in the case of the + (1 or more) quantifier, is in what is matched but, more importantly, in how, or from where, it is matched.

    In the maximal case the match position begins at the end of the line and backtracks one position at a time, checking for a match each time, repeating until success or until the starting match position is reached.

    In the non-greedy case the match position starts from the starting match position and moves forward one character at a time until success or the end of the line.

    application of + quantifier behaviour to ? quantifier behaviour:

    applying this to the ?(0 or 1) quantifier, I would expect the matching start position differs in the case of a greedy match starting at 1 position ahead, and in the nongreedy case starting at the starting match position.

    Random summation:

    The difference is not in what is matched, but how, or from where, the matching starts. This effectively increases the nongreedy match efficiency by the reduction of one jump ahead operation per usage.

    Just Random:

    I would imagine this will have been internally optimised, unless (or even especially if) there is perhaps a security benefit of a look forward match opposed to a look behind match

    update later the same day

    Crumbs, "+ (0 or 1) quantifier": well, that is incorrect. The '+' is the (1 or more) quantifier.

    OK, so to fix the above example I have replaced the '*' quantifiers with '+' quantifiers, and I have replaced the '+' quantifiers with '?' quantifiers, so at least what I wrote makes sense. Which it does, now that the syntax errors are rectified.

    After attempting to provide some examples where differences would be found between the default greedy behaviour and the non-greedy behaviour indicated by a secondary '?' quantifier, I realised that you are right: there are no differences in what is matched when the '\n' is included. This agrees with davido's response and my own, namely that the difference is in how the match is carried out.
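One way to back up the "no differences when the '\n' is included" conclusion is a brute-force check. The reasoning: neither "__" nor \S+ can contain a newline, so once the leftmost start position is fixed, the match has to end at the first "\n" after it, which leaves only one possible capture. The sketch below compares the two regexes over every string up to a small length; the alphabet and the length bound are arbitrary choices made for illustration.

```perl
use strict;
use warnings;

my @alphabet = ('_', 'X', ' ', "\n");   # assumed small test alphabet
my $max_len  = 7;                        # assumed bound, keeps the run short

my $lazy_re   = qr/__(\S+?)__\n/;
my $greedy_re = qr/__(\S+)__\n/;

# Describe a match as "start:end:capture", or "none" if the regex fails.
sub describe {
    my ($re, $s) = @_;
    return 'none' unless $s =~ $re;
    return join ':', $-[0], $+[0], $1;
}

my ($checked, $mismatches) = (0, 0);
my @queue = ('');
while (defined(my $s = shift @queue)) {
    $checked++;
    $mismatches++ if describe($lazy_re, $s) ne describe($greedy_re, $s);
    # Extend the string by one character until the length bound is reached.
    push @queue, map { $s . $_ } @alphabet if length($s) < $max_len;
}

print "checked $checked strings, $mismatches mismatches\n";
```

This covers 21845 strings and should report 0 mismatches. It is not a proof for all inputs, but it agrees with the argument above.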

Re: Puzzled by regex
by Loops (Curate) on Apr 10, 2013 at 04:20 UTC

    In the first regular expression you're using the '?' operator which says the previous character or group is optional. But you're applying it against the '+' operator which makes no sense, since it means one-or-more.

    If instead you group the \S and + together and then apply the '?', you'll see different results:

        use warnings;

        my @str = ("____\n", "__ __\n", "__X __\n", "__Z__\n", "__\n__\n", "__^__\n");

        for (@str) {
            if ($_ =~ /__(\S+)?__\n/) { print "1 " } else { print "0 " }
            if ($_ =~ /__\S+__\n/)    { print "1\n" } else { print "0\n" }
        }

        __END__
        1 0
        0 0
        0 0
        1 1
        0 0
        1 1

    So the answer is, just has a bug.

      This is not exactly accurate. \S+ greedily matches one or more non-space characters. \S+? non-greedily matches one or more non-space characters. The syntax does make sense.



        Thanks for correcting me. I was just dead wrong.

        It's explained here: Matching Repetitions. A bit further down it explains "minimal match or non-greedy quantifiers ??, *?, +?, and {}?".

        Learn something new every day.
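As a quick illustration of those non-greedy forms, here is a small sketch that runs each quantifier and its '?'-suffixed counterpart against the same string. The sample string and the labels are made up for the example; each pattern is anchored at the start so that only the quantifier's appetite varies.

```perl
use strict;
use warnings;

my $s = "aaaa";   # hypothetical sample string

# Each entry: a label and a pattern anchored at the start of the string.
my @tests = (
    [ 'a?',      qr/^(a?)/      ],
    [ 'a??',     qr/^(a??)/     ],
    [ 'a*',      qr/^(a*)/      ],
    [ 'a*?',     qr/^(a*?)/     ],
    [ 'a+',      qr/^(a+)/      ],
    [ 'a+?',     qr/^(a+?)/     ],
    [ 'a{2,3}',  qr/^(a{2,3})/  ],
    [ 'a{2,3}?', qr/^(a{2,3}?)/ ],
);

my %got;
for my $t (@tests) {
    my ($label, $re) = @$t;
    ($got{$label}) = $s =~ $re;
    printf "%-8s captured '%s'\n", $label, $got{$label};
}
```

With nothing after the quantifier to force it to extend, each non-greedy form settles for its minimum: '' for a?? and a*?, 'a' for a+?, and 'aa' for a{2,3}?.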

Node Type: perlquestion [id://1027882]
Approved by davido