Re^2: Regex result being defined when it shouldn't be(?)

by chenhonkhonk (Acolyte)
on Nov 14, 2017 at 16:46 UTC


in reply to Re: Regex result being defined when it shouldn't be(?)
in thread Regex result being defined when it shouldn't be(?)

P.p.s: After thinking about why I would've been using the quantifiers outside vs inside, separate from maybe capturing only one repetition of a group, I figured it out:

Alternations. If you wanted a word among multiple choices but only 0-1 times, you have a choice of sorts:
(this|that|third_thing)?
((this)?|(that)?|(third_thing)?)
The first one is pretty clear, I want 0 or 1 of any of those words. It will return undef if I have 0.

The second one, I don't even trust it. I think I could match all 3 if they happen in a row. Additionally, there's probably 4 capture groups created as a result.
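A minimal sketch of how the two patterns differ when none of the words is present. The input string and variable names are hypothetical, and the third pattern (an empty alternative) is included only for comparison:

```perl
use strict;
use warnings;

# Hypothetical input that contains none of the three words.
my $input = 'something else';

# Optional group: the match succeeds, but the group never takes part,
# so the capture is undef.
my ($optional) = $input =~ /^(this|that|third_thing)?/;

# Per-alternative quantifiers: `(this)?` matches zero times, so the
# outer group captures an empty string while the inner group is undef.
my ($outer, $inner) = $input =~ /^((this)?|(that)?|(third_thing)?)/;

# Empty alternative: the group takes part and captures an empty string.
my ($empty_alt) = $input =~ /^(this|that|third_thing|)/;

for my $pair ( [ optional  => $optional ],
               [ outer     => $outer ],
               [ inner     => $inner ],
               [ empty_alt => $empty_alt ] ) {
    my ($name, $value) = @$pair;
    printf "%-9s %s\n", $name, defined $value ? "defined ('$value')" : 'undef';
}
```

Under these assumptions only the first pattern leaves its capture undefined; the other two produce a defined but empty (and therefore false) string.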

A quick search on if I had used 'alternation' properly: https://docstore.mik.ua/orelly/perl4/prog/ch05_08.htm
"When you apply the ? to a subpattern that captures into a numbered variable, that variable will be undefined if there's no string to go there. If you used an empty alternative, it would still be false, but would be a defined null string instead."

Re^3: Regex result being defined when it shouldn't be(?)
by haukex (Archbishop) on Nov 14, 2017 at 17:12 UTC
    The second one, I don't even trust it. I think I could match all 3 if they happen in a row.

    No, it's fine, it reads like so: Match one of the three choices: "this" or "", "that" or "", or "third_thing" or "". Just like in your first example, the parentheses and alternation operator make sure that it will match only one of the three choices at that place in the regex.

    Additionally, there's probably 4 capture groups created as a result.

    Correct, but you can use non-capturing (?: ) parens to avoid that, i.e. ((?:this)?|(?:that)?|(?:third_thing)?) would make it have only one capturing group, like your first example. <update> And AnomalousMonk made an excellent point about (?| ) here. </update>
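To make the group counts concrete, here is a small sketch with a hypothetical input word. Both patterns are anchored at each end so that the empty match of the first alternative cannot succeed on its own:

```perl
use strict;
use warnings;

my $word = 'that';   # hypothetical input

# In list context a successful match returns one element per capture group.
my @capturing     = $word =~ /^((this)?|(that)?|(third_thing)?)$/;
my @non_capturing = $word =~ /^((?:this)?|(?:that)?|(?:third_thing)?)$/;

print "capturing parens:     ", scalar(@capturing), " groups\n";
print "non-capturing parens: ", scalar(@non_capturing), " group\n";

# Show which groups actually took part in the match.
print join(', ', map { defined $_ ? "'$_'" : 'undef' } @capturing), "\n";
```

With capturing parentheses there are four groups, of which only the outer one and the one around "that" are defined; with (?: ) only the outer group remains.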

    I'd recommend a read of perlrequick, perlretut, and perlre for all of these features and the ones I mentioned earlier. Also, for playing around with regexes and testing out what they do, see my post here.

      I've already read about regexes. From a book. Most of the time each source doesn't explicitly bring up the full exceptions or they use jargon (like alternations.. and not 'or')

      I don't know if I intend offense or not, but comparing something like:
          ((?:this)?|(?:that)?|(?:third_thing)?)
          if( ! $1 ){} #or defined, but why bother #nvm, see below
      vs
          $_ = /(this|that|third_thing)?/;
          if( defined $1 eq "" ){}
      Seems like there's a huge difference on readability, not even getting into when you have many alternations. Even getting rid of eq "".

      And of course "" and "0" are defined but not a true value, so if you were looking for numeric characters, you can't use if($1). I guess I need to try raw values and see how Perl handles \0 in true/false/define settings
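A short sketch of that pitfall, using a hypothetical input string: a captured "0" is defined but false, so a truth test and a definedness test disagree. The last line is an aside on a string containing a single NUL character, which counts as true because only "" and "0" are false strings:

```perl
use strict;
use warnings;

my $text = 'value=0';               # hypothetical input
my ($digit) = $text =~ /=(\d)?/;    # captures the string "0"

my $is_true    = $digit         ? 1 : 0;   # truth test
my $is_defined = defined $digit ? 1 : 0;   # definedness test
my $nul_true   = "\0"           ? 1 : 0;   # one-character NUL string

print "captured '$digit': true=$is_true defined=$is_defined\n";
print "NUL string: true=$nul_true\n";
```

Here the truth test reports 0 while the definedness test reports 1, which is why defined (or a named capture with exists) is the safer check for "did this group match".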

      Edit: ARGH: defined(undef) == 0, and defined(undef) eq "", but "" == 0 isn't a numeric comparison, and "" eq 0 is false, as is "" eq "0". undef == 0 is true but produces a warning, while undef eq 0 is false.

      I'm putting those there in case anyone else ever comes across this bonanza of "false" comparisons.
        Seems like there's a huge difference on readability, not even getting into when you have many alternations.

        Definitely, but there are some mechanisms to make regexes more readable, such as /x (as you're already using) and the things I mentioned here, including precompiled regexes via qr//, which you can interpolate into other regexes, Building Regex Alternations Dynamically, or even advanced features like (?(DEFINE) ...) (perlre).

        my $re1 = qr{ ... }msx;
        my $re2 = qr{ ... }msx;
        my $big_re = qr{ (?: $re1 | $re2 ) }msx;
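As a filled-in sketch of the same idea: the sub-patterns and the keyword list below are hypothetical stand-ins, since the patterns above are elided. The second half builds an alternation from a list, quoting each word and putting longer words first:

```perl
use strict;
use warnings;

# Hypothetical building blocks, precompiled with qr//.
my $num_re  = qr{ \d+    }msx;
my $word_re = qr{ [a-z]+ }msx;

# Interpolate the precompiled pieces into a larger regex.
my $big_re = qr{ \A (?: $num_re | $word_re ) \z }msx;

# Build an alternation from a (hypothetical) list of keywords.
my @keywords = ('this', 'that', 'third_thing');
my $alt = join '|',
          map  { quotemeta($_) }
          sort { length($b) <=> length($a) } @keywords;
my $kw_re = qr{ \A (?: $alt ) \z }msx;

for my $candidate ('42', 'abc', '4b', 'third_thing') {
    printf "%-12s big_re=%s kw_re=%s\n", $candidate,
        ($candidate =~ $big_re ? 'yes' : 'no'),
        ($candidate =~ $kw_re  ? 'yes' : 'no');
}
```

Sorting by descending length is a common precaution so that a short keyword cannot shadow a longer one that starts with the same characters.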
        so if you were looking for numeric characters, you can't use if($1)

        As far as I can tell from what you've written so far, you seem to be very interested in whether a capture group matched something or not. This should make named capture groups, as I mentioned before, more interesting:

        use warnings;
        use strict;
        use Data::Dump qw/dd/; # for debugging

        my $re = qr{
            ^ \s*               # beginning of line
            (?<name> \w+ )      # the variable name
            \s* = \s*           # equals
            (?:                 # one of the following (
                (?<num> \d+ )   # a number
            |                   # or
                (?<str> \w+ )   # a word
            )                   # )
            \s* $               # end of line
        }msx;

        my @lines = split /\n/, <<'SAMPLE_INPUT';
        foo=bar
        quz = 5
        SAMPLE_INPUT

        for my $line (@lines) {
            $line =~ $re or die "Failed to parse '$line'";
            dd \%+; # debug
            print "Match! Name: '$+{name}'\n";
            if (exists $+{num}) {
                print "It was a number: '$+{num}'\n"
            }
            elsif (exists $+{str}) {
                print "It was a string: '$+{str}'\n"
            }
            else { die "internal error: neither str nor num" }
        }

        __END__

        {
          # tied Tie::Hash::NamedCapture
          name => "foo",
          str => "bar",
        }
        Match! Name: 'foo'
        It was a string: 'bar'
        {
          # tied Tie::Hash::NamedCapture
          name => "quz",
          num => 5,
        }
        Match! Name: 'quz'
        It was a number: '5'

        Update: I'm not sure when you made your "Edit" but I didn't see it until later. The explanation for the behavior you are seeing is this (note I'm ignoring overloading here):

        • Numeric comparisons like ==, !=, >, etc. cause their arguments to be taken as numbers. This means:

          • undef is converted to 0 but is subject to a warning.
          • "" is not a number so it is subject to a warning, and is converted to 0.
          • "0" is converted to 0.
          • 0 is already a number and doesn't need to be converted.
          • Perl's "false" (!1, including defined(undef)) already has a numeric value of 0, so that is used.
          • Perl will attempt to convert any other string into a number, warning if it cannot do so cleanly. The string "0 but true" is special-cased to be exempt from this warning.
        • String comparisons like eq, ne, gt etc. cause their arguments to be taken as strings. That means:

          • undef is converted to "" but is subject to a warning.
          • "", "0", and "0 but true" are already strings and don't need to be converted.
          • 0 is converted to "0", and of course any other number is stringified.
          • Perl's "false" (!1, including defined(undef)) already has a string value of "", so that is used.

          This is why "" eq 0 and undef eq 0 are false, because they're both the same as "" eq "0".

        See Relational Operators and Equality Operators. As for why you shouldn't use these operators to check boolean values, I've already explained that elsewhere.
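The conversion rules listed above can be checked with a small sketch like the following. The variable names are illustrative, and the warnings those rules mention are switched off inside the block so that only the results are printed:

```perl
use strict;
use warnings;

my $undef;                     # an undefined value
my $false = defined $undef;    # Perl's built-in "false" value

my %result;
{
    # Silence the "uninitialized" and "isn't numeric" warnings.
    no warnings qw(uninitialized numeric);
    %result = (
        'undef == 0'        => ($undef == 0        ? 1 : 0),
        '"" == 0'           => ("" == 0            ? 1 : 0),
        '"0" == 0'          => ("0" == 0           ? 1 : 0),
        'false == 0'        => ($false == 0        ? 1 : 0),
        '"0 but true" == 0' => ("0 but true" == 0  ? 1 : 0),
        'undef eq 0'        => ($undef eq 0        ? 1 : 0),
        '"" eq 0'           => ("" eq 0            ? 1 : 0),
        '"" eq "0"'         => ("" eq "0"          ? 1 : 0),
        'false eq ""'       => ($false eq ""       ? 1 : 0),
    );
}

printf "%-18s => %d\n", $_, $result{$_} for sort keys %result;
```

All of the numeric comparisons come out true, while the three string comparisons against 0 or "0" come out false because an empty string is not the string "0".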

Re^3: Regex result being defined when it shouldn't be(?)
by AnomalousMonk (Archbishop) on Nov 14, 2017 at 19:00 UTC
    ((this)?|(that)?|(third_thing)?)
    ...
    ... I don't even trust it. ... there's probably 4 capture groups created as a result.

    Just as an aside, the  (?|(pat)|(te)|(rn)) "branch reset" pattern introduced with Perl version 5.10 will suppress the creation of a slew of captures in a case like this:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my $s = 'apathetic';
     ;;
     my @captures = $s =~ m{ (pat) | (te) | (rn) }xms;
     dd \@captures;
     ;;
     @captures = $s =~ m{ (?| (pat) | (te) | (rn)) }xms;
     dd \@captures;
     "
    ["pat", undef, undef]
    ["pat"]
    See Extended Patterns in perlre.
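Applied to the words from the original question, a branch-reset sketch might look like this (the variable names are hypothetical, and Perl 5.10 or later is assumed):

```perl
use 5.010;
use strict;
use warnings;

# Ordinary grouping: each alternative gets its own capture number.
my $plain  = qr{ \A (?: (this) | (that) | (third_thing) ) \z }msx;

# Branch reset: every alternative captures into the same group.
my $branch = qr{ \A (?| (this) | (that) | (third_thing) ) \z }msx;

my @plain_caps  = 'that' =~ $plain;
my @branch_caps = 'that' =~ $branch;

print "plain:  ", join(', ', map { defined $_ ? "'$_'" : 'undef' } @plain_caps),  "\n";
print "branch: ", join(', ', map { defined $_ ? "'$_'" : 'undef' } @branch_caps), "\n";
```

With branch reset, whichever alternative matched ends up in $1, so the calling code does not need to know which branch was taken.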


    Give a man a fish:  <%-{-{-{-<
