PerlMonks  

Regex result being defined when it shouldn't be(?)

by chenhonkhonk (Acolyte)
on Nov 14, 2017 at 15:20 UTC ( #1203386=perlquestion )

chenhonkhonk has asked for the wisdom of the Perl Monks concerning the following question:

Perl 5.26, Strawberry on Windows. I'm writing a parser to decode an options file, effectively into Perl-like assignments. A primitive config might look like:

@arr[2] = 5
var = 10

The example regex block is:

sub parse_options { # \%options_we_can_write, \@lines_of_options_file. [ $die_on_extra_options ]
    #loads all matching values from an array of lines into the options hash
    #uses '=' as the separator, one ( 1 ) space on either side
    my ($options_we_can_write, $options_file_lines, $die_on_extra_options) = @_;
    my $var = "";
    my $sigil = "";
    my $val = "";
    my $setter = "";
    my $arraything = "";
    my $fd = "";
    my $rd = "";
    my $index = "";
    if( ! %{$options_we_can_write} ){
        die "no options to match\n"
    }
    foreach ( @{ $options_file_lines } ){
        if( $_ =~ m/
            ^([@%\$]?)      # open sigil
            ([a-zA-Z_]+)    # variable
            (\[)?           #optional open bracket
            ([\d]*)         #optional number
            (])?            #optional close bracket
            \               # space
            =               # equals
            \               # space
            ([a-zA-Z0-9_\\\/:]+)$ # string until end of line
            /x #end regex
        ){ #startif
            #print "##$1## ##$2## ##$3## ##$4## ##$5## ##$6##\n"; #complains about undefined values
            if( defined ($sigil = $1) eq "" ){ $sigil = '$' }
            print "####sigil $sigil ##>$1<##\n";
            $var = $2;
            if( defined ($fd = $3) eq "" ){ $fd = "" };
            if( defined ($index = $4) eq "" ){ $index = "" }
            if( defined ($rd = $5) eq "" ){ $rd = "" };
            if ( ($fd eq '[' and $rd ne ']') or ($fd eq "{" and $rd ne '}') ){
                print STDERR "Mismatched delimiters for $var, skipping\n";
                next;
            }
            if ( ($fd eq '[' and $rd eq ']') and ( ! is_whole_number($index) ) ){
                print STDERR "array assignment $var needs whole number index\n";
                next;
            }
            # if ($fd eq "{" and $rd != '}' ){
            #     print STDERR "Error bad delimiter for $var";
            # }
            $setter = "$sigil\{ \$options_we_can_write->{$var} }$fd$index$rd = $val";
            print "$setter\n";
            if( exists $options_we_can_write->{"$var"} ){
                eval $setter;
            }
            next;
        }
        if( $_ =~ m/^(.*?) = (.*)$/ ){
            $var = $1;
            $val = $2;
            if( exists $options_we_can_write->{"$var"} ){
                ${ $options_we_can_write->{"$var"} } = $val;
            } else {
                print "Error: desired option $var not found\n";
            }
        }
        #print "$_\n";
    }
    return 0;
}
It's obviously a work in progress, but $1 always seems to be set to "" when there's no match for it, rather than undef as the other values are. For example:
if( defined ($fd = $3) eq "" ){ $fd = "" };
can be changed to:
if( defined ($fd = $3) eq "" ){ $fd = "ssssssssss" };
and the ssssssss gets shown in 'print $setter' for an assignment without a front-delimiter. I first thought it was because I had "\$" vs '$' but that made no difference. If I explicitly assign $sigil right after the defined line it is assigned, but the else is never executed for $1, unlike $3 and $5. There are no extra characters being passed (\r, \n, etc)

Replies are listed 'Best First'.
Re: Regex result being defined when it shouldn't be(?)
by haukex (Bishop) on Nov 14, 2017 at 15:36 UTC

    I haven't fully evaluated or tested your code, but a couple of comments and, if I understood correctly, the answer to your question:

    • For longer regexes, in addition to /x as you're already doing, I strongly recommend using different delimiters and especially named capture groups (perlre) and %+, as in:
      my $regex = qr{ (?<foo> fo+ ) }msx;
      "barfooobar" =~ $regex;
      print "<",$+{foo},">\n"; # prints "<fooo>"
    • In the regex you showed, if the overall regex matches, then ([@%\$]?) and ([\d]*) will not return undef but the empty string "", because those capture groups will always match at least () (that is, the empty string "").
    • Your expressions like if( defined ($sigil = $1) eq "" ) don't make much sense to me, because you're testing the return value of defined, a boolean value, against the empty string. If you just want to check for definedness, then write if( defined($sigil = $1) ), and if you want the assignment to $sigil to happen only if $1 is defined, then write if( defined $1 ) { $sigil = $1; ...
    • You might be interested in my module Config::Perl ;-)
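A minimal sketch of how the points above might combine, assuming a simplified version of the pattern from the question. It uses named captures read back through %+, keeps the quantifier inside the sigil group (so that group is "" rather than undef when absent), and applies the default with a plain length test. The sub name parse_line and the layout of the returned hash are hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Simplified form of the pattern from the question, with named captures.
my $line_re = qr{
    ^ (?<sigil> [\@%\$]? )          # quantifier inside: empty string when absent
      (?<name>  [a-zA-Z_]+ )
      (?: \[ (?<index> \d+ ) \] )?  # quantifier outside: undef when absent
      [ ] = [ ]
      (?<value> [a-zA-Z0-9_\\/:]+ ) $
}x;

# Hypothetical helper: returns a hash ref describing the line, or undef.
sub parse_line {
    my ($line) = @_;
    return undef unless $line =~ $line_re;
    my %parsed = (
        sigil => length($+{sigil}) ? $+{sigil} : '$',  # default for an empty sigil
        name  => $+{name},
        index => $+{index},                            # may be undef
        value => $+{value},
    );
    return \%parsed;
}

for my $line ('@arr[2] = 5', 'var = 10') {
    my $p = parse_line($line) or next;
    printf "%s%s index=%s value=%s\n",
        $p->{sigil}, $p->{name},
        defined $p->{index} ? $p->{index} : '(none)', $p->{value};
}
```

Testing the sigil with length rather than defined reflects the second bullet: that group always participates in a successful match, so only its length can tell you whether a sigil was present.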
      P.p.s: After thinking about why I would've been using the quantifiers outside vs inside, separate from maybe capturing only one repetition of a group, I figured it out:

      Alternations. If you wanted a word among multiple choices but only 0-1 times you have a sort of choices:
      (this|that|third_thing)?
      ((this)?|(that)?|(third_thing)?)
      The first one is pretty clear, I want 0 or 1 of any of those words. It will return undef if I have 0.

      The second one, I don't even trust it. I think I could match all 3 if they happen in a row. Additionally, there's probably 4 capture groups created as a result.

      A quick search on if I had used 'alternation' properly: https://docstore.mik.ua/orelly/perl4/prog/ch05_08.htm
      "When you apply the ? to a subpattern that captures into a numbered variable, that variable will be undefined if there's no string to go there. If you used an empty alternative, it would still be false, but would be a defined null string instead."
        The second one, I don't even trust it. I think I could match all 3 if they happen in a row.

        No, it's fine, it reads like so: Match one of the three choices: "this" or "", "that" or "", or "third_thing" or "". Just like in your first example, the parentheses and alternation operator make sure that it will match only one of the three choices at that place in the regex.

        Additionally, there's probably 4 capture groups created as a result.

        Correct, but you can use non-capturing (?: ) parens to avoid that, i.e. ((?:this)?|(?:that)?|(?:third_thing)?) would make it have only one capturing group, like your first example. <update> And AnomalousMonk made an excellent point about (?| ) here. </update>

        I'd recommend a read of perlrequick, perlretut, and perlre for all of these features and the ones I mentioned earlier. Also, for playing around with regexes and testing out what they do, see my post here.
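A small sketch for comparing the three alternation styles discussed above. The labels and the test inputs are chosen only for illustration, and the anchors ^ and $ are an addition so that the alternation has to account for the whole string (without them, an optional first alternative can succeed on the empty string before the later alternatives are tried).

```perl
use strict;
use warnings;

# Each entry: [ label, compiled pattern ]. Labels are illustrative only.
my @patterns = (
    [ 'quantifier outside'   => qr/^(this|that|third_thing)?$/ ],
    [ 'quantifiers inside'   => qr/^((this)?|(that)?|(third_thing)?)$/ ],
    [ 'non-capturing inside' => qr/^((?:this)?|(?:that)?|(?:third_thing)?)$/ ],
);

for my $input ('that', '') {
    for my $entry (@patterns) {
        my ($label, $re) = @$entry;
        # In list context, a successful match returns one value per capture group.
        my @captures = $input =~ $re;
        my @shown = map { defined $_ ? "'$_'" : 'undef' } @captures;
        printf "%-22s on '%s': %d group(s): %s\n",
            $label, $input, scalar(@captures), join(', ', @shown);
    }
}
```

For the empty input, the first pattern leaves its single group undefined, while the other two set their outer group to the empty string, which is the same inside-versus-outside distinction as in the original question.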

        ((this)?|(that)?|(third_thing)?)
        ...
        ... I don't even trust it. ... there's probably 4 capture groups created as a result.

        Just as an aside, the  (?|(pat)|(te)|(rn)) "branch reset" pattern introduced with Perl version 5.10 will suppress the creation of a slew of captures in a case like this:

        c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
        "my $s = 'apathetic';
         ;;
         my @captures = $s =~ m{ (pat) | (te) | (rn) }xms;
         dd \@captures;
         ;;
         @captures = $s =~ m{ (?| (pat) | (te) | (rn)) }xms;
         dd \@captures;
         "
        ["pat", undef, undef]
        ["pat"]
        See Extended Patterns in perlre.


        Give a man a fish:  <%-{-{-{-<

      I'm not doing something if it is defined, I'm doing it if it's NOT defined.
      An annoyance to me (I come from a C background) is a variable failing to be defined does NOT return 0 or a 'FALSE' definition, it returns "". For safety reasons and explicitness, I program in the explicit results of tests, i.e. defined $var eq "" or defined $var ne "". Using simply 'defined $var' and '! defined $var' isn't as clear about what Perl is doing internally.

      If I do print "$3" from a match on 'var = 10' I do not get the same as print "". Regex DO NOT return "" on failing to match, they return undef. After further testing, it appears the difference is where the quantifier comes in:
      use strict;
      use warnings;
      my $string = "string";
      if( $string =~ m/([5]?)string/ ){
          print "? inside group: $1\n"; #prints fine
      }
      if( $string =~ m/([5])?string/ ){
          print "? outside group: $1\n"; #Use of uninitialized value $1 in concatenation (.) or string...
      }
      exit 0;
      P.s. the reason I'm doing this manually is because I'm making it as portable as possible and sensible to me. I'm running Perl on Windows 7/8/10, modern Linux, a Debian 2.6.32, etc. Production environment with too many distributions, internal/external network, all that jazz. I already had an issue where a CPAN module I would've liked had some Linux-only make commands.
        An annoyance to me (I come from a C background) is a variable failing to be defined does NOT return 0 or a 'FALSE' definition, it returns "".

        Actually, that's not exactly what is going on. Perl has a special "false" value that is 0 when used in numeric context and "" in a string context, so in Perl if (boolean) and if (!boolean) are actually "explicit" tests for truth and falsehood for functions that return "true" and "false" values (this applies to just about every builtin, of course there are some rare special cases). Have a look at Truth and Falsehood. Once you get used to this, I hope you'll find if (!defined(...)) (or any of its variants like if (not defined(...)) or unless (defined(...))) more natural. At least personally, I was initially confused when I read if ( defined($x = $1) eq "" ), and I thought you might accidentally be misapplying an idiom like if ( (my $x = $1) eq "foo" ) (which does the assignment and then the comparison).
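A short sketch of the special false value described above, assuming nothing beyond core Perl. The helper names is_undef_explicit and is_undef_negated are hypothetical and exist only to put the two spellings of the test side by side.

```perl
use strict;
use warnings;

my $unset;                       # never assigned, so undef
my $result = defined $unset;     # Perl's special false value

# The same value reads as an empty string in string context and as 0 in
# numeric context, and the numeric use does not trigger a warning.
print "string context: '$result'\n";
print "numeric context: ", $result + 0, "\n";

# Hypothetical helpers, only for this illustration.
sub is_undef_explicit { my ($v) = @_; return defined($v) eq "" ? 1 : 0 }
sub is_undef_negated  { my ($v) = @_; return !defined($v)      ? 1 : 0 }

for my $case ( [ 'undef' => undef ], [ 'empty string' => '' ], [ 'zero' => 0 ] ) {
    my ($label, $value) = @$case;
    printf "%-12s explicit=%d negated=%d\n",
        $label, is_undef_explicit($value), is_undef_negated($value);
}
```

Both helpers agree for every input, so the comparison against "" adds no information over the plain negation.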

        If I do print "$3" from a match on 'var = 10' I do not get the same as print "". Regex DO NOT return "" on failing to match, they return undef. After further testing, it appears the difference is where the quantifier comes in:

        Right, which is why I left your $3, that is (])?, out of my explanation, and explicitly referred to your $1 (([@%\$]?)), which you were asking about :-)

        ... portable ... I already had an issue where a CPAN module I would've liked had some Linux-only make commands.

        According to CPAN Testers, Config::Perl runs on Linux, MSWin32, Cygwin, Darwin (Mac OS X), and various *BSD, and from Perl versions 5.8.1 thru 5.26.1.

        Update 2019-08-17: Updated the link to "Truth and Falsehood".

        defined $var or equivalently defined($var) will return the integer 1 (which is a TRUE value; 1 is also TRUE in C, so this shouldn't confuse you) if the variable is defined. It will return a FALSE value (see haukex's answer) if the variable is undefined. You then take that value, either 1 or the FALSE value, and stringify it. The integer 1 stringifies into "1". The FALSE value stringifies into "". If you don't want FALSE to become "", don't stringify. (The eq operator is forcing the stringification on both its arguments.)

        If you really just want a boolean that decides whether the $var is defined or not, just use the truthiness of the result of defined $var -- that is explicitly the boolean test for whether the $var is defined, and the defined $var and !defined $var syntax are explicitly saying "variable is defined" and "variable is not defined". This is similar to C: if you define a function int is_five(int x) { return (x==5); }, then the return value of is_five(var) and !is_five(var) are explicit ways of testing whether or not the variable is 5. From your claim, in C, I would have to write is_five(var)==1 to verify that var is 5, and is_five(var)==0 to verify that var is not 5, which I vehemently disagree with: that notation obfuscates what C is doing, not clarifies what it's doing internally. Just trust that Perl will do the right thing with boolean expressions in a boolean context, just like you trust that C does the right thing with boolean results in a boolean context.

        If it's the lack of parentheses that is confusing you, then use the parentheses.

        Aside: Urgh... I did one last refresh before hitting create, and saw that haukex beat me by a minute or two again. :-(. I went to all the trouble of writing this up, so I'll hit create anyway.

        update: I was wrong: defined($var) doesn't return undef or 1; it returns the special value, as haukex said.

        c:> perl -le "print defined($x)//'<undef>'; print defined($x)||'untrue'"

        untrue
        c:>

Re: Regex result being defined when it shouldn't be(?)
by choroba (Archbishop) on Nov 14, 2017 at 16:42 UTC
    There are modules on CPAN that can help you build a parser. For example, you can use Marpa::R2 in the following way:
    #!/usr/bin/perl
    use warnings;
    use strict;

    use Marpa::R2;

    my $dsl = << '__DSL__';

    lexeme default = latm => 1
    :default ::= action => ::first

    Config     ::= Assignment Config                          action => merge
                |  Assignment                                 action => create_config
    Assignment ::= Var (space equals space) Value (space)     action => assign
    Value      ::= number | String
    String     ::= (quote) Quoteds (quote)
    Quoteds    ::= Quoted Quoteds                             action => concat
                |  Quoted
    Quoted     ::= nonquote
                |  quotedquote                                action => quote
    Var        ::= Name || Array
    Array      ::= atsign Name Index                          action => name_index
    Name       ::= alpha alnum                                action => concat
    Index      ::= (leftsquare) number (rightsquare)

    space       ~ [\s]*
    alnum       ~ [\w]+
    alpha       ~ [[:alpha:]]
    atsign      ~ '@'
    equals      ~ '='
    leftsquare  ~ '['
    nonquote    ~ [^']
    number      ~ [\d]+
    quotedquote ~ '\'[']
    quote       ~ [']
    rightsquare ~ ']'

    __DSL__

    sub concat { $_[1] . $_[2] }
    sub name_index { [ $_[2], $_[3] ] }
    sub quote { "'" }
    sub assign { [ ref $_[1] ? @{ $_[1] } : $_[1], $_[2] ] }
    sub merge {
        my %config = %{ $_[-1] };
        (2 == @{ $_[1] } ? $config{ $_[1][0] }
                         : $config{ $_[1][0] }[ $_[1][1] ]) //= $_[1][-1];
        return \%config
    }
    sub create_config {
        my %config;
        $config{ $_[1][0] } = @{ $_[1] } == 2
                            ? $_[1][1]
                            : do { my $ar = []; $ar->[ $_[1][1] ] = $_[1][2]; $ar };
        \%config
    }

    my $grammar = 'Marpa::R2::Scanless::G'->new({source => \$dsl});
    my $input = do { local $/; <DATA> };
    my %config = %${ $grammar->parse(\$input, 'main') };

    use Data::Dumper; print Dumper \%config

    __DATA__
    @arr[2] = 3
    str = 'xyz'
    @arr[2] = 5
    str = 'abc\'d'
    @arr[1] = '#+#:'

    Output:

    $VAR1 = {
              'str' => 'abc\'d',
              'arr' => [
                         undef,
                         '#+#:',
                         '5'
                       ]
            };
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Node Type: perlquestion [id://1203386]
Approved by haukex