PerlMonks  

difference in regex

by ovedpo15 (Monk)
on May 29, 2018 at 12:56 UTC (#1215361)

ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys
Consider the following string: "a,b,c,d,5"
The format of the string is like this: "substr,substr,substr,...,value"
I use regex to check the string:

    my ($value) = ($row =~ /.*,(.*)/);   # gets value after the last comma
    if (looks_like_number($value)) {
        ($row =~ s/,[^,]*$//);           # gets substring before the last comma
        # DO STUFF ...
    }
    # DO STUFF ...
It works fine but it doesn't look very good.
I can't understand why in my ($value) = ($row =~ /.*,(.*)/); I need the brackets around the scalar, but in ($row =~ s/,[^,]*$//); I don't.
In other words, why is there a syntax difference between the following two lines:

    my ($value) = ($row =~ /.*,(.*)/);
    my ($val) = ($row =~ s/,[^,]*$//);


Testing: my $row = "a,b,c,d,15";
Output of first line: 15
Output of second line: 1 (why not a,b,c,d?)
How to do it in the same way?

Replies are listed 'Best First'.
Re: difference in regex
by haukex (Chancellor) on May 29, 2018 at 14:03 UTC

    You will find the answer to your question in "Regexp Quote-Like Operators" in perlop - basically, different regex operations have different return values in different contexts. See also perlretut for a tutorial.

    Operation | Context | () Capturing Groups | Return Value on Match (and notes on behavior) | Return Value on Failure
    m//   | scalar | -   | true | false
    m//g  | scalar | -   | true (each execution of m//g finds the next match, see "Global matching" in perlretut) | false if there is no further match
    m//   | list   | no  | the list (1) | the empty list ()
    m//g  | list   | no  | a list of all the matched strings, as if there were parentheses around the whole pattern | the empty list ()
    m//   | list   | yes | a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1, $2, $3...) | the empty list ()
    m//g  | list   | yes | a list of the substrings matched by any capturing parentheses in the regular expression, that is, ($1, $2...) repeated for each match | the empty list ()
    s///  | -      | -   | the number of substitutions made | false
    s///r | -      | -   | a copy of the original string with substitution(s) applied (available since Perl 5.14) | the original string

    Examples, in the same order as the table rows:

        # m//, scalar context
        my $x = "foobar"=~/[aeiou]/;   # => $x is true
        my $y = "foobar"=~/[xyz]/;     # => $y is false

        # m//g, scalar context
        my $str = "foobar";
        my $x = $str=~/[aeiou]/g;      # matches first "o"  => $x is true, pos($str) is 2
        $x = $str=~/[aeiou]/g;         # matches second "o" => $x is true, pos($str) is 3
        $x = $str=~/[aeiou]/g;         # matches "a"        => $x is true, pos($str) is 5
        $x = $str=~/[aeiou]/g;         # no more matches    => $x is false, pos($str) is undef

        # m//, list context, no capturing groups
        my ($x) = "foobar"=~/[aeiou]/;   # => $x is 1

        # m//g, list context, no capturing groups
        my ($x,$y,$z) = "foobar"=~/[aeiou]/g;   # => $x is "o", $y is "o", $z is "a"

        # m//, list context, with capturing groups
        my ($x,$y) = "foobar"=~/([aeiou])(.)/;   # => $x is "o", $y is "o"

        # m//g, list context, with capturing groups
        my ($w,$x,$y,$z) = "foobar"=~/([aeiou])(.)/g;   # => $w is "o", $x is "o", $y is "a", $z is "r"

        # s///
        my $x = "foobar";
        my $y = $x=~s/[aeiou]/x/g;   # => $y is 3

        # s///r
        my $x = "foobar"=~s/[aeiou]/x/gr;   # => $x is "fxxbxr"

    In this table, "true" and "false" refer to Perl's notion of Truth and Falsehood. Remember not to rely on any of the capture variables like $1, $2, etc. unless the match succeeds!

    In my $foo = "bar"=~/a/;, the right-hand side of the assignment ("bar"=~/a/) is in scalar context. In my ($foo) = "bar"=~/a/; or my @foo = "bar"=~/a/;, the right-hand side is in list context. That's why, in your example, you need those parens in ($value): because you want the matching operation to return the contents of the capture group.
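As a quick check of the context rule, here is a minimal sketch using the string from the question. The variable names $matched, $captured, $count and $copy are only illustrative and do not come from the original post.

```perl
use strict;
use warnings;

my $row = "a,b,c,d,15";

# Scalar context: m// only reports whether the pattern matched.
my $matched = $row =~ /.*,(.*)/;

# List context (note the parens on the left): m// returns the captures.
my ($captured) = $row =~ /.*,(.*)/;

# s/// returns the number of substitutions in either context
# and modifies the string it is bound to, so work on a copy here.
my $copy = $row;
my ($count) = $copy =~ s/,[^,]*$//;

print "matched=$matched captured=$captured count=$count copy=$copy\n";
# prints: matched=1 captured=15 count=1 copy=a,b,c,d
```

The parentheses on the left-hand side change what m// returns, while for s/// they make no difference.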

    Note that your expressions can be slightly simplified, not all the parens you showed are needed:

    my ($value) = $row =~ /.*,(.*)/;
    # and
    $row =~ s/,[^,]*$//;

    A few additional comments on your code:

    • ($row =~ s/,[^,]*$//); # gets substring before the last comma - this comment isn't quite right, or is at least potentially misleading, since the substitution deletes the last comma and everything after it.
    • /.*,(.*)/ can match a comma anywhere in the string. For simple input strings it may behave correctly, but I'd strongly recommend coding more defensively and writing it like your second expression: my ($value) = $row=~/,([^,]*)$/; - the $ anchor makes sure that the regex only matches the last comma and what follows it (unless you use the /m modifier, since it changes the meaning of $).
    • While the use of Scalar::Util's looks_like_number is often a good idea, note that if you don't mind being a little more restrictive, Regexp::Common (or a hand-written regex) would allow you to combine the two regular expressions:
      use Regexp::Common qw/number/;
      my $row = "a,b,c,d,15";
      if ( $row=~s/,($RE{num}{real})$// ) {
          print "matched <$1>\n";
      }
      print "row is now <$row>\n";
      __END__
      matched <15>
      row is now <a,b,c,d>
    • If this is a CSV file, consider using Text::CSV (also install Text::CSV_XS for speed)
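To illustrate the point about coding defensively, here is a small sketch with a contrived input of my own (a record containing an embedded newline, which is an assumption and not something from the original question). It shows one case where the unanchored pattern picks up the wrong field while the anchored one does not.

```perl
use strict;
use warnings;

my $row = "a,b\nc,5";   # contrived input: a newline inside the record

# "." does not match a newline, so the leftmost match ends on the first line.
my ($loose) = $row =~ /.*,(.*)/;

# Anchored version: only the last comma and what follows it can match.
my ($anchored) = $row =~ /,([^,]*)$/;

print "loose=<$loose> anchored=<$anchored>\n";
# prints: loose=<b> anchored=<5>
```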

    Update: Added s///r to the table and added a few more doc links. A few other edits and updates. 2019-02-16: Added "Return Value on Failure" column to table, and a few other small updates. 2019-08-17: Updated the link to "Truth and Falsehood".

      Thank you for the reply!
      As I mentioned in one of the posts in this thread, I would like to split it somehow into two scalars. I can use my ($a,$b) = ($row=~ /(.*),(.*)/); but if $row doesn't have commas it won't work. How do I make it always put a string into $path?
      for example:
      if "abc" it will be $path = "abc" and $value is undefined.
      if "abc,5" it will be $path = "abc" and $value = 5
      if "a,b,c,5" it will be $path = "a,b,c" and $value = 5

        Although personally I'd still use a conditional, of course it's possible to do it all in one regex. One way is by making the comma optional by putting a ? on a group, in this case I'm using a non-capturing (?:...) group, and I had to make the first part of the regex non-greedy so that it doesn't swallow an existing comma:

        use warnings;
        use strict;
        use Test::More;

        my $regex = qr/ ^ (.*?) (?: , ([^,]*) )? $ /x;

        ok "abc"=~$regex;
        is $1, "abc";
        is $2, undef;

        ok "abc,5"=~$regex;
        is $1, "abc";
        is $2, 5;

        ok "a,b,c,5"=~$regex;
        is $1, "a,b,c";
        is $2, 5;

        done_testing;

        Update: An alternative that says a little more explicitly: either match a string with no commas in it, or, if there are commas, I want to match the thing after the last one:
        /^ (?| ([^,]*) | (.*) , ([^,]*) ) $/x
        Update 2: And it turns out this regex is much faster than the above! (try using it in this benchmark)
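Here is a short sketch of how the (?|...) branch-reset regex from the update behaves on the three sample strings. The helper name split_path_value is hypothetical and only there for illustration; the regex itself is the one quoted in the update. Branch reset needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (path, value); value is undef when
# the string contains no comma.
sub split_path_value {
    my ($row) = @_;
    # Inside (?|...) both alternatives number their groups from 1,
    # so $1 is always the path and $2 is the value (if any).
    my ($path, $value) = $row =~ /^ (?| ([^,]*) | (.*) , ([^,]*) ) $/x;
    return ($path, $value);
}

for my $row ("abc", "abc,5", "a,b,c,5") {
    my ($path, $value) = split_path_value($row);
    printf "%-8s => path=<%s> value=<%s>\n", $row, $path, $value // 'undef';
}
```

Because the first alternative only succeeds when there is no comma at all, the second alternative's greedy (.*) is what ends up splitting at the last comma.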

Re: difference in regex
by Athanasius (Bishop) on May 29, 2018 at 13:42 UTC

    Hello ovedpo15,

    In Perl, a function’s return value(s) may be different depending on the context in which the function is called. (Whether they are or not depends on the internal details of the function itself.) The statement my $value = $row =~ /.*,(.*)/; calls the regex operator m// in scalar context, so it returns true if the match succeeds and false if it fails. But in the statement my ($value) = $row =~ /.*,(.*)/; the parentheses around $value put the call to m// into list context and a list of the matches is returned.

    By contrast, the substitution operator s/// returns the number of substitutions made regardless of the calling context. But you can change this behaviour by adding an /r modifier to the substitution. This creates a copy of the string (in this case $row), applies the substitutions (if any) to the copy, and returns that copy. E.g.

    my $row = 'a,b,c,d,15';
    my $str = $row =~ s/,[^,]*$//r;
    # $str now contains 'a,b,c,d'

    See the sections m/PATTERN/msixpodualngc and s/PATTERN/REPLACEMENT/msixpodualngcer in perlop#Regexp-Quote-Like-Operators.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thank you for the fast reply.
      I tried to use the following regex my($path,$value) = ($row =~ /(.*),(.*)/); to split the string,
      but if there are no commas it won't work. Which regex should I use in order to always put the string into $path, so that I only need to check whether $value is defined?
      for example:
      if "abc" it will be $path = "abc" and $value is undefined.
      if "abc,5" it will be $path = "abc" and $value = 5
      if "a,b,c,5" it will be $path = "a,b,c" and $value = 5

      The algo I would like to implement :
      As I see it the steps are:
      1. if the string has commas:
      1.a. get the last comma and check if the last substring is a number - if so put it in hash like this: $hash{$path} = $value;
      1.b. if the substring after the last comma isn't a number - $hash{$path} = 1;
      2. if string has no commas: $hash{$string} = 1;


      how to implement this?

        Because ($path, $value) is a list, you get the list of submatches (list context). But if you do something like:

        if ($row =~ /(.*),(.*)/) { ... }
        since the if expects a boolean, the operation will return true if something matches, and false otherwise (boolean context). And you can still access the left and right part as $1 and $2. So you can do:
        my $path = $row;   # path is the full string by default
        if ($row =~ /(.*),(.*)/) {
            my $left_part = $1;
            my $value = $2;
            # Check if $value is a number and change $path if needed
            ...
        }

        The algo I would like to implement :
        As I see it the steps are:
        1. if the string has commas:
        1.a. get the last comma and check if the last substring is a number - if so put it in hash like this: $hash{$path} = $value;
        1.b. if the substring after the last comma isn't a number - $hash{$path} = 1;
        2. if string has no commas: $hash{$string} = 1;

        Although a good start, point 1.b. is unclear: in this case, do you want the whole string stored in $path, or just the part up until the last comma? For now I'm assuming the latter. Anyway, while there may always be "nicer" ways to write things in Perl (Update: and you haven't specified what you meant with "it doesn't look very good"), sometimes a good starting point is a direct translation:

        use warnings;
        use strict;
        use Scalar::Util qw/looks_like_number/;
        use Data::Dumper; # Debug

        my %hash;
        while (my $string = <DATA>) {
            chomp($string);
            # check if string has at least one comma, and at the same
            # time extract the value after the last comma
            if ( my ($path,$value) = $string=~/^(.*),([^,]*)$/ ) {
                if ( looks_like_number($value) ) {
                    $hash{$path} = $value;
                }
                else {
                    $hash{$path} = 1;
                }
            }
            else {
                $hash{$string} = 1;
            }
        }
        print Dumper(\%hash); # Debug

        __DATA__
        foo
        bar,x
        quz,5
        a,b,c,42

        Of course there's lots of potential for shortening that, e.g. by combining it with my example code from here. Update: A really simple shortening:

        while (<DATA>) {
            chomp;
            if ( /^(.*),([^,]*)$/ ) {
                $hash{$1} = looks_like_number($2) ? $2 : 1
            }
            else {
                $hash{$_} = 1
            }
        }
Re: difference in regex
by haj (Chaplain) on May 29, 2018 at 13:58 UTC

    A regular expression can tell you two things: Whether there's a match at all, and what some parts in the match are. You are using it in different ways, on two levels:

    $row =~  /.*,(.*)/ is a pattern match. It returns whether $row contains the pattern. If you have parentheses in the regex (and you have), then the part of the match within the parentheses is captured - and if you evaluate the pattern match in list context, these captures will be returned as a list. By writing my ($value) you create a list context, therefore you get whatever matched after the last comma.

    $row =~ s/,[^,]*$// is a substitution s/pattern/replacement/. Substitutions change the variable they operate upon, and they return the number of substitutions made, regardless of context. Hence the 1 in the second line: One substitution. You get the substring before the last comma in the variable $row by deleting the last comma and whatever follows it.
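A minimal sketch of that behaviour, with illustrative variable names of my own ($removed, $plain, $none): the return value tells you whether anything was replaced, and the result you actually want ends up in the variable itself.

```perl
use strict;
use warnings;

my $row = "a,b,c,d,15";
my $removed = $row =~ s/,[^,]*$//;   # one substitution made, $row is modified

my $plain = "abc";
my $none = $plain =~ s/,[^,]*$//;    # no comma, so nothing is substituted

print "removed=$removed row=<$row>\n";
print "plain is unchanged: <$plain>\n" unless $none;
# prints:
# removed=1 row=<a,b,c,d>
# plain is unchanged: <abc>
```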

    If you want the second example to behave like the first, add a capture, and replace the substitution by a match, like this:
    my ($val) = ($row =~ /(.*),[^,]*$/);

    A good reference for all this, and a lot more, is perlretut.
Re: difference in regex
by wjw (Priest) on May 29, 2018 at 13:57 UTC

    I sometimes go to the following to remind myself of how pattern matching works. It is usually enough to jar loose something in my cluttered memory to get me going. Cluttered Memory Shaker

    ...the majority is always wrong, and always the last to know about it...

    A solution is nothing more than a clearly stated problem...

Re: difference in regex
by Veltro (Friar) on May 29, 2018 at 13:15 UTC

    Hi ovedpo15

    A couple of really good examples regarding these kinds of matters have been posted recently here and here

Re: difference in regex
by haukex (Chancellor) on May 30, 2018 at 13:47 UTC

    Just for fun and TIMTOWTDI: If you happen to want speed, use rindex and substr.

    my $str = "a,b,c,5";
    my $i = rindex $str, ',';
    my ($path,$value);
    if ($i<0) {
        $path = $str
    }
    else {
        $path  = substr($str, 0, $i);
        $value = substr($str, $i+1)
    }
Re: difference in regex
by (anonymized user) (Curate) on May 30, 2018 at 10:46 UTC
    There is a syntax difference between the two because there is a syntax difference between the two. If you mean "why is there a difference in result", I would say that the first returns the result of the expression in round brackets, which is the last match, whereas the second, having no () returns whether or not a match was found. The second also performs substitution, unlike the first. To get a,b,c,d you simply want to place ^(.*) ahead of the ending match to trap it ... I am dropping the s/ in this suggestion because it is a side effect so far not justified in the OP. Also I tend to escape punctuation because it might have special meaning, so ...
    my ($val) = ($row =~ /^(.*)\,[^\,]*$/);
    (untested)
Re: difference in regex
by sundialsvc4 (Abbot) on May 29, 2018 at 14:27 UTC

    (After dropping up-votes on every single comment in this thread up to now ...)

    If what you literally want to do is to “split the string by commas and take the last piece,” what I would probably have done is to first split the string on a comma, then pop the last entry off the resulting array.   This will work whether-or-not there is actually a comma in the string, since in that case the array will contain only one entry.   I would prefer this approach because it represents a literal interpretation of how you originally described your objective, and because it’s how I am accustomed to see this sort of thing being done most of the time.   (The split function has many useful features; read the doc page in its entirety.)

      what I would probably have done is to first split the string on a comma, then pop the last entry off the resulting array.

      As in:

      my @array = split /,/, $string;
      my $value = pop @array;

      But you don't really need an array to do that because you can get the last value directly from the list that split returns:

      my $value = ( split /,/, $string )[ -1 ];

        “++”   Yes, that is a useful observation.   Thanks.

        Unrelated to your suggestion, I would also point out that the join function might be usefully applied to the array once you have popped the rightmost element off of it.   If, for instance, you wanted to obtain “a string consisting of the remaining elements if-any, separated by a comma or what-have-you,” join is just what the doctor ordered.
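For completeness, a sketch of how split, pop and join could be combined to get both pieces the OP asked for. The helper name split_last_field is hypothetical, and the choice to leave the value undefined when there is no comma follows the OP's examples rather than anything stated in this sub-thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (path, value); value is undef when
# the string contains no comma.
sub split_last_field {
    my ($string) = @_;
    my @parts = split /,/, $string, -1;   # limit -1 keeps trailing empty fields
    my $value = @parts > 1 ? pop @parts : undef;
    return ( join(',', @parts), $value );
}

for my $string ("abc", "abc,5", "a,b,c,5") {
    my ($path, $value) = split_last_field($string);
    printf "%-8s => path=<%s> value=<%s>\n", $string, $path, $value // 'undef';
}
```

Only popping when there is more than one element keeps a comma-free string in the path instead of moving it into the value.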

      I'm all for TIMTOWTDI, but I wondered about performance. And it turns out that while a split version is faster for short strings, performance suffers a lot the more commas there are in the string:

      use warnings;
      use strict;
      use Benchmark qw/cmpthese/;

      my $str = join ',', 'a'..'z', 5;
      my $exp_path = join ',', 'a'..'z';
      my $exp_value = 5;

      cmpthese(-2, {
          split => sub {
              my @x = split /,/, $str;
              my ($path, $value);
              $value = pop @x if @x>1;
              $path = join ',', @x;
              #die unless $path eq $exp_path && $value eq $exp_value;
          },
          regex => sub {
              my ($path, $value) = $str=~/^ (?| ([^,]*) | (.*) , ([^,]*) ) $/x;
              #die unless $path eq $exp_path && $value eq $exp_value;
          },
      } );

      __END__
                Rate split regex
      split 326127/s    --  -66%
      regex 966896/s  196%    --

      Update: This version using rindex beats both split and the regex.
