split versus =~

by russlo (Novice)
on Nov 10, 2022 at 15:12 UTC (#11148100=perlquestion)

russlo has asked for the wisdom of the Perl Monks concerning the following question:

The code (running under Perl 5, version 26, subversion 3) that I am working on is receiving a string from a database that is formatted like so: "X-Y-Z" where X, Y, and Z are all numbers. Originally, and for the longest time, X, Y, and Z were all positive numbers. Now there has been a request to allow for negative X's.

An example of the original code is as follows:

my ($X, $Y, $Z) = split("-", $data);

The code would then operate on $X, $Y, and $Z, as well as later using $data once more, in order to provide its final output. As there is now the possibility of negative values for X, splitting on a "-" has become problematic.
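To make the problem concrete, here is a minimal sketch of what the original split returns once X is negative. The sample strings are made up (the real data was not posted), and naive_fields is a hypothetical helper that exists only for this illustration.

```perl
use strict;
use warnings;

# Same split call as the original code, but returning every field so that
# nothing is hidden by the three-variable assignment.
sub naive_fields {
    my ($data) = @_;
    return split("-", $data);
}

# Made-up sample values; the real data was not posted.
for my $data ('1-2-3', '-1-2-3') {
    my @fields = naive_fields($data);
    print "$data => (", join(', ', map { "'$_'" } @fields), ")\n";
}
```

For '-1-2-3' the leading hyphen produces an empty first field, so the list is ('', '1', '2', '3'): $X ends up as the empty string, $Y as 1, $Z as 2, and the real Z is dropped.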

I proposed the following code:

my ($X, $Y, $Z) = ($data =~ /(-?\d*)-(\d*)-(\d*)/);

This uses a regular expression to describe more fully the data that may be found in $data. However, it has been found to not work exactly the same as the previous usage of split.

My question is: why? Additionally: what can I do to provide the correct splitting that we're looking for here? Thank you in advance.

Replies are listed 'Best First'.
Re: split versus =~
by davido (Cardinal) on Nov 10, 2022 at 16:09 UTC

    I can't think of a case where positive lookbehind to detect a digit preceding the hyphen wouldn't work:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Test::More;

    my @tests = (
        [ '1-2-3-4'       => [1,2,3,4]       ],
        [ '1--2-3-4'      => [1,-2,3,4]      ],
        [ '-1-2-3-4'      => [-1,2,3,4]      ],
        [ '1-2-3--4'      => [1,2,3,-4]      ],
        [ '-1--2--3--4'   => [-1,-2,-3,-4]   ],
        [ '15--20-25--30' => [15,-20,25,-30] ],
    );

    foreach my $test (@tests) {
        my ($raw, $want) = @$test;
        my @got = split /(?<=\d)-/, $raw;
        local $" = ', ';
        is_deeply \@got, $want, "'$raw' => [@$want]";
    }

    done_testing();

    This produces:

    ok 1 - '1-2-3-4' => [1, 2, 3, 4]
    ok 2 - '1--2-3-4' => [1, -2, 3, 4]
    ok 3 - '-1-2-3-4' => [-1, 2, 3, 4]
    ok 4 - '1-2-3--4' => [1, 2, 3, -4]
    ok 5 - '-1--2--3--4' => [-1, -2, -3, -4]
    ok 6 - '15--20-25--30' => [15, -20, 25, -30]
    1..6

    Dave

      These are great and pretty similar to what I would have supplied myself if I had thought the description of the problem wasn't self-explanatory. The actual data is possibly sensitive, so I can't just copy and paste it here. The one set of cases that you didn't have was where a number was missing from the input set ( '-2-3-4' ) - but I think I will actually look for and prevent that in the SQL statement looking for the data coming from the database.

      If the positive lookbehind doesn't work then I think I'll have to blame some other unknown causing the actual issue with the outputs being unexpected. I have a meeting later to determine if there is some other unknown cause here. Hopefully I can talk it through with those individuals then and arrive at a conclusion.

      Thank you.

        There's no number missing from '-2-3-4'; that's "-2, 3, 4". If it isn't, I don't think the rules can be consistent.


        Dave

Re: split versus =~
by tybalt89 (Monsignor) on Nov 10, 2022 at 16:16 UTC

    Like this?

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11148100
    use warnings;

    for ( glob join '-', ('{-,}42') x 3 )
      {
      my ($X, $Y, $Z) = /(-?\d+)-(-?\d+)-(-?\d+)/;
      printf "%11s => %3s %3s %3s\n", $_, $X, $Y, $Z;
      }

    Outputs:

    -42--42--42 => -42 -42 -42
     -42--42-42 => -42 -42  42
     -42-42--42 => -42  42 -42
      -42-42-42 => -42  42  42
     42--42--42 =>  42 -42 -42
      42--42-42 =>  42 -42  42
      42-42--42 =>  42  42 -42
       42-42-42 =>  42  42  42
      for ( glob join '-', ('{-,}42') x 3 )

      I like that glob ... to generate a list (as I have not personally seen it used much outside of collecting file paths).
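For anyone else who has only seen glob used on file paths, here is a small sketch of the same trick pulled out on its own. It relies on the csh-style brace expansion of Perl's built-in glob; the variable names are mine, not from the post above.

```perl
use strict;
use warnings;

# Build the pattern '{-,}42-{-,}42-{-,}42'.
# Each {-,} alternative expands to either '-' or the empty string.
my $pattern = join '-', ('{-,}42') x 3;

# The pattern contains no wildcards (*, ?, [...]), so the names come back
# from brace expansion alone; no matching files need to exist.
my @inputs = glob $pattern;

print "$pattern\n";
print "$_\n" for @inputs;
```

Three two-way alternatives give 2 x 2 x 2 = 8 strings, which is every combination of signs for the three fields. One caveat: glob splits its argument on whitespace, so this idiom is best kept to patterns without spaces.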

Re: split versus =~
by talexb (Chancellor) on Nov 10, 2022 at 18:19 UTC

    You've already got plenty of answers here, but I recommend the thing you take away from this discussion is "I dunno -- let's write a test to check out all of the possible inputs and all of the corresponding outputs that we expect to see."

    That way, when someone says, "I'm not sure it's working correctly" about your code, you can pull out your tests, and see whether you've covered that already.

    There should always be tests.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Re: split versus =~
by LanX (Sage) on Nov 10, 2022 at 15:40 UTC
    FWIW: split is using a regex, hence you could also try a negative lookbehind assertion (like /(?<!^)-/ ) to make sure a '-' at string's start isn't used for splitting.

    Demo in debugger using perl -de0

    DB<4> x split /(?<!^)-/, "X-Y-Z"
    0  'X'
    1  'Y'
    2  'Z'
    DB<5> x split /(?<!^)-/, "-X-Y-Z"
    0  '-X'
    1  'Y'
    2  'Z'
    DB<6>
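A short sketch comparing this lookbehind with the (?<=\d) variant suggested elsewhere in the thread. The helper name fields_with and the sample strings are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Split $str with the given pattern and join the fields with '|' for display.
sub fields_with {
    my ($re, $str) = @_;
    return join '|', split($re, $str);
}

my $not_at_start = qr/(?<!^)-/;    # any hyphen except one at the very start
my $after_digit  = qr/(?<=\d)-/;   # only a hyphen that directly follows a digit

for my $str ('-1-2-3', '1--2-3') {
    printf "%s\n  not-at-start: %s\n  after-digit:  %s\n",
        $str, fields_with($not_at_start, $str), fields_with($after_digit, $str);
}
```

Both patterns give -1|2|3 for a negative X, which is the only case the question asks about. They differ if a later field ever becomes negative: for '1--2-3' the first yields 1||2|3 (an empty field), while the second yields 1|-2|3.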

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

Re: split versus =~
by kcott (Archbishop) on Nov 10, 2022 at 20:20 UTC

    G'day russlo,

    Welcome to the Monastery.

    "My question is: why?"

    As others have pointed out, without any data, we can't really answer that. Here are a few possible reasons (non-exhaustive list):

    • You haven't anchored your regex. Your pattern could match anywhere in the string.
    • You're matching numbers that could be zero-length (\d*); a better choice would be \d+.
    • You say "X, Y, and Z are all numbers". Strictly speaking, you're matching (7-bit ASCII) digits; 1.23, 1e23, and so on are also numbers.
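The first two points can be seen directly with a few made-up inputs. This is only a sketch: the helper show and the sample strings are hypothetical, and the stricter pattern is one possible tightening, not necessarily what the real data needs.

```perl
use strict;
use warnings;

my $loose  = qr/(-?\d*)-(\d*)-(\d*)/;     # pattern from the question
my $strict = qr/^(-?\d+)-(\d+)-(\d+)$/;   # anchored, at least one digit per field

# Return the captures as a printable string, or 'no match'.
sub show {
    my ($re, $str) = @_;
    my @caps = $str =~ $re;
    return @caps ? join(', ', map { "'$_'" } @caps) : 'no match';
}

# Made-up inputs chosen to expose the differences.
for my $str ('-1-2-3', '1-2-', 'x1-2-3y', '1--2-3') {
    printf "%-8s loose: %-14s strict: %s\n",
        $str, show($loose, $str), show($strict, $str);
}
```

The loose pattern never complains: it returns an empty Z for '1-2-', quietly ignores the junk around 'x1-2-3y', and turns '1--2-3' into '1', '', '2'. The strict pattern accepts only '-1-2-3' from this list and reports no match for the rest.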

    For future reference, please provide a "Short, Self-Contained, Correct Example" and follow the guidelines in "How do I post a question effectively?".

    "Additionally: what can I do to provide the correct splitting that we're looking for here?"

    Comment your regex in full. By forcing yourself to document exactly what your regex does, you will more easily spot logic errors and typos. By writing your regex as I've done in the code below, it's very easy to make changes (e.g. at some future point perhaps Z can be negative or the string becomes "W-X-Y-Z"); fiddling around inside a regex which is jammed into a single string with no whitespace is highly error-prone.

    As others have already suggested, write a test script. In the code below, I added an "expect failure"; mostly to show you what that outputs. I also noted you mentioned a problem with '-2-3-4'; to be honest, I didn't follow what the problem was, but I added it for testing anyway. Add more tests if you encounter problem input that isn't handled by the regex; you may also need to alter the regex itself if it doesn't cover all eventualities.

    Note that with the way I've written the code, you can just add to @tests without needing to change any other part of the code.

    You should also provide some validation and error reporting. What happens if the input doesn't match the regex? — on-screen warning? logfile entry? kill the script?
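As one possible shape for that validation, here is a minimal sketch. parse_xyz is a hypothetical helper, not part of the original script, and the choice to warn and return an empty list (rather than die or write to a logfile) is only an assumption about what suits the application.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($X, $Y, $Z) on success,
# or an empty list after emitting a warning.
sub parse_xyz {
    my ($data) = @_;
    if (defined $data and $data =~ /\A(-?\d+)-(\d+)-(\d+)\z/) {
        return ($1, $2, $3);
    }
    warn sprintf "Unparseable X-Y-Z value: %s\n",
        defined $data ? "'$data'" : 'undef';
    return;
}

# Made-up sample values: one valid, one with a negative Y that is rejected.
for my $data ('-10-20-30', '10--20-30') {
    if (my ($X, $Y, $Z) = parse_xyz($data)) {
        print "X=$X Y=$Y Z=$Z\n";
    }
    else {
        print "skipped '$data'\n";
    }
}
```

The list assignment inside the if is false when parse_xyz returns nothing, so the caller can branch on success without a separate flag. In a CGI script, warn usually ends up in the web server's error log rather than on the page, but that depends on the server configuration.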

    Here's my test script:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use constant {
        STR => 0,
        EXP => 1,
    };

    use Test::More;

    my @tests = (
        ['1-2-3', '123'],
        ['-1-2-3', '-123'],
        ['1--2-3', ''],
        ['1-2--3', ''],
        ['1--2--3', ''],
        ['-1--2-3', ''],
        ['-1-2--3', ''],
        ['-1--2--3', ''],
        ['1-2-', ''],
        ['-1-2-', ''],
        ['garbage', ''],
        ['expect', 'failure'],
        ['-2-3-4', '-234'],
    );

    plan tests => 0+@tests;

    my $re = qr{(?x:
        ^           # start of string
        (           # start capture X
            -?      # optional leading minus
            \d+     # 1 or more digits
        )           # end capture X
        -           # required hyphen
        (           # start capture Y
            \d+     # 1 or more digits
        )           # end capture Y
        -           # required hyphen
        (           # start capture Z
            \d+     # 1 or more digits
        )           # end capture Z
        $           # end of string
    )};

    for my $test (@tests) {
        my ($X, $Y, $Z, $got) = ('') x 4;

        if (($X, $Y, $Z) = $test->[STR] =~ $re) {
            $got = "$X$Y$Z";
        }

        ok($got eq $test->[EXP],
            "Testing '$test->[STR]' is " .
            (length $test->[EXP] ? 'GOOD' : 'BAD')
        );
    }

    And here's the output:

    1..13
    ok 1 - Testing '1-2-3' is GOOD
    ok 2 - Testing '-1-2-3' is GOOD
    ok 3 - Testing '1--2-3' is BAD
    ok 4 - Testing '1-2--3' is BAD
    ok 5 - Testing '1--2--3' is BAD
    ok 6 - Testing '-1--2-3' is BAD
    ok 7 - Testing '-1-2--3' is BAD
    ok 8 - Testing '-1--2--3' is BAD
    ok 9 - Testing '1-2-' is BAD
    ok 10 - Testing '-1-2-' is BAD
    ok 11 - Testing 'garbage' is BAD
    not ok 12 - Testing 'expect' is GOOD
    #   Failed test 'Testing 'expect' is GOOD'
    #   at ./pm_11148100_re_parse.pl line 55.
    ok 13 - Testing '-2-3-4' is GOOD
    # Looks like you failed 1 test of 13.

    — Ken

Re: split versus =~ (updated)
by LanX (Sage) on Nov 10, 2022 at 15:28 UTC
    > this has been found to not work exactly the same as the previous usage of split.

    in which cases?

    > My question is: why?

    Without data you keep us guessing...

    Anyway, here are some thoughts:

    • I would always surround the pattern with the anchors ^ and $ to make sure the whole string is matched.
    • change * to + to enforce at least one digit. (unless empty fields are possible)
    • check for success and log failing input
    update

    see also this alternative

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

      Honestly, what happens to the data after this step is a bit of a mystery to me. $X, $Y, and $Z are passed through some impressively scientific mathematical translations that I simply do not understand nor do I have the time to. The code is converting from polar coordinates to rectangular coordinates, pushing those values into a vector, performing some math, and then stuffing everything into several references to global variables in various states. Calling it ugly is a disservice to ugly things. The people that wrote this are so much more intelligent than I am that they did not feel the need to comment the code.

      I think I was looking for an answer like "when you use split, it doesn't fibulate the $data into the cranjux, you just need to use a different delimiter is all." But life can't be simple.

      I'll try using the anchors. I had tried using + over *. If nothing else works I'll try using a different delimiter coming back from the database and then converting it back just after this step. This is a CGI script (written in 2016, so that should give you an indication of the mentality here). I don't fully understand the data, so logging is not quite as helpful as you might think, and printing to the page could cause the whole thing to fall over.

      Thanks anyways. Please know that my sarcasm was not directed at you.

        If nothing else works I'll try using a different delimiter coming back from the database and then converting it back just after this step.

        I think I would recommend doing that anyway, and doing it first. Something like a colon would be a good choice.

        This has the advantage that you can easily (I assume) test it with positive numbers to ensure it doesn't break the old use cases, and no adjustment should be needed to handle negative numbers.

        The fact that you need to "convert it back just after this step" implies that some other code will also need to extract the numbers again. That code also needs to be examined to see if it will cope with negative numbers, and my initial guess would be that it will not. I don't have enough information to guess if that code is later in the script that is under your responsibility, or in the code it hands off to.

        Unless there are compelling reasons not to, I would recommend changing to a different delimiter for the later parts as well.
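A minimal sketch of the delimiter change, assuming the SQL can be altered to return 'X:Y:Z' instead of 'X-Y-Z'. The colon, the helper name split_xyz, and the sample values are all illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: with ':' as the delimiter, a plain split is
# unambiguous even when fields carry a minus sign.
sub split_xyz {
    my ($data) = @_;
    return split /:/, $data;
}

# Made-up sample values covering the old and the new cases.
for my $data ('10:20:30', '-10:20:30', '-10:-20:-30') {
    my ($X, $Y, $Z) = split_xyz($data);

    # Rebuild the hyphenated form for later code that still expects it.
    my $legacy = join '-', $X, $Y, $Z;

    print "$data => X=$X Y=$Y Z=$Z (legacy: $legacy)\n";
}
```

Note that the rebuilt legacy string carries the same ambiguity as before (for example -10--20--30), which is why any later code that re-parses it deserves the same scrutiny.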

        > Honestly, what happens to the data after this step is a bit of a mystery to me. $X, $Y, and $Z are passed through some impressively scientific mathematical translations that I simply do not understand nor do I have the time to

        Honestly this could also mean that this "impressive math" can't really handle negative numbers.

        You must find a way to test your code!

        If you can't tell when it fails, how can you even know it's buggy???

        > Thanks anyways. Please know that my sarcasm was not directed at you.

        Please know that my sarcasm is always directed at every direction ;-)

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

Re: split versus =~
by tybalt89 (Monsignor) on Nov 10, 2022 at 18:44 UTC

    split version

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11148100
    use warnings;

    for ( glob join '-', ('{-,}42') x 3 )
      {
      my ($X, $Y, $Z) = split /(?<=\d)-/;
      printf "%11s => %3s %3s %3s\n", $_, $X, $Y, $Z;
      }

    Outputs:

    -42--42--42 => -42 -42 -42
     -42--42-42 => -42 -42  42
     -42-42--42 => -42  42 -42
      -42-42-42 => -42  42  42
     42--42--42 =>  42 -42 -42
      42--42-42 =>  42 -42  42
      42-42--42 =>  42  42 -42
       42-42-42 =>  42  42  42
Re: split versus =~
by Fletch (Bishop) on Nov 10, 2022 at 15:32 UTC

    Rather than making people guess haphazardly it would be helpful to provide sample data that shows what your problem is and exactly what you mean by "not work(ing) exactly the same".

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: split versus =~
by russlo (Novice) on Nov 10, 2022 at 20:55 UTC
    This has been answered, and a conclusion was reached after my meeting with the responsible individuals. I hope that in the future I can provide actual data to look at rather than only a description of it. Thank you to all who responded here.

Node Type: perlquestion [id://11148100]
Approved by Athanasius
Front-paged by Corion