http://www.perlmonks.org?node_id=496914

jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I thought I understood regexp but today this problem proved me wrong :(
I need to split a string such as abcd2 or abc: the length is unknown, and there may or may not be a number at the end.
I'd like to put the characters and the number in separate variables, like this:
$str = "abdbdr23"
($name,$num) = ($str =~ /^(\w+)(\d{0,}$/) ;

Any suggestions why this is not working ?

Thanks
Luca

Replies are listed 'Best First'.
Re: Simple regular expression problem
by polypompholyx (Chaplain) on Oct 03, 2005 at 13:59 UTC

    Apart from the typos, the reason is that \w+ will gobble up all the word characters in $str, which includes the number at the end. Since your match specifies 'zero or more' digits at the end, $num gets an empty string. You need to modify the regex to make the \w+ non-greedy, using the ? modifier:

    my $str = "abdbdr23";
    my ( $name, $num ) = ( $str =~ /^(\w+?)(\d*)$/ );

    * is a much less ghastly way of writing {0,}.
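To see the difference described above, here is a minimal sketch comparing the greedy and non-greedy patterns on the OP's sample strings. The helper names `greedy_parts` and `lazy_parts` are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the OP's pattern with the missing parenthesis added.
# \w+ is greedy and \w also matches digits, so it swallows the trailing number.
sub greedy_parts {
    my ($str) = @_;
    return $str =~ /^(\w+)(\d*)$/;
}

# Hypothetical helper: the non-greedy version suggested in the reply.
# \w+? takes as few characters as it can, leaving the digits for \d*.
sub lazy_parts {
    my ($str) = @_;
    return $str =~ /^(\w+?)(\d*)$/;
}

for my $str ('abdbdr23', 'abc') {
    my ($g_name, $g_num) = greedy_parts($str);
    my ($l_name, $l_num) = lazy_parts($str);
    print "$str: greedy=[$g_name][$g_num] lazy=[$l_name][$l_num]\n";
}
```

For "abdbdr23" the greedy pattern leaves the second capture empty, while the non-greedy one yields "abdbdr" and "23". For "abc" both give "abc" and an empty string.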

      It's better to avoid the ? modifier in most cases, as it's less efficient than the alternatives. Here's a benchmark:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark 'cmpthese';
      use Test::More tests => 2;

      our @data = qw 'foo123 abdbdr23 abcd2 abc 1234 foo!123';
      our (@plain, @sticky);
      my @expected = ([qw 'foo 123'], [qw 'abdbdr 23'], [qw 'abcd 2'],
                      ['abc', ''], ['', 1234], []);

      cmpthese -1, {
          plain  => '@plain  = map {[/^([a-z]*)(\d*)$/]} @data',
          sticky => '@sticky = map {[/^(\w*?)(\d*)$/]} @data',
      };

      is_deeply \@plain,  \@expected;
      is_deeply \@sticky, \@expected;
      __END__
      1..2
                Rate sticky  plain
      sticky 32582/s     --   -17%
      plain  39385/s    21%     --
      ok 1
      ok 2
      Perl --((8:>*

        Benchmarking is fun. However, you should consider your results a little more carefully before making recommendations based on them. This would definitely count as a minor optimization at best, since we are talking about 32k instead of 40k per second. That means that unless you are doing hundreds of thousands of these comparisons, you are never going to notice the difference. Also interesting is the result of that benchmark on my machine:

        1..2
                  Rate  plain sticky
        plain  23682/s     --    -1%
        sticky 23904/s     1%     --
        ok 1
        ok 2

        Oddly, the difference dropped to a mere few hundred per second.


        ___________
        Eric Hodges
        The OP didn't make it clear whether the string before the number could contain digits. However, it's certainly better to be specific in a regex: if you know (for some value of 'know') that something will only contain [A-Za-z], not \w, then the former is probably preferable. On the other hand, [A-Za-z] too often means "I cannot think of any other letters", and then your script barfs on something perfectly valid but unexpected, like "Ångström".
        Thanks for all the suggestions. I fixed it with \w+?, or maybe I'll use the alpha example! And now that I understand my mistake, I see that it was described in the perldoc manual all along!

        All your replies are really helpful,
        Thanks a lot!!
        Luca

      That won't work if the string contains no digits (which the OP said was a possible input). For example, $name will be just "a" for "abdbdr".

      Update: Just plain wrong.

        That is why he anchored the regex with $: the match has to reach the end of the string, whether or not a digit is present.
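A small sketch of why the $ anchor matters for the no-digit case raised above. The helper names `anchored` and `unanchored` are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: non-greedy pattern anchored at the end of the string.
# The $ forces \w+? to keep expanding until \d* can reach the end.
sub anchored {
    my ($str) = @_;
    return $str =~ /^(\w+?)(\d*)$/;
}

# Hypothetical helper: same pattern without the $ anchor.
# \w+? is satisfied after one character and \d* happily matches nothing.
sub unanchored {
    my ($str) = @_;
    return $str =~ /^(\w+?)(\d*)/;
}

for my $str ('abdbdr', 'abdbdr23') {
    my ($a_name, $a_num) = anchored($str);
    my ($u_name, $u_num) = unanchored($str);
    print "$str: anchored=[$a_name][$a_num] unanchored=[$u_name][$u_num]\n";
}
```

Only the unanchored variant produces the "a" result; with the anchor in place, "abdbdr" comes back whole with an empty number.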
Re: Simple regular expression problem
by prasadbabu (Prior) on Oct 03, 2005 at 13:49 UTC

    \w matches word characters, [0-9a-zA-Z_], so in your regex \w also matches the digits. Change it as shown.

    You are also missing a closing parenthesis in the second group.

    $str = "abdbdr23";
    ($name,$num) = ($str =~ /^([a-zA-Z]+)(\d{0,})$/) ;
    print "$name\t$num\n";

    Prasad

      <shudder> That works fine in English. But not so good in pretty much any other language. e.g., accented characters and the like, or non-roman languages such as Arabic, Hebrew, Hindi, or pretty much any Asian language. Ok, maybe today you don't support them, but maybe tomorrow? Besides that, this regexp is not self-documenting if you mean to say you want to match "letters". Better to use the POSIX classes documented in perlre:

      ($name,$num) = $str =~ /^([[:alpha:]]+)(\d*)$/;
      This does a full Unicode match against "alphabetic", which has a very well-defined and globalised meaning.

      I'm also unsure why you use "{0,}": it has precisely the same meaning as "*", especially since you used "+" instead of "{1,}". Above all, be consistent!
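A hedged sketch of the difference between the ASCII range and the POSIX class, using the "Ångström" example mentioned elsewhere in this thread. Two assumptions that are not in the original posts: the /u modifier (Perl 5.14 or later) is added so that Unicode rules apply regardless of how the string is stored internally, and the helper names `ascii_parts` and `alpha_parts` are invented for illustration.

```perl
use strict;
use warnings;

# "Ångström42" written with escapes so the example does not depend on the
# source file's encoding (\x{C5} is Å, \x{F6} is ö).
my $sample = "\x{C5}ngstr\x{F6}m42";

# Hypothetical helper: ASCII-only letters.
sub ascii_parts {
    my ($str) = @_;
    return $str =~ /^([a-zA-Z]+)(\d*)$/;
}

# Hypothetical helper: POSIX class; /u is an assumption added here
# to force Unicode rules on any string.
sub alpha_parts {
    my ($str) = @_;
    return $str =~ /^([[:alpha:]]+)(\d*)$/u;
}

my @ascii = ascii_parts($sample);
my @alpha = alpha_parts($sample);

printf "ASCII range matched: %s\n", @ascii ? 'yes' : 'no';
printf "POSIX class matched: %s\n", @alpha ? 'yes' : 'no';
printf "number found: %s\n", $alpha[1] if @alpha;
```

The ASCII range fails on the very first character, while the POSIX class captures all eight letters and leaves "42" for the second group.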

Re: Simple regular expression problem
by muba (Priest) on Oct 03, 2005 at 13:56 UTC
    ($name,$num) = ($str =~ /^(\w+)(\d{0,}$/) ;
    As far as I can see, you forgot something:
    ($name, $num) =     # assign string parts to variables
    ($str =~            # we gonna do regexes!
      /^                # beginning of the string
       (                # begin of group
         \w+            # \w a couple'o times
       )                # end of group
       (                # begin of group
         \d{0,}         # \d a couple'o times
                        # why not just \d* ?
         $              # end of string
      /                 # slash after end of string
    )                   # end of group
    ;                   # a semicolon after end of string
                        # unexpected end of line?
    I challenge you to find the mistake :)
    Update: and also see the first reply :)
Re: Simple regular expression problem
by sauoq (Abbot) on Oct 03, 2005 at 17:42 UTC
    I need to split a string

    Usually when a person says that, he really wants to use split. I don't see why this would be an exception...

    ($name, $num) = split /(?=\d+$)/, $str, 2;
    This avoids all the uproar about non-ASCII characters and meets your specification exactly, in that it makes no assumptions about the string prior to the digits at the end. It uses a zero-width look-ahead assertion to split without losing characters, and it uses the three-argument version of split to limit the split to two parts (otherwise a string like "abc123" would split into 4 parts).
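A minimal sketch of the split approach, showing what the limit argument changes. The wrapper name `split_name_num` is hypothetical and used only here.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split shown above.
# The look-ahead matches at every position followed only by digits up to
# the end of the string; the limit of 2 stops after the first such position.
sub split_name_num {
    my ($str) = @_;
    return split /(?=\d+$)/, $str, 2;
}

my @limited   = split_name_num('abc123');
my @unlimited = split /(?=\d+$)/, 'abc123';

print "with limit:    @limited\n";      # abc 123
print "without limit: @unlimited\n";    # abc 1 2 3
```

One thing to keep in mind: for a string with no trailing digits, such as "abc", split returns a single element, so $num ends up undef rather than an empty string.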

    -sauoq
    "My two cents aren't worth a dime.";