(dchetlin: beware the unicode beast) Re(2): Number?

in reply to Re: Number?
in thread Number?

Not so fast! Believe it or not, \d matches 178 characters in utf8. [1]

So, while your last two solutions are equivalent, the first one is correct only if you know you're dealing with ASCII.

[1]: Just in case you don't believe me :-)

[~] $ perl -anle'next if/[^\s\da-f]/i;$t+=hex($F[1])-hex($F[0]);
                 END{print$t}' bleadperl/lib/unicode/Is/Digit.pl 
178

</pedantic>

-dlc

RE: (dchetlin: beware the unicode beast) Re(2): Number? by Fastolfe (Vicar) on Oct 25, 2000 at 19:13 UTC
I would /expect/ that the regexp `\d` would match numbers, and numbers only. If 178 items in Unicode match, why? Are these just different ways of representing numbers in other languages perhaps? Perhaps things like 1/2 or 1/4? Or are you generating Unicode that is overly long, with different Unicode representations for the same numbers? Does anyone have any explanation for why this would be the case? Does it even violate the assumption that `\d` matches numbers?
RE: RE: (dchetlin: beware the unicode beast) Re(2): Number? by dchetlin (Friar) on Oct 25, 2000 at 19:45 UTC
You're exactly right that they're different ways of representing numbers in other languages. If you'd like to see an example of what such a set of numbers might look like, try here (chosen at random). The digits are in the 5th column from the left, labelled 104, in rows 0 through 9.

Whether or not having `\d` match 178 different characters is a good thing depends on the situation. I've been treating the Unicode situation somewhat like Y2K -- it's overhyped, but you still need to worry a bit. Any code that might at some point need to be internationalized should be thought through, and idioms like `tr/0-9//c` discarded.

Of course, I don't turn `utf8` on yet, because the support for Unicode is still immature and shaky, and I'd hate to have a random string be validated as a number just because it contained two bytes next to each other that happened to be `0x1048`. Line disciplines will solve that, eventually.

In sum: I would certainly urge Monks to be early adopters, or at least stay aware of Unicode issues, if for no other reason than to avoid subtle bugs in the future.

-dlc
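
For anyone who wants to poke at the difference between `\d` and an explicit `0-9` range, here is a minimal sketch. It assumes a reasonably modern Perl, where a string holding code points above 255 is matched with Unicode semantics; the helper names `is_unicode_digits` and `is_ascii_digits` are made up for illustration and are not from the posts above. The test character is U+1048 (MYANMAR DIGIT EIGHT), the code point mentioned in the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string consists only of characters
# that \d accepts, which includes non-ASCII decimal digits.
sub is_unicode_digits {
    my ($s) = @_;
    return $s =~ /\A\d+\z/ ? 1 : 0;
}

# Hypothetical helper: true only for the ten ASCII digits 0-9.
sub is_ascii_digits {
    my ($s) = @_;
    return $s =~ /\A[0-9]+\z/ ? 1 : 0;
}

# "42" is plain ASCII; the other two contain U+1048 (MYANMAR DIGIT EIGHT).
for my $s ("42", "\x{1048}", "4\x{1048}") {
    # Show code points instead of raw characters to avoid wide-character output.
    my $shown = join ' ', map { sprintf 'U+%04X', ord } split //, $s;
    printf "%-14s \\d: %d  [0-9]: %d\n",
        $shown, is_unicode_digits($s), is_ascii_digits($s);
}
```

Under those assumptions, only the explicit `[0-9]` class rejects the strings containing U+1048. On Perl 5.14 and later, the `/a` regex modifier is another way to restrict `\d` to ASCII.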
