Re: About \d \w and \s

by graff (Chancellor)
on Oct 19, 2009 at 06:43 UTC

in reply to About \d \w and \s

For you out there working in non-english/latin how much do you depend on \d matching your native digits?

I suspect I've done this a few times at least, w.r.t. the "Arabic-Indic" digits (U+0660 - U+0669, and even the "Extended Arabic-Indic" set, U+06F0 - U+06F9); some Arabic typists have a pernicious tendency to mix these with ASCII digits in a single document.

Folks working in Chinese/Japanese/Korean often see the "full-width digits" (U+FF10 - U+FF19) -- and these can also show up in the same document with ASCII digits.
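To make the mixed-digit situation concrete, here is a minimal sketch of how such strings behave in Perl releases from 5.14 onward, which postdate this thread. It assumes the /a modifier from those releases (restricting \d to [0-9]); the helper name digit_kind and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report what kind of digits a string holds.
# Assumes Perl 5.14+ for the /a modifier.
sub digit_kind {
    my ($s) = @_;
    return 'ASCII digits'              if $s =~ /^\d+$/a;  # only [0-9]
    return 'non-ASCII or mixed digits' if $s =~ /^\d+$/;   # any Unicode decimal digit
    return 'not all digits';
}

my %samples = (
    ascii        => "2009",
    arabic_indic => "\x{0662}\x{0660}\x{0660}\x{0669}",
    fullwidth    => "\x{FF12}\x{FF10}\x{FF10}\x{FF19}",
    mixed        => "20\x{0660}\x{0669}",
    word         => "abc",
);

for my $name ( sort keys %samples ) {
    printf "%-13s %s\n", $name, digit_kind( $samples{$name} );
}
```

The point of checking /a first is that the plain \d test alone cannot tell an all-ASCII string from one that contains other scripts' digits.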

I'm all for greater consistency in regex semantics, but in that regard, it strikes me as very odd (and probably unfortunate) that /\d/u would not be equivalent to /\p{IsDigit}/, in contrast to what /u does for the \s and \w escapes. (What about "\b", by the way?)

I think it's usually the case, when doing regex matching on non-Latin text, that the primary task is to segregate text into functional (linguistic) categories: word strings vs. digit strings vs. punctuation strings vs... Once that's done, we might want to do different things with the different chunks (like normalizing digit strings).
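As a rough sketch of that segregation step, the snippet below splits a string into typed chunks using Unicode properties. The function name chunk_text, the category labels, and the choice of properties (\p{Nd}, [\p{L}\p{M}], \p{P}) are assumptions made for this example, not a standard recipe.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into [type, string] chunks.
sub chunk_text {
    my ($text) = @_;
    my @chunks;
    # \G anchors each match where the previous one ended; the final
    # alternative (.) guarantees every character lands in some chunk.
    while ( $text =~ /\G(?:(\p{Nd}+)|([\p{L}\p{M}]+)|(\p{P}+)|(\s+)|(.))/gcs ) {
        my ( $type, $str ) =
              defined $1 ? ( digits => $1 )
            : defined $2 ? ( word   => $2 )
            : defined $3 ? ( punct  => $3 )
            : defined $4 ? ( space  => $4 )
            :              ( other  => $5 );
        push @chunks, [ $type, $str ];
    }
    return @chunks;
}

# Example: ASCII letters, Arabic-Indic digits, punctuation, ASCII digits.
my @chunks = chunk_text("abc \x{0661}\x{0662}, 34!");
print join( ' ', map { $_->[0] } @chunks ), "\n";
```

Once the chunks are typed, each kind can be handled separately, for example by normalizing only the digit chunks.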

If I happen to be working with non-Latin data that uses mixed digits, I'd prefer to error out on finding that some "/\d+/u" strings are not suitable for doing arithmetic, rather than never find out that I'm missing the non-ASCII digit strings altogether because they didn't match "/\d+/u".

If there's no "/u" modifier, and I always have to use perlunicode escapes in regexes in order to match unicode character class equivalents of \s \w \d, okay fine, I'll use \pZ [\pL\pM] \pN as rough equivalents (or \p{IsSpace} \p{IsWord} \p{IsDigit} if I sense a need to code verbosely).

But if there's going to be a "/u" modifier, I think it would be more consistent (less surprising/annoying) to have it treat \d the same as \w and \s (and \b, for that matter), especially since "\d" is normally understood to be a subset of \w, and with a /u modifier, \w would include non-ASCII digits.

If someone has to face the task of doing arithmetic on potentially mixed digit strings, it won't be long before we have a CPAN module for this (maybe there's one already?), and testing for non-ASCII digits would be pretty simple:

    if ( /^\d+$/u and not /^\d+$/ ) {
        # need to normalize this non-ASCII digit string before doing arithmetic...
    }
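For what it's worth, here is one way the test and the normalization might look in Perl 5.14 or later, sketched under a few assumptions: the /a modifier stands in for the "plain ASCII \d" side of the test above, and the per-character conversion relies on num() from the core Unicode::UCD module (available since 5.14, as far as I know). The function names needs_normalizing and normalize_digits are made up for this example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# True if the string is all digits but not all ASCII digits.
# Assumes Perl 5.14+ (/a modifier).
sub needs_normalizing {
    my ($s) = @_;
    return ( $s =~ /^\d+$/ and $s !~ /^\d+$/a ) ? 1 : 0;
}

# Replace every Unicode decimal digit with its ASCII equivalent.
# num() is called one character at a time, so mixed scripts are fine.
sub normalize_digits {
    my ($s) = @_;
    ( my $copy = $s ) =~ s/(\p{Nd})/num($1)/ge;
    return $copy;
}

my $mixed = "\x{0661}\x{FF12}3";    # Arabic-Indic 1, full-width 2, ASCII 3
if ( needs_normalizing($mixed) ) {
    my $n = normalize_digits($mixed);
    print $n + 1, "\n";
}
```

Converting digit by digit is a deliberate choice here: a document that mixes digit scripts within one number still ends up as a usable ASCII string.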
