For you out there working in non-English/non-Latin text, how much do you depend on \d matching your native digits?

I suspect I've done this a few times at least, w.r.t. the "Arabic-Indic" digits (U+0660 - U+0669, and even the extended set at U+06F0 - U+06F9); some Arabic typists have a pernicious tendency to use these as well as ASCII digits in a single document.

Folks working in Chinese/Japanese/Korean often see the "full-width digits" (U+FF10 - U+FF19) -- and these can also show up in the same document with ASCII digits.

I'm all for greater consistency in regex semantics, but in that regard, it strikes me as very odd (and probably unfortunate) that /\d/u would not be equivalent to /\p{IsDigit}/, in contrast to what /u does for the \s and \w escapes. (What about "\b", by the way?)

I think it's usually the case, when doing regex matching on non-Latin text, that the primary task is to segregate text into functional (linguistic) categories: word strings vs. digit strings vs. punctuation strings vs... Once that's done, we might want to do different things with the different chunks (like normalizing digit strings).
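A rough sketch of that segregation step, assuming Perl 5.14 or later and an already-decoded character string. The sub name classify_chunks and the category labels are made up for illustration, and the property classes used ([\p{L}\p{M}] for words, \p{Nd} for digits, \p{P} for punctuation) are just one plausible way to draw the lines:

```perl
use 5.014;
use strict;
use warnings;

# Hypothetical helper: walk the string left to right and return a list of
# [ category, text ] pairs. Anything not covered falls into "other".
sub classify_chunks {
    my ($text) = @_;
    my @chunks;
    pos($text) = 0;
    while ( pos($text) < length($text) ) {
        if    ( $text =~ /\G(\p{Nd}+)/gc )       { push @chunks, [ digit => $1 ] }
        elsif ( $text =~ /\G([\p{L}\p{M}]+)/gc ) { push @chunks, [ word  => $1 ] }
        elsif ( $text =~ /\G(\p{P}+)/gc )        { push @chunks, [ punct => $1 ] }
        elsif ( $text =~ /\G(\s+)/gc )           { push @chunks, [ space => $1 ] }
        elsif ( $text =~ /\G(.)/gcs )            { push @chunks, [ other => $1 ] }
    }
    return @chunks;
}

# "abc", a space, two Arabic-Indic digits, a comma, a space, two ASCII digits
my @kinds = map { $_->[0] } classify_chunks("abc \x{0661}\x{0662}, 45");
print "@kinds\n";    # word space digit punct space digit
```

Note that a run like "12\x{0663}" comes back as a single digit chunk, so mixed-digit strings are found first and can be dealt with afterwards.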

If I happen to be working with non-Latin data that uses mixed digits, I think I'd rather error out on finding that some "/\d+/u" strings are not suitable for doing arithmetic than never find out that I'm missing the non-ASCII digit strings altogether because they didn't match "/\d+/u".

If there's no "/u" modifier, and I always have to use perlunicode escapes in regexes in order to match the Unicode character class equivalents of \s \w \d, okay fine, I'll use \pZ [\pL\pM] \pN (or \p{IsSpace} \p{IsWord} \p{IsDigit} if I sense a need to code verbosely).

But if there's going to be a "/u" modifier, I think it would be more consistent (less surprising/annoying) to have it treat \d the same as \w and \s (and \b, for that matter), especially since "\d" is normally understood to be a subset of \w, and with a /u modifier, \w would include non-ASCII digits.
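For comparison, here is how the subset relation can be checked in Perl 5.14 and later, where /u and its ASCII-restricting counterpart /a exist as real modifiers. This is only a small probe; the sample characters and the sub name digit_report are chosen purely for illustration:

```perl
use 5.014;
use strict;
use warnings;

# One digit from each range mentioned in this thread, plus an ASCII digit.
my %sample = (
    'ASCII'                 => "7",
    'Arabic-Indic'          => "\x{0667}",
    'Extended Arabic-Indic' => "\x{06F7}",
    'Fullwidth'             => "\x{FF17}",
);

# Hypothetical helper: report whether \d and \w match a single character
# under Unicode rules (/u) and under ASCII-only rules (/a).
sub digit_report {
    my ($char) = @_;
    return {
        d_u => ( $char =~ /^\d$/u ? 1 : 0 ),
        w_u => ( $char =~ /^\w$/u ? 1 : 0 ),
        d_a => ( $char =~ /^\d$/a ? 1 : 0 ),
        w_a => ( $char =~ /^\w$/a ? 1 : 0 ),
    };
}

for my $name ( sort keys %sample ) {
    my $r = digit_report( $sample{$name} );
    printf "%-22s \\d/u=%d \\w/u=%d \\d/a=%d \\w/a=%d\n",
        $name, @{$r}{qw(d_u w_u d_a w_a)};
}
```

Under /u every sample matches both \d and \w, so \d stays a subset of \w; under /a only the ASCII digit matches either one.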

If someone has to face the task of doing arithmetic on potentially mixed digit strings, it won't be long before we have a CPAN module for this (maybe there's one already?), and testing for non-ASCII digits would be pretty simple:

if ( /^\d+$/u and not /^\d+$/ ) {
    # need to normalize this non-ASCII digit string before doing arithmetic...
}
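In Perl 5.14 and later the same idea can be approximated with the /u and /a modifiers. An unmodified \d can itself match non-ASCII digits on a decoded string, so this sketch uses /a for the ASCII-only side of the test. The helper names are hypothetical, and the normalization only covers the three digit ranges mentioned in this thread:

```perl
use 5.014;
use strict;
use warnings;

# Hypothetical helper: true if the string is all digits, but not all ASCII digits.
sub has_non_ascii_digits {
    my ($str) = @_;
    my $all_digits = $str =~ /^\d+$/u;    # any Unicode decimal digit
    my $all_ascii  = $str =~ /^\d+$/a;    # 0-9 only
    return $all_digits && !$all_ascii ? 1 : 0;
}

# Hypothetical helper: map the digit ranges named in this thread onto 0-9.
# Digits from any other script are left untouched.
sub normalize_digits {
    my ($str) = @_;
    $str =~ tr/\x{0660}-\x{0669}/0-9/;    # Arabic-Indic
    $str =~ tr/\x{06F0}-\x{06F9}/0-9/;    # Extended Arabic-Indic
    $str =~ tr/\x{FF10}-\x{FF19}/0-9/;    # fullwidth
    return $str;
}

my $mixed = "12\x{0663}\x{FF14}";
if ( has_non_ascii_digits($mixed) ) {
    my $sum = normalize_digits($mixed) + 1;
    print "$sum\n";    # prints 1235
}
```

Whether silently normalizing is the right call, as opposed to erroring out on mixed digit strings, is a policy decision this sketch does not try to settle.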

In reply to Re: About \d \w and \s by graff
in thread About \d \w and \s by demerphq
