davido has asked for the wisdom of the Perl Monks concerning the following question:

I'm sure there is a good historical reason why the character class \w (used in regular expressions) includes (under ASCII) A..Z, a..z, 0..9, and the _ underscore character. I understand that at this point there's no going back to define \w more narrowly without breaking billions of lines of code already out there. But I have often wondered why \w was implemented this way in the first place.

Obviously if I want a character class that allows only alphabetical characters, that's easy to construct with "[a-zA-Z]". But that approach isn't as effective when programming with locales using Unicode. In that environment, \w automagically includes accented "word" characters. So it is difficult to construct a locale-portable character class representation of "word" characters that excludes numerics and underscore.

It seems to me that the \w character class definition is too broad. It would be easier to work with a more narrowly defined character class. For example, let's say there's a new character class called \a, which represents alpha characters only. If one wanted to extend this imaginary character class of alpha characters to also include numeric characters, it would be sufficient to say [\d\a]. Yet it's difficult to subtract items from predefined character classes. You can't say [\w\D] if what you intend is "word characters minus numeric characters".

I'm curious as to why \w includes numeric and underscore characters. I'm also curious as to what would constitute a locale-friendly alpha-only character class (one that excludes numeric digits).

...seeking enlightenment...


Dave

Re: Why does \w include numbers and underscore?
by kvale (Monsignor) on Apr 22, 2004 at 02:34 UTC
    Historically, the awk and sed regex engines had a \w that supported underscores and perl was designed as a much improved version of these tools. Don't know the history before that, but I suspect it is derived from the idea that C identifiers can contain letters, numbers and underscores. You can use Unicode character classes for alpha matching:
    $x =~ /^\p{IsAlpha}/;
    See perlretut for more examples.
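
    A minimal sketch of the approach above, contrasting a letters-only test with what \w would accept. The helper name is_alpha_only is hypothetical (it is not from perlretut), and the script assumes a UTF-8 source file and terminal:

    ```perl
    use strict;
    use warnings;
    use utf8;   # the string literals below contain non-ASCII letters

    # Hypothetical helper: true if $s consists only of alphabetic characters.
    # \A and \z anchor the whole string (unlike $, \z ignores no trailing newline).
    sub is_alpha_only {
        my ($s) = @_;
        return $s =~ /\A\p{IsAlpha}+\z/ ? 1 : 0;
    }

    binmode STDOUT, ':encoding(UTF-8)';
    for my $word ('naïve', 'élan', 'abc123', 'snake_case') {
        my $verdict = is_alpha_only($word) ? 'alpha only' : 'not alpha only';
        print "$word: $verdict\n";
    }
    ```

    All four sample strings would match /\A\w+\z/; only the first two pass the alphabetic test, since digits and the underscore are not alphabetic.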

    -Mark

      Note that if you for some reason prefer POSIX names for these character classes, you can do this:
      $x =~ /^[[:alpha:]]/;
      More can be found in perlre.
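
      A short sketch of the POSIX form. Perl only recognizes [:alpha:] and friends inside an enclosing bracketed class, so the working spelling has two sets of brackets. The helper names below are hypothetical, chosen for illustration:

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: true if $s starts with a letter.
      # The POSIX class sits inside an outer bracketed class: [ [:alpha:] ]
      sub starts_with_alpha {
          my ($s) = @_;
          return $s =~ /^[[:alpha:]]/ ? 1 : 0;
      }

      # Hypothetical helper: POSIX classes combine with other items in the
      # same brackets, here "letters or digits" without the underscore of \w.
      sub is_alnum_only {
          my ($s) = @_;
          return $s =~ /\A[[:alpha:][:digit:]]+\z/ ? 1 : 0;
      }

      for my $s ('perl5', '5perl', 'perl_5') {
          print "$s: starts_with_alpha=", starts_with_alpha($s),
                " is_alnum_only=", is_alnum_only($s), "\n";
      }
      ```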
Re: Why does \w include numbers and underscore?
by BrowserUk (Pope) on Apr 22, 2004 at 02:23 UTC

    Because they are the characters typically legal in variable identifiers.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: Why does \w include numbers and underscore?
by japhy (Canon) on Apr 22, 2004 at 02:47 UTC
    Once I patch Perl to make it Level 1 TR18-compliant, then you'll be able to write things like if ($str =~ /^[\w-\d]+$/) { ... }
    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Perl has supported [^\W\d] forever. One of my favorites is [^\S\n] (any whitespace character but newline).
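
      A small sketch of how the double negation reads, using only core regex syntax; the variable names are made up for the example. Note that [^\W\d] still lets the underscore through, so the sketch also shows [^\W\d_] for the case where it should be excluded:

      ```perl
      use strict;
      use warnings;

      # [^\W\d]: "not (non-word or digit)" = word characters except digits.
      # The underscore is still a word character, so it still matches.
      my $word_no_digit = qr/\A[^\W\d]+\z/;

      # [^\W\d_]: additionally excludes the underscore.
      my $no_digit_no_underscore = qr/\A[^\W\d_]+\z/;

      # [^\S\n]: "not (non-whitespace or newline)" = whitespace except newline.
      my $blank_no_newline = qr/[^\S\n]+/;

      for my $s ('foo', 'foo_bar', 'foo42') {
          printf "%-8s [^\\W\\d]: %-3s  [^\\W\\d_]: %s\n", $s,
              ($s =~ $word_no_digit          ? 'yes' : 'no'),
              ($s =~ $no_digit_no_underscore ? 'yes' : 'no');
      }

      # Squeeze runs of spaces and tabs to one space, leaving line breaks alone.
      my $text = "a \t b\nc  d";
      $text =~ s/$blank_no_newline/ /g;
      print "$text\n";
      ```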

      (update) I wasn't trying to argue against nor diminish your proposed patch. I just wanted to mention this solution to this problem which didn't appear to be covered in the thread yet.

      - tye        

        Yeah, but those are simple cases. I'm looking for scalability.
      To see if I understand, qr/[a-z]/ is to be seen as a class consisting of a range, but qr/[\w-\d]/ is to be seen as a class composed of the difference between two sets?

      Would qr/[\d-89]/ find only octal digits or would it find the 9 too?

      --
      [ e d @ h a l l e y . c c ]

        If I understand TR18 correctly, it would match only octal digits. All the examples in TR18 show additional brackets or give whitespace to help make things clear.

        Union binds tighter than intersection, which binds tighter than subtraction, so [\d-89] is the same as [\d-[89]].
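
        For readers on later Perls: this is not the patch described above, but Perl 5.18 and newer ship an "extended bracketed character class" construct, (?[ ... ]), that offers set operations of this kind with a different spelling. A minimal sketch, assuming Perl 5.18 or later; the variable names are illustrative, and each operand is written out explicitly so nothing depends on operator precedence:

        ```perl
        use v5.18;      # (?[ ... ]) first appeared in Perl 5.18
        use warnings;
        no warnings 'experimental::regex_sets';  # it was experimental before 5.36

        # Word characters minus digits (the underscore is still included).
        my $word_minus_digit = qr/\A(?[ \w - \d ])+\z/;

        # ASCII digits minus 8 and 9, i.e. octal digits only.
        my $octal_digit = qr/\A(?[ [0-9] - [89] ])+\z/;

        for my $s ('abc', 'a_b', 'abc1') {
            say "$s: ", ($s =~ $word_minus_digit ? 'word minus digit' : 'no match');
        }
        for my $s ('755', '789') {
            say "$s: ", ($s =~ $octal_digit ? 'octal' : 'not octal');
        }
        ```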

Re: Why does \w include numbers and underscore?
by Anonymous Monk on Apr 22, 2004 at 13:14 UTC

    A more nostalgic (though completely unfounded and undoubtedly erroneous) explanation would be that in the _OLD_ days underscores held a more _prestigious_ place among their order. In this age of evolved markups the lowly emphasizer is but a pig in the sty of \w perls.

    Or maybe not:)

    - Andy