by glassel (Novice)
on Nov 12, 2012
glassel has asked for the wisdom of the Perl Monks concerning the following question:

In regular expressions, \w matches ordinary (e.g. ASCII) word characters, but not UTF-8 multibyte characters. Is there a way to match the full class of UTF-8 characters?

Re: match utf8
by tobyink (Abbot) on Nov 12, 2012

    Unless you're using an ancient version of Perl, \w should match any Unicode word character. According to perlre there are over 100,000 characters it matches.

    use 5.010;
    use strict;
    use warnings;
    use utf8::all;

    my $string = "the café";
    say "GOT: $1" if $string =~ /(\w{4})/;

    Make sure your strings are being interpreted as character strings rather than byte strings though. (See perlunicode and utf8.)

      As shown here, locale can also influence the behaviour of qr/\w/. Using qr/\w/u (available since Perl 5.14) should also help.
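A minimal sketch of what the /u modifier changes, assuming Perl 5.14 or later and a script with no locale or unicode_strings feature in effect. The variable names are only for illustration.

```perl
use strict;
use warnings;

# "\xE9" is é as a single code point below 256, so Perl keeps this string
# in its native 8-bit form rather than as an internally UTF-8 encoded string.
my $native = "caf\xE9";

# Default rules on a native 8-bit string, with no locale and no
# unicode_strings feature in effect: \w is ASCII-only here.
my ($default) = $native =~ /(\w+)/;

# /u (Perl 5.14 or later) forces Unicode rules, so é counts as a word character.
my ($unicode) = $native =~ /(\w+)/u;

printf "default: %d chars, /u: %d chars\n", length($default), length($unicode);
```

Under those assumptions the default match should stop after "caf", while the /u match covers all four characters. On a Perl older than 5.14 the /u modifier is not recognised at all.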
Re: match utf8
by gnork (Scribe) on Nov 12, 2012
    \p{Letter} (or \p{L} for short) is the Unicode-aware character class for letters. Unlike \w, it does not match digits or the underscore.
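To illustrate how \p{Letter} and \w differ, here is a small sketch. The sample string is made up, and it is decoded from UTF-8 bytes with Encode so that the source file itself stays pure ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the UTF-8 bytes of "café_42", decoded to characters.
my $text = decode('UTF-8', "caf\xC3\xA9_42");

# \w covers letters, digits and the underscore, so it takes the whole string.
my ($word) = $text =~ /(\w+)/;

# \p{Letter} covers letters only, so it stops before the underscore.
my ($letters) = $text =~ /(\p{Letter}+)/;

binmode(STDOUT, ':encoding(UTF-8)');
print "\\w+          : $word\n";
print "\\p{Letter}+  : $letters\n";
```

Which class is the better fit depends on whether digits and underscores should count as part of a "word" in your data.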

Re: match utf8
by choroba (Bishop) on Nov 12, 2012
    Can you give more information? What characters are you trying to match? Are you handling the encoding right?
Re: match utf8
by ikegami (Pope) on Nov 13, 2012
    None of them deal with UTF-8. The regex matching engine expects Unicode codepoints. Decode your input (e.g. using Encode's decode) first, then \w will work.
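The decode step might look roughly like this. It is only a sketch: the byte string stands in for whatever arrives from a file or socket, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "café" as they might arrive from a file or socket
# (the é is the two bytes 0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Decode the bytes into a character string of Unicode code points.
my $chars = decode('UTF-8', $bytes);

# On the decoded string, \w sees é as a single word character.
my ($word) = $chars =~ /(\w+)/;

# Encode again on output so the terminal receives UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print "GOT: $word\n";
```

When reading from a filehandle, an I/O layer such as `open(my $fh, '<:encoding(UTF-8)', $path)` does the same decoding as the data is read, so a separate decode call is not needed.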

