match utf8

by glassel (Novice)
on Nov 12, 2012 at 13:36 UTC (#1003444)
glassel has asked for the wisdom of the Perl Monks concerning the following question:

In regular expressions, \w matches ordinary (e.g. ASCII) word characters, but not UTF-8 multibyte characters. Is there a way to match the full class of UTF-8 word characters?

Replies are listed 'Best First'.
Re: match utf8
by tobyink (Abbot) on Nov 12, 2012 at 13:54 UTC

    Unless you're using an ancient version of Perl, \w should match any Unicode word character. According to perlre there are over 100,000 characters it matches.

    use 5.010;
    use strict;
    use warnings;
    use utf8::all;

    my $string = "the café";
    say "GOT: $1" if $string =~ /(\w{4})/;

    Make sure your strings are being interpreted as character strings rather than byte strings though. (See perlunicode and utf8.)

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      As shown here, locale can also influence the behaviour of qr/\w/. Using qr/\w/u should also help.
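A minimal sketch of what the /u modifier changes, assuming Perl 5.14 or later (where /u was introduced) and a sample string that is an assumption made for illustration. The string holds "café" with the é stored as the single native byte 0xE9, so it is not flagged as a character string. The script deliberately avoids `use v5.12` or later, since that would enable the unicode_strings feature and hide the difference.

```perl
use strict;
use warnings;

# Hypothetical sample: "caf" followed by byte 0xE9 (Latin-1 "e acute").
# The string contains no code point above 255, so Perl keeps it as bytes.
my $string = "caf\xE9";

# Default rules: for such a string, only ASCII characters count as \w.
my ($default) = $string =~ /(\w+)/;

# /u forces Unicode rules, so 0xE9 is treated as the letter U+00E9.
my ($unicode) = $string =~ /(\w+)/u;

printf "default: %d characters, /u: %d characters\n",
    length $default, length $unicode;
# prints "default: 3 characters, /u: 4 characters"
```

The locale influence mentioned above is not shown here, because its result depends on the locales installed on the machine.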
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by gnork (Scribe) on Nov 12, 2012 at 13:54 UTC
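    \p{Letter} is a Unicode-aware character class for letters. Note that \w also matches digits, marks and the underscore, so its closest Unicode property equivalent is \p{Word}.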
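To compare \p{Letter} with \w, here is a small sketch. The sample string is an assumption chosen for illustration: the Cyrillic word "слово" written with \x{...} escapes (so the result does not depend on how the source file is encoded), followed by an underscore and two digits. Because every Cyrillic code point is above 255, Perl treats the string as a character string without further help.

```perl
use strict;
use warnings;

# Hypothetical sample: five Cyrillic letters, then "_42".
my $string = "\x{441}\x{43B}\x{43E}\x{432}\x{43E}_42";

# \p{Letter} stops at the underscore: it matches letters only.
my ($letters) = $string =~ /(\p{Letter}+)/;

# \w also accepts the underscore and the digits.
my ($w) = $string =~ /(\w+)/;

# \p{Word} is the Unicode property that corresponds to \w.
my ($word) = $string =~ /(\p{Word}+)/;

printf "Letter: %d, \\w: %d, Word: %d\n",
    length $letters, length $w, length $word;
# prints "Letter: 5, \w: 8, Word: 8"
```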

    cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
Re: match utf8
by choroba (Bishop) on Nov 12, 2012 at 13:43 UTC
    Can you give more information? What characters are you trying to match? Are you handling the encoding right?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by ikegami (Pope) on Nov 13, 2012 at 02:40 UTC
    None of them deal with UTF-8. The regex matching engine expects Unicode codepoints. Decode your input (e.g. using Encode's decode) first, then \w will work.
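
The decoding step described above can be sketched as follows. This is only an illustration: the input bytes are a made-up sample standing in for data read from a file or socket, and the variable names are hypothetical. It matches \w+ once against the raw UTF-8 bytes and once against the decoded character string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "café" as raw UTF-8 bytes; "é" is 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";

# Matching the undecoded bytes: the regex engine sees two separate
# high bytes, and under the default rules neither counts as \w.
my ($from_bytes) = $bytes =~ /(\w+)/;

# Decode the bytes into a character string; "é" becomes one code point.
my $chars = decode('UTF-8', $bytes);
my ($from_chars) = $chars =~ /(\w+)/;

printf "bytes: %d, decoded: %d\n",
    length $from_bytes, length $from_chars;
# prints "bytes: 3, decoded: 4"
```

When printing decoded strings, the output usually needs to be encoded again, for example with binmode(STDOUT, ':encoding(UTF-8)'); the sketch sidesteps that by printing lengths only.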
