
match utf8

by glassel (Novice)
on Nov 12, 2012 at 13:36 UTC (#1003444=perlquestion)
glassel has asked for the wisdom of the Perl Monks concerning the following question:

In regular expressions, \w matches ordinary (e.g. ASCII) word characters, but not UTF-8 multibyte characters. Is there a way to match the full class of UTF-8 characters?

Replies are listed 'Best First'.
Re: match utf8
by tobyink (Abbot) on Nov 12, 2012 at 13:54 UTC

    Unless you're using an ancient version of Perl, \w should match any Unicode word character. According to perlre there are over 100,000 characters it matches.

    use 5.010;
    use strict;
    use warnings;
    use utf8::all;

    my $string = "the café";
    say "GOT: $1" if $string =~ /(\w{4})/;

    Make sure your strings are being interpreted as character strings rather than byte strings though. (See perlunicode and utf8.)
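One common way to end up with character strings is to declare the encoding on the filehandle. A minimal sketch, assuming the input is UTF-8 encoded; the in-memory filehandle and the variable names are only illustrative stand-ins for a real file:

```perl
use strict;
use warnings;

# Assumption: the input is UTF-8. These are the raw bytes of "café\n".
my $raw = "caf\xC3\xA9\n";

# An in-memory filehandle stands in for a real file; the :encoding layer
# decodes the bytes into characters as they are read.
open(my $in, '<:encoding(UTF-8)', \$raw) or die "Cannot open: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_count = length $line;              # counts characters, not bytes
my $all_word   = $line =~ /^\w+$/ ? 1 : 0;  # 1 if every character matches \w

print "characters: $char_count, all word characters: $all_word\n";
```

With a real file, the same layer would go in the open mode, or be applied to STDIN with binmode.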

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      As shown here, locale can also influence the behaviour of qr/\w/. Using qr/\w/u should also help.
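The effect of the /u modifier can be seen on a string that has not been upgraded to Perl's internal Unicode representation. This is only a sketch: it assumes Perl 5.14 or later (where /u exists), no active locale, and that the unicode_strings feature is not enabled in the surrounding scope.

```perl
use strict;
use warnings;

# A one-character string holding code point U+00E9 (é), stored natively,
# i.e. without Perl's internal UTF-8 flag.
my $e_acute = "\xE9";

# Default rules in this scope (no locale, no unicode_strings feature):
# for such a string only ASCII word characters match \w.
my $default_match = $e_acute =~ /\w/  ? 1 : 0;

# /u forces Unicode rules regardless of how the string is stored.
my $unicode_match = $e_acute =~ /\w/u ? 1 : 0;

print "default: $default_match, /u: $unicode_match\n";
```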
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by gnork (Scribe) on Nov 12, 2012 at 13:54 UTC
    \p{Letter} is the corresponding UTF8 aware character class for \w
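Note that the two classes are not identical: \w also accepts digits and the underscore, while \p{Letter} accepts letters only. A small sketch of the difference, assuming Perl 5.14 or later for the /u modifier (the sample string is made up for illustration):

```perl
use strict;
use warnings;

my $text = "caf\x{E9}_42";   # "café_42", with é written as an escape

# In list context, a /g match without captures returns every match.
my @word_chars = $text =~ /\w/gu;           # letters, digits, underscore
my @letters    = $text =~ /\p{Letter}/gu;   # letters only

printf "\\w: %d, \\p{Letter}: %d\n", scalar @word_chars, scalar @letters;
```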

    cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
Re: match utf8
by choroba (Chancellor) on Nov 12, 2012 at 13:43 UTC
    Can you give more information? What characters are you trying to match? Are you handling the encoding right?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by ikegami (Pope) on Nov 13, 2012 at 02:40 UTC
    None of them deal with UTF-8. The regex matching engine expects Unicode codepoints. Decode your input (e.g. using Encode's decode) first, then \w will work.
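A minimal sketch of that decoding step, assuming the input bytes are UTF-8 (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the raw UTF-8 bytes of "café" as read from a
# file or socket without any decoding layer.
my $bytes = "caf\xC3\xA9";

# Matching against the raw bytes: the two bytes of é are not word
# characters under the default rules, so the match stops after "caf".
my ($before) = $bytes =~ /^(\w+)/;

# Decode the bytes into a string of Unicode code points first.
my $chars = decode('UTF-8', $bytes);

# Now é is a single character and \w matches it.
my ($after) = $chars =~ /^(\w+)/;

printf "before decode: %d chars matched, after decode: %d chars matched\n",
    length $before, length $after;
```

When printing the decoded string, the output needs to be encoded again, for example with an :encoding(UTF-8) layer on the output handle.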

Node Type: perlquestion [id://1003444]
Front-paged by Corion