PerlMonks  

match utf8

by glassel (Novice)
on Nov 12, 2012 at 13:36 UTC ( [id://1003444]=perlquestion )

glassel has asked for the wisdom of the Perl Monks concerning the following question:

In regular expressions, \w matches ordinary (e.g. ASCII) word characters but not UTF-8 multibyte characters. Is there a way to match the full class of UTF-8 characters?

Replies are listed 'Best First'.
Re: match utf8
by tobyink (Canon) on Nov 12, 2012 at 13:54 UTC

    Unless you're using an ancient version of Perl, \w should match any Unicode word character. According to perlre there are over 100,000 characters it matches.

    use 5.010;
    use strict;
    use warnings;
    use utf8::all;

    my $string = "the café";
    say "GOT: $1" if $string =~ /(\w{4})/;

    Make sure your strings are being interpreted as character strings rather than byte strings though. (See perlunicode and utf8.)
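To make the character string versus byte string distinction concrete, here is a minimal sketch. It assumes the core Encode module is available, and the variable names are only illustrative. The same text has a different length depending on whether Perl sees it as characters or as UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the text from code points so no source-encoding pragma is needed.
my $chars = "caf\x{E9}";              # character string: c, a, f, é
my $bytes = encode('UTF-8', $chars);  # byte string: é becomes 0xC3 0xA9

my $char_count = length $chars;
my $byte_count = length $bytes;

print "characters: $char_count, bytes: $byte_count\n";
```

If length reports more units than you expect for your text, the string is probably still a byte string and needs decoding before matching.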

      As shown here, locale can also influence the behaviour of qr/\w/. Using qr/\w/u should also help.
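As a rough illustration of how the character set modifiers change \w, the sketch below tests a single é (U+00E9) under /u (Unicode rules) and /a (ASCII-only rules). It assumes Perl 5.14 or later, where these modifiers are available; the variable names are made up for the example.

```perl
use strict;
use warnings;

# é as a single code point; no utf8 pragma or decoding involved.
my $e_acute = "\x{E9}";

# /u forces Unicode rules, /a restricts \w to ASCII word characters.
my $with_u = $e_acute =~ /\w/u ? 1 : 0;
my $with_a = $e_acute =~ /\w/a ? 1 : 0;

print "with /u: $with_u, with /a: $with_a\n";
```

With no modifier at all, the outcome depends on how the string is stored and on pragmas such as locale, which is why stating /u explicitly can help.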
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by gnork (Scribe) on Nov 12, 2012 at 13:54 UTC
    \p{Letter} is a Unicode-aware character class close to \w, though narrower: it matches letters only, while \w also matches digits, marks, and the underscore.
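To see how \p{Letter} and \w compare in practice, here is a small sketch that tests a few sample code points against both. The sample set and the %result hash are chosen purely for illustration, and /u assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Sample code points: é, the digit 7, underscore, Greek small alpha.
my @samples = (0xE9, 0x37, 0x5F, 0x3B1);
my %result;

for my $cp (@samples) {
    my $char = chr $cp;
    $result{$cp} = {
        word   => ($char =~ /\w/u        ? 1 : 0),
        letter => ($char =~ /\p{Letter}/ ? 1 : 0),
    };
    # Print code points rather than the characters themselves,
    # so no output encoding layer is needed.
    printf "U+%04X  \\w: %d  \\p{Letter}: %d\n",
        $cp, $result{$cp}{word}, $result{$cp}{letter};
}
```

The digit and the underscore match \w but not \p{Letter}, so the two classes are not interchangeable when identifiers or numbers are involved.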


Re: match utf8
by choroba (Cardinal) on Nov 12, 2012 at 13:43 UTC
    Can you give more information? What characters are you trying to match? Are you handling the encoding right?
Re: match utf8
by ikegami (Patriarch) on Nov 13, 2012 at 02:40 UTC
    None of them deal with UTF-8. The regex matching engine expects Unicode codepoints. Decode your input (e.g. using Encode's decode) first, then \w will work.
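A minimal sketch of that decode-first approach, assuming the input arrives as raw UTF-8 bytes (simulated here with a literal) and using decode and encode from the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes, as they might arrive from a file or socket:
# "the café" with é encoded as 0xC3 0xA9.
my $bytes = "the caf\xC3\xA9";

# Decode the bytes into a string of Unicode code points.
my $text = decode('UTF-8', $bytes);

# \w now sees é as one word character, so four word characters match.
my ($word) = $text =~ /(\w{4})/;

# Encode again on the way out so the terminal receives valid UTF-8.
print encode('UTF-8', "GOT: $word\n");
```

In a real program the decoding is often done by an I/O layer such as :encoding(UTF-8) on the filehandle rather than by calling decode by hand.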
