PerlMonks  

Unicode regular expressions

by SilasTheMonk (Chaplain)
on Dec 05, 2009 at 16:14 UTC (#811249, perlquestion)
SilasTheMonk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to come up with a regular expression that will match against a string allowing regular letters, hyphens, unicode letters, numbers, spaces, newlines (\n or \r\n) but no punctuation of any sort. For the moment I am completely ignoring the newlines.

Additionally ignoring the unicode characters the following works:

/^[\w\ \-]+$/
Now adding in the unicode characters the following ought to work according to the docs:
/^[\w\ \-\X]+$/
However I actually get the following error:
Unrecognized escape \X in character class passed through in regex; marked by <-- HERE in m/^[\w\ \-\X <-- HERE ]+$/ at ./b.pl line 10.

Now the docs suggest a somewhat more complicated alternative to \X which does sort of work:

/^[\w\ \-(?:\P{M}\p{M}+)]+$/
This works in the sense that it will accept unicode characters; however, it will actually accept just about anything. I have tried ways round this, such as using zero-width assertions and other possibilities opened up by reading about unicode properties. However, everything I have tried seems either to reject unicode or to allow everything.

I have encapsulated the behaviour in the following scriptlet:

#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    my $line = $_;
    if ($line =~ /^[\w\ \-]+$/) {
        print "STRAIGHT - ";
    }
    if ($line =~ /^[\w\ \-(?:\P{M}\p{M}+)]+$/) {
        print "OKAY\n";
    }
    else {
        print "BLAH!\n";
    }
}

My standard example of a unicode word is "księgowość".

When I run "perl -V" I get the following:

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.26-2-amd64, archname=i486-linux-gnu-thread-multi
    uname='linux puccini 2.6.26-2-amd64 #1 smp fri aug 14 07:12:04 utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_ITHREADS
                        USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Aug 28 2009 22:15:29
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .

Re: Unicode regular expressions
by ikegami (Pope) on Dec 05, 2009 at 16:20 UTC
    \X isn't a class. It can match multiple characters.
    /^(?:\X|[\w -])+$/

    \X matches the space, the dash and everything in \w, so the above simplifies to

    /^\X+$/
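To compare the two patterns above, here is a minimal sketch (the sub names and sample strings are illustrative, not from the thread). One thing it makes visible: \X matches any single grapheme cluster, punctuation included, so both patterns also accept the "$%%^&" kind of input that the question wants rejected.

```perl
use strict;
use warnings;

# \X cannot be a member of a bracketed class, so it sits outside the
# brackets as one branch of an alternation.
sub matches_alternation { return $_[0] =~ /^(?:\X|[\w -])+$/ ? 1 : 0 }

# The simplified form: one or more extended grapheme clusters.
sub matches_x_only { return $_[0] =~ /^\X+$/ ? 1 : 0 }

my @samples = (
    [ 'ascii words' => 'plain words' ],
    [ 'polish word' => "ksi\x{119}gowo\x{15B}\x{107}" ],
    [ 'punctuation' => '$%%^&' ],
);

for my $sample (@samples) {
    my ($label, $s) = @$sample;
    printf "%-12s alternation=%d x_only=%d\n",
        $label, matches_alternation($s), matches_x_only($s);
}
```

So \X answers "is this one visual character?" rather than "is this a letter?"; filtering by kind of character needs a property such as \p{L}.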
Re: Unicode regular expressions
by Anonymous Monk on Dec 05, 2009 at 16:29 UTC
Re: Unicode regular expressions (Decomposed)
by ikegami (Pope) on Dec 05, 2009 at 17:02 UTC

    I am trying to come up with a regular expression that will match against a string allowing regular letters, hyphens, unicode letters, numbers, spaces, newlines (\n or \r\n) but no punctuation of any sort.

    use charnames qw( :full );
    my $s = "ksi\N{LATIN SMALL LETTER E WITH OGONEK}"
          . "gowo\N{LATIN SMALL LETTER S WITH ACUTE}"
          . "\N{LATIN SMALL LETTER C WITH ACUTE}";
    print $s =~ /^(?:\r\n|[\p{Alnum} \n-])*\z/ ? "match\n" : "no match\n";

    match

    What does \X have to do with that? Is it that the string is (at least partially) decomposed?

    use charnames qw( :full );
    my $s = "ksie\N{COMBINING OGONEK}gowo"
          . "s\N{COMBINING ACUTE ACCENT}"
          . "c\N{COMBINING ACUTE ACCENT}";
    print $s =~ /^(?:\r\n|[\p{Alnum} \n-])*\z/ ? "match\n" : "no match\n";

    match

    But that also matches. (ok, that surprised me)


    On decomposed characters,

    For anyone who doesn't know, some of what you perceive as a single character can be represented by more than one sequence of Unicode characters. Take "é", for example. It can be the single character "é" (U+00E9) or the character "e" followed by the combining acute accent character (U+0301). Here are two forms of the string provided by the OP (fully composed and fully decomposed):

    use Unicode::Normalize qw( normalize );
    use charnames qw( );
    my $s = "ksi\x{0119}gowo\x{015B}\x{0107}";
    for (qw(NFC NFD)) {
        print "$_\n";
        printf("U+%04X: %s\n", $_, charnames::viacode($_))
            for map ord, split //, normalize($_, $s);
        print("\n");
    }

    NFC
    U+006B: LATIN SMALL LETTER K
    U+0073: LATIN SMALL LETTER S
    U+0069: LATIN SMALL LETTER I
    U+0119: LATIN SMALL LETTER E WITH OGONEK
    U+0067: LATIN SMALL LETTER G
    U+006F: LATIN SMALL LETTER O
    U+0077: LATIN SMALL LETTER W
    U+006F: LATIN SMALL LETTER O
    U+015B: LATIN SMALL LETTER S WITH ACUTE
    U+0107: LATIN SMALL LETTER C WITH ACUTE

    NFD
    U+006B: LATIN SMALL LETTER K
    U+0073: LATIN SMALL LETTER S
    U+0069: LATIN SMALL LETTER I
    U+0065: LATIN SMALL LETTER E
    U+0328: COMBINING OGONEK
    U+0067: LATIN SMALL LETTER G
    U+006F: LATIN SMALL LETTER O
    U+0077: LATIN SMALL LETTER W
    U+006F: LATIN SMALL LETTER O
    U+0073: LATIN SMALL LETTER S
    U+0301: COMBINING ACUTE ACCENT
    U+0063: LATIN SMALL LETTER C
    U+0301: COMBINING ACUTE ACCENT

    \X is used to match a "visual character". Back to our example, both

    "\N{LATIN SMALL LETTER E WITH ACUTE}" =~ /^\X\z/
    and
    "e\N{COMBINING ACUTE ACCENT}" =~ /^\X\z/
    will match.
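As a runnable version of that comparison, here is a small sketch; the helper name cluster_count is made up for illustration. It uses \x{...} escapes instead of \N{...} so that it does not depend on charnames being loaded.

```perl
use strict;
use warnings;

my $composed   = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";   # "e" + COMBINING ACUTE ACCENT

# Count grapheme clusters by collecting every \X match in list context.
sub cluster_count {
    my @clusters = $_[0] =~ /\X/g;
    return scalar @clusters;
}

for my $s ($composed, $decomposed) {
    printf "code points=%d clusters=%d single \\X=%s\n",
        length($s), cluster_count($s), ($s =~ /^\X\z/ ? 'yes' : 'no');
}
```

The two strings differ in length() (1 versus 2 code points), but each is a single cluster as far as \X is concerned.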

    (By the way \X doesn't match everything it should. This will be fixed in 5.12.1.)

Re: Unicode regular expressions
by AnomalousMonk (Monsignor) on Dec 05, 2009 at 23:59 UTC
    /^[\w\ \-(?:\P{M}\p{M}+)]+$/

    Also note that in a character set, the grouping and quantifier metacharacters have no meta-meanings, so the character set above explicitly includes the  ) + : ? ( characters, some of which are punctuation.
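A short sketch of that effect, using the class exactly as posted (the sub name and sample strings are illustrative). Besides the literal ( ? : + ) characters, \P{M} inside the brackets means "any character that is not a mark", which on its own admits almost everything.

```perl
use strict;
use warnings;

# The class from the question. Inside [...] the characters ( ? : + ) are
# plain literals, and \P{M} matches any single non-mark character.
my $class_re = qr/^[\w\ \-(?:\P{M}\p{M}+)]+$/;

sub accepted { return $_[0] =~ $class_re ? 1 : 0 }

my @samples = (
    [ 'metachars'   => '(?:+)' ],
    [ 'punctuation' => '$%^&' ],
    [ 'plain text'  => 'hello world' ],
);

for my $sample (@samples) {
    my ($label, $s) = @$sample;
    printf "%-12s %s\n", $label, accepted($s) ? 'accepted' : 'rejected';
}
```

All three samples are accepted, which is consistent with the "accepts just about anything" behaviour described in the question.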

Re: Unicode regular expressions
by eye (Chaplain) on Dec 06, 2009 at 09:10 UTC
    This is small, but no one has mentioned it. The OP wrote:

    Additionally ignoring the unicode characters the following works:
    /^[\w\ \-]+$/
    Because of the "\w", this will match underscores (_); I don't think that is what is intended by:
    ...regular letters, hyphens, unicode letters, numbers, spaces, newlines (\n or \r\n) but no punctuation of any sort.
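To illustrate, a minimal sketch comparing \w with the common "[^\W_]" idiom, which means "a word character other than underscore". The sub names are made up for this example.

```perl
use strict;
use warnings;

# The pattern from the question: \w lets the underscore through.
sub with_w { return $_[0] =~ /^[\w\ \-]+$/ ? 1 : 0 }

# [^\W_] is "not a non-word character and not an underscore".
# It is a negated class, so space and hyphen go in a second branch.
sub without_underscore { return $_[0] =~ /^(?:[^\W_]|[ \-])+$/ ? 1 : 0 }

for my $s ('foo_bar', 'foo-bar baz') {
    printf "%-12s with_w=%d without_underscore=%d\n",
        $s, with_w($s), without_underscore($s);
}
```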
Re: Unicode regular expressions
by JavaFan (Canon) on Dec 07, 2009 at 11:02 UTC
    I am trying to come up with a regular expression that will match against a string allowing regular letters, hyphens, unicode letters, numbers, spaces, newlines (\n or \r\n) but no punctuation of any sort.
    That's fairly trivial, once you know what you want to match, and what you don't want to. You say "hyphen" but "no punctuation of any sort". But a hyphen is punctuation of some sort. And in Unicode, there are many kinds of dashes. And what do you mean by "punctuation"? Do you consider a WHITE FROWNING FACE to be punctuation? What about a SNOWMAN? As for 'letters', Unicode defines what it considers 'letters'. Does that match your idea of letters? And numbers, do you mean digits? Anything numerical? And what are "spaces" in your definition? All 20+ spaces in the Unicode standard? Probably not, because that includes all the various newlines, and you mention them explicitly.

    In short, your definition of what you want to match and what you don't is too vague to do anything with. And once it's exact, writing the regexp is easy.
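As an illustration of how short the regexp becomes once a definition is pinned down, here is a sketch under one assumed definition, which is mine and not the OP's: letters (\p{L}), combining marks (\p{M}), numbers (\p{N}), the ASCII space, the ASCII hyphen-minus, and "\n" or "\r\n" line endings. Any other choice of categories would change the result.

```perl
use strict;
use warnings;

# ASSUMED definition (not from the thread): \p{L}, \p{M}, \p{N},
# ASCII space, ASCII hyphen-minus, and "\n" or "\r\n" line endings.
my $allowed = qr/^(?:\r\n|[\p{L}\p{M}\p{N} \n-])+\z/;

sub is_allowed { return $_[0] =~ $allowed ? 1 : 0 }

my @samples = (
    [ 'composed'   => "ksi\x{119}gowo\x{15B}\x{107}" ],
    [ 'decomposed' => "ksie\x{328}gowos\x{301}c\x{301}" ],
    [ 'punct'      => '$%%^&' ],
    [ 'apostrophe' => "don't" ],
    [ 'underscore' => 'foo_bar' ],
    [ 'snowman'    => "\x{2603}" ],
);

printf "%-10s %s\n", $_->[0], is_allowed($_->[1]) ? 'ok' : 'rejected'
    for @samples;
```

Under this definition both spellings of the Polish word pass, while the punctuation string, the apostrophe, the underscore and the SNOWMAN (a symbol, not a letter) are rejected.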

      I have had to deprioritize this particular project for now, but the answers so far contain a lot of useful information and experience which I will need to study. The main point is that people are picking up on my choice of requirements. If they are vague that might be a good thing, seeing as each interpretation of my requirements might elicit more useful information. However I can clarify. My test was really that the regular expression should accept "księgowość" but reject "$%%^&". I was surprised at how hard this was. More generally I was hoping the regular expression would capture "reasonable search terms". As such I would regard a Chinese sentence as valid but an emoticon character as invalid.
        Oh, you want to recognize words. You know, you don't have to leave the ASCII realm to realize that that is more tricky than just matching letters and not matching punctuation symbols. Not matching punctuation symbols means rejecting "don't" as a word.

        As for matching Unicode letters, we have:

            "ญᴥ一ךى" =~ /^\p{L}+$/
        
        which is a sequence of (Unicode) letters, but from 5 different scripts. Do you want to match that?
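If mixed scripts should be rejected, one possible approach is to require that the whole string come from a single script. The sketch below assumes, purely for illustration, that only Latin, Han and Hebrew matter; the sub names and the script list are not from the thread.

```perl
use strict;
use warnings;

# Any sequence of Unicode letters, regardless of script.
sub all_letters { return $_[0] =~ /^\p{L}+$/ ? 1 : 0 }

# ASSUMPTION: only these three scripts are of interest. Each branch
# requires the entire string to belong to one script.
sub single_script {
    return $_[0] =~ /^(?:\p{Latin}+|\p{Han}+|\p{Hebrew}+)$/ ? 1 : 0;
}

my @samples = (
    [ 'latin' => 'abc' ],
    [ 'han'   => "\x{4E2D}\x{6587}" ],
    [ 'mixed' => "a\x{4E00}\x{5DA}" ],   # Latin + Han + Hebrew
);

printf "%-6s all_letters=%d single_script=%d\n",
    $_->[0], all_letters($_->[1]), single_script($_->[1])
    for @samples;
```

The mixed sample passes the \p{L} test but fails the single-script test.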

        And then I haven't touched the can of worms called 'combining sequences'. Many (all?) of the accented Unicode characters can also be formed by taking the base character and adding the various decorations to it. Not to mention that most combinations of a base character and decorations don't have a Unicode code point of their own, and have to be made with combining sequences.
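A small sketch of both cases, using NFC from the core Unicode::Normalize module (the variable names are illustrative). NFC composes the OP's word completely, but "x" followed by a combining acute accent has no precomposed code point, so it stays as two code points even after normalization.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# Fully decomposed form of the OP's word; precomposed letters exist
# for all three base+mark pairs, so NFC shortens the string.
my $decomposed = "ksie\x{328}gowos\x{301}c\x{301}";
my $composed   = NFC($decomposed);

# "x" + COMBINING ACUTE ACCENT has no precomposed code point, so NFC
# leaves it as two code points; \X still sees one cluster.
my $no_precomposed = NFC("x\x{301}");

printf "decomposed=%d composed=%d x-acute=%d single \\X=%s\n",
    length($decomposed), length($composed), length($no_precomposed),
    ($no_precomposed =~ /^\X\z/ ? 'yes' : 'no');
```

Normalizing input before matching therefore removes some, but not all, combining marks, so a pattern that must accept accented text still needs to allow \p{M} or work in terms of \X.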
