I am trying to come up with a regular expression that will match a string containing ordinary letters, hyphens, Unicode letters, numbers, spaces and newlines (\n or \r\n), but no punctuation of any sort. For the moment I am ignoring the newlines completely.
If I also ignore the Unicode characters, the following works:
/^[\w\ \-]+$/
Adding in the Unicode characters, the following ought to work according to the docs:
/^[\w\ \-\X]+$/
However I actually get the following error:
Unrecognized escape \X in character class passed through in regex; marked by <-- HERE in m/^[\w\ \-\X <-- HERE ]+$/ at ./b.pl line 10.
The docs suggest a somewhat more complicated alternative to \X, which does sort of work:
/^[\w\ \-(?:\P{M}\p{M}+)]+$/
This works in the sense that it accepts Unicode characters, but it also accepts just about anything else. I have tried ways round this, such as zero-width assertions and other possibilities opened up by reading about Unicode properties. However, everything I have tried seems either to reject Unicode or to allow everything.
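A possible explanation for the "accepts anything" behaviour, offered as a minimal sketch rather than a definitive diagnosis: inside a bracketed character class, the sequence (?:\P{M}\p{M}+) is not treated as a group. The characters ( ? : ) + become literal members of the class, and \P{M} adds every character that is not a combining mark, which includes punctuation. The test strings below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same pattern as above. Within [...] the characters ( ? : ) + are
# literals, and \P{M} contributes every non-mark character to the class.
my $class = qr/^[\w\ \-(?:\P{M}\p{M}+)]+$/;

# Hypothetical inputs: plain letters, punctuation, and the "group" syntax itself.
for my $s ('abc', 'a,b;c!', '(?:)+') {
    printf "%-8s %s\n", $s, ($s =~ $class ? 'matches' : 'no match');
}
```

If that reading is right, the class is a superset of \P{M}, so restricting it means listing the wanted properties explicitly instead of pasting the \X expansion into the brackets.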
I have encapsulated the behaviour in the following scriptlet:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    my $line = $_;
    if ($line =~ /^[\w\ \-]+$/) {
        print "STRAIGHT - ";
    }
    if ($line =~ /^[\w\ \-(?:\P{M}\p{M}+)]+$/) {
        print "OKAY\n";
    }
    else {
        print "BLAH!\n";
    }
}
My standard example of a unicode word is "księgowość".
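One direction that might be worth testing, sketched under stated assumptions: build the class from Unicode properties (\p{L} letters, \p{M} combining marks, \p{N} numbers) plus space and hyphen, and make sure the string has been decoded to characters first. Here `use utf8` decodes the literals in the source; for input read from standard input, something like binmode STDIN, ':encoding(UTF-8)' would be needed instead. The variable name $allowed and the extra test strings are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # string literals below are UTF-8 source text
binmode STDOUT, ':encoding(UTF-8)';   # encode output so the words print correctly

# Letters, combining marks, numbers, spaces and hyphens only.
# \w is left out on purpose because it also admits the underscore.
my $allowed = qr/^[\p{L}\p{M}\p{N}\ \-]+$/;

for my $s ('księgowość', 'plain-text 42', 'no, thanks!', 'under_score') {
    print "$s: ", ($s =~ $allowed ? "OKAY" : "BLAH!"), "\n";
}
```

Without the decoding step, each byte of a UTF-8 encoded word is seen as a separate character, which may be why \w on its own rejects "księgowość" in the scriptlet.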
When I run "perl -V" I get the following:
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.26-2-amd64, archname=i486-linux-gnu-thread-multi
    uname='linux puccini 2.6.26-2-amd64 #1 smp fri aug 14 07:12:04 utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'
Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_ITHREADS
                        USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Aug 28 2009 22:15:29
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .