vsespb has asked for the wisdom of the Perl Monks concerning the following question:

I am wondering whether there is an easier way to force perl to treat the \w and \d metacharacters as ASCII-only, even for Unicode strings with the UTF-8 flag set. It needs to work on all perls starting from 5.8.x.
perl -e 'print 3+6 if "\x{424}" =~ /\w/'
9
Or is the only possibility to use [0-9] instead of \d, etc.?

My code works with Unicode, but at the same time I want to be able to do validation/security checks, and often (very often!) I need \d to match only the digits 0-9.
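To make the validation concern concrete, here is a minimal sketch (not from the original post; the variable names are illustrative). It assumes the input is a character string containing two ARABIC-INDIC digits, and compares a \d-based check with an explicit [0-9] check.

```perl
use strict;
use warnings;

# Two ARABIC-INDIC digits (U+0661, U+0662): Unicode decimal digits, not ASCII.
my $input = "\x{0661}\x{0662}";

my $passes_d     = $input =~ /\A\d+\z/    ? 1 : 0;   # Unicode-aware \d
my $passes_ascii = $input =~ /\A[0-9]+\z/ ? 1 : 0;   # explicit ASCII class

# Perl's string-to-number conversion only understands ASCII digits.
my $number = do { no warnings 'numeric'; $input + 0 };

print "passes \\d check:    $passes_d\n";
print "passes [0-9] check: $passes_ascii\n";
print "numeric value:      $number\n";
```

The \d check accepts the string even though Perl then treats it as 0 in numeric context, which is why a check meant to guarantee "only 0-9" cannot rely on plain \d here.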

The /a and /aa modifiers and use re '/a' produce syntax errors on versions of Perl before 5.14.

Replies are listed 'Best First'.
Re: Force ASCII regexp for all perls 5.8+
by kcott (Chancellor) on May 16, 2013 at 02:21 UTC

    G'day vsespb,

    You could set up something like this just once:

    my $d = qr{[0-9]};
    my $w = qr{[A-Za-z0-9_]};

    And then just use $w and $d instead of \w and \d. That gives you the added benefit of still having \w and \d available if you need them. Here's a minimal test:

    $ perl -le '
        my $w = qr{[A-Za-z0-9_]};
        print q{With \x{424}:};
        print 3+6 if "\x{424}" =~ /$w/;
        print q{With \x{42}:};
        print 3+6 if "\x{42}" =~ /$w/;
    '
    With \x{424}:
    With \x{42}:
    9

    -- Ken
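Building on kcott's suggestion, the precompiled classes can be interpolated into larger patterns. The following is only a sketch: the validator names (is_ascii_uint, is_ascii_decimal, is_ascii_ident) are hypothetical and do not appear in the thread.

```perl
use strict;
use warnings;

# ASCII-only building blocks, defined once (as in the reply above).
my $d = qr{[0-9]};
my $w = qr{[A-Za-z0-9_]};

# Hypothetical validators; \A and \z anchor the whole string, so a
# trailing newline is not silently accepted the way it is with $.
sub is_ascii_uint    { return $_[0] =~ /\A$d+\z/             ? 1 : 0 }
sub is_ascii_decimal { return $_[0] =~ /\A$d+(?:\.$d+)?\z/   ? 1 : 0 }
sub is_ascii_ident   { return $_[0] =~ /\A$w+\z/             ? 1 : 0 }

print is_ascii_uint("2013"), "\n";
print is_ascii_uint("\x{0661}\x{0662}"), "\n";
print is_ascii_decimal("3.14"), "\n";
print is_ascii_ident("user_42"), "\n";
print is_ascii_ident("\x{424}oo"), "\n";
```

Nothing here needs anything newer than qr//, so it should behave the same on 5.8 as on current perls.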

      Great idea!
Re: Force ASCII regexp for all perls 5.8+
by tobyink (Abbot) on May 16, 2013 at 09:17 UTC
    # First we'll define this pragma called Regexp::Ascii. This is where
    # the magic happens...
    #
    BEGIN {
        package Regexp::Ascii;
        # Mark the module as already loaded so that "use Regexp::Ascii" works
        $INC{'Regexp/Ascii.pm'} = __FILE__;

        use overload ();
        use Carp qw(croak carp);

        our %replace = (
            '\d' => '[0-9]',
            '\D' => '[^0-9]',
            '\w' => '[A-Za-z0-9_]',
            '\W' => '[^A-Za-z0-9_]',
        );

        sub import {
            shift;
            carp "Regexp::Ascii used with no parameters" unless @_;
            for (@_) {
                croak "Regexp::Ascii does not know what to do with '$_'"
                    unless exists $replace{$_};
            }
            # Alternation of the requested escapes, longest first
            my $find = join '|', map quotemeta,
                sort { length $b <=> length $a or $a cmp $b } @_;
            # Rewrite those escapes in regexp literals compiled in the
            # importing scope
            overload::constant 'qr' => sub {
                (my $re = shift) =~ s/($find)/$replace{$1}/eg;
                return $re;
            };
        }

        sub unimport {
            overload::remove_constant 'qr';
        }
    };

    # Here's our test case...
    #
    use strict;
    use warnings;
    use Test::More tests => 3;

    use constant ARABIC_INDIC_0 => "\x{0660}";

    like(ARABIC_INDIC_0, qr/^\d$/, 'unicode');

    {
        # use our pragma in this scope
        use Regexp::Ascii qw( \d \D );
        unlike(ARABIC_INDIC_0, qr/^\d$/, 'ascii');
    }

    like(ARABIC_INDIC_0, qr/^\d$/, 'unicode');
      Thanks! It also occurs to me now that one must not forget to implement "\D" (non-digit), "\W" (non-word), etc. as well.
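For the qr-variable approach, the negated classes can be defined alongside the positive ones. This is a minimal sketch; the names $D and $W are assumptions chosen to mirror \D and \W.

```perl
use strict;
use warnings;

# Positive classes, as suggested earlier in the thread.
my $d = qr{[0-9]};
my $w = qr{[A-Za-z0-9_]};

# Negated counterparts are spelled out explicitly: $d is a compiled
# regexp, not a bare character list, so it cannot be dropped into [^...].
my $D = qr{[^0-9]};
my $W = qr{[^A-Za-z0-9_]};

# U+0660 is not an ASCII digit, so it counts as a "non-digit" here.
my $arabic_is_nondigit = "\x{0660}" =~ /\A$D\z/ ? 1 : 0;
my $seven_is_nondigit  = "7"        =~ /\A$D\z/ ? 1 : 0;
# U+0424 is not an ASCII word character, so it counts as "non-word".
my $ef_is_nonword      = "\x{424}"  =~ /\A$W\z/ ? 1 : 0;

print "U+0660 is non-digit: $arabic_is_nondigit\n";
print "7 is non-digit:      $seven_is_nondigit\n";
print "U+0424 is non-word:  $ef_is_nonword\n";
```

Note that these negated classes treat every non-ASCII character as "non-digit" or "non-word", which is the intended ASCII-only behaviour but differs from what \D and \W do on Unicode strings.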
Re: Force ASCII regexp for all perls 5.8+
by davido (Cardinal) on May 16, 2013 at 00:14 UTC

    See the /a modifier in perlre. Is that what you're after?

    If you want to impact an entire lexical scope, you can:

    use re '/a';

    Update: Oh good grief, how did I miss that this is for 5.8? It's only staring me in the face in the subject line, and at the end of the post. I have no excuse. Sorry. ;)
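For readers who can require Perl 5.14 or newer (which the original poster cannot), here is a small sketch of how the /a modifier and the scoped pragma behave. The variable names are illustrative only.

```perl
use 5.014;      # /a and "use re '/a'" need Perl 5.14 or newer
use warnings;

my $arabic_zero = "\x{0660}";   # ARABIC-INDIC DIGIT ZERO

# Default rules: \d matches any Unicode decimal digit.
my $default = $arabic_zero =~ /^\d$/  ? 1 : 0;

# With /a, \d is restricted to ASCII 0-9 for this one pattern.
my $ascii   = $arabic_zero =~ /^\d$/a ? 1 : 0;

# With the pragma, every pattern compiled in the enclosing block gets /a.
my $scoped;
{
    use re '/a';
    $scoped = $arabic_zero =~ /^\d$/ ? 1 : 0;
}

print "default: $default, /a: $ascii, use re '/a': $scoped\n";
```

On 5.8 through 5.12 this script will not compile, which is the limitation raised in the question.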