Force ASCII regexp for all perls 5.8+

by vsespb (Hermit)
on May 15, 2013 at 23:57 UTC ( #1033748=perlquestion )
vsespb has asked for the wisdom of the Perl Monks concerning the following question:

I am wondering whether there is an easier way to force perl to treat the \w and \d metacharacters as ASCII-only, even for Unicode strings with the UTF-8 flag set. It needs to work on all perls starting from 5.8.x.
perl -e 'print 3+6 if "\x{424}" =~ /\w/'
9
Or is the only possibility to use [0-9] instead of \d, etc.?

My code works with Unicode, but at the same time I want to be able to do validation/security checks, and often (very often!) I need \d to match only the digits 0-9.

The /a and /aa modifiers, and use re '/a', produce syntax errors on earlier versions of Perl.

Re: Force ASCII regexp for all perls 5.8+
by davido (Archbishop) on May 16, 2013 at 00:14 UTC

    See the /a modifier in perlre. Is that what you're after?

    If you want to impact an entire lexical scope, you can:

    use re '/a';

    Update: Oh good grief, how did I miss that this is for 5.8? It's only staring me in the face in the subject line, and at the end of the post. I have no excuse. Sorry. ;)


    Dave

Re: Force ASCII regexp for all perls 5.8+
by kcott (Abbot) on May 16, 2013 at 02:21 UTC

    G'day vsespb,

    You could set up something like this just once:

    my $d = qr{[0-9]};
    my $w = qr{[A-Za-z0-9_]};

    And then just use $w and $d instead of \w and \d. That gives you the added benefit of still having \w and \d available if you need them. Here's a minimal test:

    $ perl -le '
        my $w = qr{[A-Za-z0-9_]};
        print q{With \x{424}:};
        print 3+6 if "\x{424}" =~ /$w/;
        print q{With \x{42}:};
        print 3+6 if "\x{42}" =~ /$w/;
    '
    With \x{424}:
    With \x{42}:
    9

    -- Ken

      Great idea!
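As a rough illustration of how the $d and $w building blocks above might be used for validation, here is a minimal sketch. The helper names is_ascii_uint and is_ascii_ident are hypothetical and not from the thread. The patterns are anchored with \A and \z rather than ^ and $, because $ also matches just before a trailing newline, which is usually not wanted in a security check.

```perl
use strict;
use warnings;

# ASCII-only building blocks, as suggested in the reply above.
my $d = qr{[0-9]};
my $w = qr{[A-Za-z0-9_]};

# Hypothetical validators (illustrative names, not part of any module).
# \A and \z anchor to the true start and end of the string.
sub is_ascii_uint {
    my ($s) = @_;
    return ( defined($s) && $s =~ /\A$d+\z/ ) ? 1 : 0;
}

sub is_ascii_ident {
    my ($s) = @_;
    return ( defined($s) && $s =~ /\A$w+\z/ ) ? 1 : 0;
}

print "2013 is accepted as an unsigned integer\n" if is_ascii_uint("2013");
print "ARABIC-INDIC DIGIT ZERO is rejected\n"     if !is_ascii_uint("\x{0660}");
print "a trailing newline is rejected\n"          if !is_ascii_uint("12\n");
print "CYRILLIC CAPITAL LETTER EF is rejected\n"  if !is_ascii_ident("\x{424}");
```

Nothing here needs the /a modifier, so the sketch should also compile on perls that predate it.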
Re: Force ASCII regexp for all perls 5.8+
by tobyink (Abbot) on May 16, 2013 at 09:17 UTC
    # First we'll define this pragma called Regexp::Ascii. This is where
    # the magic happens...
    #
    BEGIN {
        package Regexp::Ascii;
        $INC{'Regexp/Ascii.pm'} = __FILE__;

        use overload ();
        use Carp qw(croak carp);

        our %replace = (
            '\d' => '[0-9]',
            '\D' => '[^0-9]',
            '\w' => '[A-Za-z0-9_]',
            '\W' => '[^A-Za-z0-9_]',
        );

        sub import {
            shift;
            carp "Regexp::Ascii used with no parameters" unless @_;
            for (@_) {
                croak "Regexp::Ascii does not know what to do with '$_'"
                    unless exists $replace{$_};
            }
            my $find = join '|', map quotemeta,
                sort { length $b <=> length $a or $a cmp $b } @_;
            overload::constant 'qr' => sub {
                my $find = (my $re = shift) =~ s/($find)/$replace{$1}/eg;
                return $re;
            }
        }

        sub unimport {
            overload::remove_constant 'qr';
        }
    };

    # Here's our test case...
    #
    use strict;
    use warnings;
    use Test::More tests => 3;

    use constant ARABIC_INDIC_0 => "\x{0660}";

    like(ARABIC_INDIC_0, qr/^\d$/, 'unicode');

    {
        # use our pragma in this scope
        use Regexp::Ascii qw( \d \D );
        unlike(ARABIC_INDIC_0, qr/^\d$/, 'ascii');
    }

    like(ARABIC_INDIC_0, qr/^\d$/, 'unicode');
      Thanks! It also occurs to me now that one must not forget to implement "\D" (non-digit), "\W" (non-word), etc. as well.
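Following up on the \D and \W remark, the precompiled-pattern approach from earlier in the thread can be extended with negated classes. This is only a sketch. The variable names $nd and $nw are made up for illustration, and the classes simply mirror the replacement table in the pragma above.

```perl
use strict;
use warnings;

# ASCII-only stand-ins for \d, \w and their negations.
# $nd and $nw are illustrative names, not from the thread.
my $d  = qr{[0-9]};
my $w  = qr{[A-Za-z0-9_]};
my $nd = qr{[^0-9]};           # plays the role of \D
my $nw = qr{[^A-Za-z0-9_]};    # plays the role of \W

my $arabic_zero = "\x{0660}";  # ARABIC-INDIC DIGIT ZERO
my $cyrillic_ef = "\x{424}";   # CYRILLIC CAPITAL LETTER EF

# Under ASCII rules both characters fall on the "non" side.
print "U+0660 counts as a non-digit\n"         if $arabic_zero =~ /\A$nd\z/;
print "U+0424 counts as a non-word character\n" if $cyrillic_ef =~ /\A$nw\z/;
```

Each of these is a complete bracketed class, so it cannot be nested inside another one. For something like "digits or dots", write [0-9.] directly rather than [$d.].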
