
Can I change \s?

by pileofrogs (Priest)
on Oct 28, 2011 at 20:40 UTC (#934506, perlquestion)
pileofrogs has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to change what '\s' means? It's not matching the non-breaking-space (chr(160)) that I found in this file I have to parse and I'd like it to.

I know, I could just edit my regex to say [chr(160)\s] or something like that, but that's not as fun. And it wouldn't change the behavior of any regexen I didn't manually edit, like ones in someone else's module, for instance.

Said another way, I'd like

my $string = ' '.chr(160).' ';
if ( $string =~ /^\s*$/ ) {
    print "Happy Happy!";
}

... to print "Happy Happy!"



Replies are listed 'Best First'.
Re: Can I change \s?
by ikegami (Pope) on Oct 28, 2011 at 21:57 UTC

    \s is supposed to match U+00A0, and sometimes does. That it does not always match is a bug in Perl that cannot be fixed by default for backwards-compatibility reasons. You can opt in to the fix with the unicode_strings feature. In earlier versions of Perl, you can also work around the bug by upgrading the string's internal format.

    use strict;
    use warnings;
    use feature qw( say );

    my $string = chr(160);

    no feature qw( unicode_strings );
    say $string =~ /^\s*$/ ? 1 : 0;   # 0

    use feature qw( unicode_strings );
    say $string =~ /^\s*$/ ? 1 : 0;   # 1

    no feature qw( unicode_strings );
    utf8::upgrade($string);
    say $string =~ /^\s*$/ ? 1 : 0;   # 1
      Note that the unicode_strings feature was introduced in 5.12, but to make \s match \xA0 in a non-UTF-8 string, you need 5.14:

      $ perl-5.14.2 -wE '$_ = chr 0xA0; say /\s/ || 0'
      1
      $ perl-5.12.2 -wE '$_ = chr 0xA0; say /\s/ || 0'
      0

      The -E switch implicitly enables "use feature 'unicode_strings'".
Re: Can I change \s?
by JavaFan (Canon) on Oct 28, 2011 at 21:27 UTC
    It might be easier to upgrade your perl to 5.14:
    $ perl-5.14.2 -wE 'say "\xA0" =~ /\s/'
    1
    $ perl-5.14.2 -wle 'print "\xA0" =~ /\s/u'
    1
    $ perl-5.12.2 -wE 'say "\xA0" =~ /\s/'

    $
    Alternatively, upgrade your string to UTF-8:
    $ perl -wle '$_ = chr 0xA0; utf8::upgrade($_); print /\s/'
    1
    $
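The string-upgrade approach above can be wrapped in a small helper for code that has to run on older perls. This is only a sketch: the name is_blank is hypothetical and not from the thread, and it assumes the data is a character string in which chr(160) really is a no-break space.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if $text is empty or
# consists only of whitespace, counting chr(160) as whitespace.
sub is_blank {
    my ($text) = @_;
    my $copy = $text;        # work on a copy so the caller's string is untouched
    utf8::upgrade($copy);    # changes the internal format only, not the characters
    return $copy =~ /^\s*$/ ? 1 : 0;
}

print "Happy Happy!\n" if is_blank(' ' . chr(160) . ' ');
```

Like the other workarounds here, this only affects matches made through the helper; regexes inside someone else's module still see whatever string they are handed.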
Re: Can I change \s?
by runrig (Abbot) on Oct 28, 2011 at 20:55 UTC
    You can:
    my $sp_str = "[\\s".chr(160)."]";
    my $sp = qr/$sp_str/;
    my $str = " ".chr(160)." ";
    print "Happy\n" if $str =~ /^$sp*$/;
      Have a look at perldoc perlrecharclass (Perl Regular Expression Character Classes). If your string is in UTF-8 format, then \s will match all of:
      (h = matched by \h, v = matched by \v, s = matched by \s)

      0x00009  CHARACTER TABULATION       h  s
      0x0000a  LINE FEED (LF)             v  s
      0x0000b  LINE TABULATION            v
      0x0000c  FORM FEED (FF)             v  s
      0x0000d  CARRIAGE RETURN (CR)       v  s
      0x00020  SPACE                      h  s
      0x00085  NEXT LINE (NEL)            v  s
      0x000a0  NO-BREAK SPACE             h  s
      0x01680  OGHAM SPACE MARK           h  s
      0x0180e  MONGOLIAN VOWEL SEPARATOR  h  s
      0x02000  EN QUAD                    h  s
      0x02001  EM QUAD                    h  s
      0x02002  EN SPACE                   h  s
      0x02003  EM SPACE                   h  s
      0x02004  THREE-PER-EM SPACE         h  s
      0x02005  FOUR-PER-EM SPACE          h  s
      0x02006  SIX-PER-EM SPACE           h  s
      0x02007  FIGURE SPACE               h  s
      0x02008  PUNCTUATION SPACE          h  s
      0x02009  THIN SPACE                 h  s
      0x0200a  HAIR SPACE                 h  s
      0x02028  LINE SEPARATOR             v  s
      0x02029  PARAGRAPH SEPARATOR        v  s
      0x0202f  NARROW NO-BREAK SPACE      h  s
      0x0205f  MEDIUM MATHEMATICAL SPACE  h  s
      0x03000  IDEOGRAPHIC SPACE          h  s
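A few rows of the table can be spot-checked on your own perl. This is a rough sketch: matches_s is a hypothetical helper name, the code points are a small sample picked from the list above, and the result for any given row may depend on the Perl and Unicode versions in use.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the character with the given code point
# matches \s once the string is in the upgraded (UTF-8) internal format.
sub matches_s {
    my ($code_point) = @_;
    my $char = chr $code_point;
    utf8::upgrade($char);    # no-op for code points above 0xFF, needed for 0x80-0xFF
    return $char =~ /\s/ ? 1 : 0;
}

# Sample of code points taken from the table.
for my $cp (0x09, 0x20, 0x85, 0xA0, 0x2003, 0x3000) {
    printf "U+%04X %s\n", $cp, matches_s($cp) ? 'matches \s' : 'does not match \s';
}
```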


      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
