Re: Remove unicode "whitespace"

by 7stud (Deacon)
on Feb 28, 2013 at 03:36 UTC ( #1020983=note )

in reply to Remove unicode "whitespace"

I have a number of strings which terminate in the unicode character E2 80 8E.

There's no such thing as the Unicode character E2 80 8E. Those three bytes are a UTF-8 encoding, i.e. a byte-level code, for a Unicode code point (an integer that represents some character in some script). The code point here is U+200E, which is the character LEFT-TO-RIGHT MARK (LRM).
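To make the distinction between bytes and code points concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative, and it assumes the strings arrive as raw UTF-8 bytes that have not been decoded yet.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes from the question, as a raw byte string.
my $bytes = "\xE2\x80\x8E";

# Decoding UTF-8 turns the three bytes into a one-character string.
my $char = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($char);
printf "code point: U+%04X\n", ord($char);

# Encoding goes the other way, back to the same three bytes.
my $round_trip = encode('UTF-8', $char);
print join(' ', map { sprintf '%02X', ord } split //, $round_trip), "\n";
```

If the data is never decoded, a regex sees three separate bytes rather than one LRM character, so character classes such as \s or \p{...} cannot be expected to match it.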

this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.

Unicode does not include LRM in the 26 characters it considers whitespace, and that list is the final word on what \s will match. Your challenge is going to be to pin down the category of characters that you want to strip off the end of your strings.
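A quick way to check which class a given character falls into is to test it against each candidate. The following is only a sketch: the helper name classes() and the choice of sample code points are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the two classes a character matches.
sub classes {
    my ($char) = @_;
    my @hits;
    push @hits, '\s'         if $char =~ /\s/;
    push @hits, '\p{Format}' if $char =~ /\p{Format}/;
    return @hits;
}

# SPACE, EM SPACE, LEFT-TO-RIGHT MARK, ZERO WIDTH SPACE
for my $cp (0x0020, 0x2003, 0x200E, 0x200B) {
    printf "U+%04X matches: %s\n",
        $cp, join(', ', classes(chr $cp)) || 'neither';
}
```

On a reasonably recent Perl this should report \s for the two spaces and \p{Format} for LRM and ZERO WIDTH SPACE, which is why an "invisible" character is not necessarily whitespace.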

The Unicode Format general category (Cf, invisible formatting indicators) does include the LRM character:

use strict;
use warnings;
use 5.012;

say hex("200E");    #8206

my $str = "hello\N{LRM}";

if ($str =~ /
        hello
        (               #Start of $1
            \p{FORMAT}  #One char in Unicode FORMAT category
        )               #End of $1
    /xms) {             #Standard flags
    say ord($1);        #8206
}

Here's a list of the 139 characters in the FORMAT category.

Replies are listed 'Best First'.
Re^2: Remove unicode "whitespace"
by HYanWong (Acolyte) on Feb 28, 2013 at 11:20 UTC

    Great. Thanks for the tip about \p{FORMAT}, and the correction about Unicode terminology. I'll try stripping my strings using s/[\s\p{FORMAT}]*$//g then.
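The stripping step could be wrapped up roughly as follows. This is a sketch, not a tested recipe: the sub name strip_trailing_invisibles is made up for illustration, and it uses + with \z rather than * with $ and /g so that the substitution only fires when there is something to remove at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace and Format (Cf) characters.
sub strip_trailing_invisibles {
    my ($text) = @_;
    $text =~ s/[\s\p{Format}]+\z//;
    return $text;
}

# "hello", a space, two LRM characters, and another space.
my $raw   = "hello \x{200E}\x{200E} ";
my $clean = strip_trailing_invisibles($raw);

printf "before: %d characters, after: %d characters\n",
    length($raw), length($clean);
```

One caveat: the Format category also covers characters such as ZERO WIDTH JOINER (U+200D) and ZERO WIDTH NON-JOINER (U+200C), which carry meaning in some scripts, so it may be worth checking that a blanket \p{Format} is really what the data calls for.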

Re^2: Remove unicode "whitespace"
by Ratazong (Monsignor) on Feb 28, 2013 at 07:49 UTC

    Unicode does not include LRM in the 26 characters it considers whitespace

    Just for completeness: the list of what is considered whitespace can be found here (sub-section white space).
