Re: Remove unicode "whitespace"

by 7stud (Deacon)
on Feb 28, 2013 at 03:36 UTC (#1020983)


in reply to Remove unicode "whitespace"

I have a number of strings which terminate in the unicode character E2 80 8E.

There's no such thing as Unicode character E2 80 8E. That byte sequence is a UTF-8 encoding, i.e. a secret code, for a Unicode code point (an integer that represents some character in some language). The code point here is U+200E, which represents the character LEFT-TO-RIGHT MARK (LRM).
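To make the bytes-versus-code-point distinction concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative, and it assumes the three bytes arrive as a raw, undecoded byte string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes from the question, as a raw byte string.
my $bytes = "\xE2\x80\x8E";

# Decoding turns the UTF-8 bytes into one character (one code point).
my $char = decode('UTF-8', $bytes);

printf "%d bytes decode to %d character\n", length($bytes), length($char);
printf "code point: U+%04X\n", ord($char);    # U+200E

# Encoding goes the other way: code point back to UTF-8 bytes.
my $again = encode('UTF-8', "\x{200E}");
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $again;
print "$hex\n";                               # E2 80 8E
```

If the strings were already decoded on input (for example via an `:encoding(UTF-8)` layer), the LRM is a single character and no `decode` call is needed.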

this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.

Unicode does not include LRM in the 26 characters it considers whitespace, so \s will not match it. Your challenge is going to be to define the category of characters that you want to strip off the end of your strings.
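A quick check of that claim, as a sketch: U+2003 (EM SPACE) is used here only as an example of a character that Unicode does class as whitespace, and the hash keys are illustrative names.

```perl
use strict;
use warnings;

my $lrm      = "\x{200E}";    # LEFT-TO-RIGHT MARK
my $em_space = "\x{2003}";    # EM SPACE, a Unicode whitespace character

my %check = (
    lrm_is_space      => ($lrm      =~ /\s/         ? 1 : 0),
    lrm_is_format     => ($lrm      =~ /\p{Format}/ ? 1 : 0),
    em_space_is_space => ($em_space =~ /\s/         ? 1 : 0),
);

# LRM is not whitespace, but it is in the Format category.
printf "%-18s %d\n", $_, $check{$_} for sort keys %check;
```

So a pattern built on \s alone leaves a trailing LRM in place, while \s still catches the less common Unicode spaces.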

The Unicode FORMAT category (invisible formatting indicators) does include the LRM character:

    use strict;
    use warnings;
    use 5.012;

    say hex("200E");    #8206

    my $str = "hello\N{LRM}";

    if ($str =~ /
        hello
        (               #Start of $1
        \p{FORMAT}      #One char in Unicode FORMAT category
        )               #End of $1
    /xms) {             #Standard flags
        say ord($1);    #8206
    }

Here's a list of the 139 characters in the FORMAT category.


Re^2: Remove unicode "whitespace"
by Ratazong (Prior) on Feb 28, 2013 at 07:49 UTC

    Unicode does not include LRM in the 26 characters it considers whitespace

    Just for completeness: the list of what is considered whitespace can be found here (sub-section white space).
Re^2: Remove unicode "whitespace"
by HYanWong (Acolyte) on Feb 28, 2013 at 11:20 UTC

    Great. Thanks for the tip about \p{FORMAT}, and the correction about Unicode terminology. I'll try stripping my strings using s/[\s\p{FORMAT}]*$//g then.
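A minimal sketch of stripping trailing whitespace and FORMAT characters with a substitution. The helper name `strip_trailing_invisible` is hypothetical, and the sketch assumes the input is already a decoded character string:

```perl
use strict;
use warnings;

# Hypothetical helper: remove any run of whitespace and Unicode
# Format-category characters from the end of a string.
sub strip_trailing_invisible {
    my ($text) = @_;
    $text =~ s/[\s\p{Format}]+\z//;
    return $text;
}

my $raw     = "hello \x{200E}";    # "hello", a space, then LRM
my $cleaned = strip_trailing_invisible($raw);

printf "before: %d chars, after: %d chars\n", length($raw), length($cleaned);
```

This variant uses `+` and `\z` rather than `*` and `$`, so the pattern only matches when there is something to remove and is anchored to the true end of the string. With that anchor a single match suffices and no /g flag is needed.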
