Re: Remove unicode "whitespace"

by 7stud (Deacon)
on Feb 28, 2013 at 03:36 UTC (#1020983)

in reply to Remove unicode "whitespace"

I have a number of strings which terminate in the unicode character E2 80 8E.

There's no such thing as Unicode character E2 80 8E. Those three bytes are a UTF-8 encoding, i.e. a byte-level code, for a *Unicode code point* (an integer that stands for some character). The code point here is U+200E, which is the character LEFT-TO-RIGHT MARK (LRM).
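To make the bytes-versus-code-point distinction concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative; the byte values are the ones quoted in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes quoted in the question.
my $bytes = "\xE2\x80\x8E";

# Decoding them as UTF-8 yields a single character.
my $char = decode('UTF-8', $bytes);

my $length    = length($char);                  # number of characters
my $codepoint = sprintf('U+%04X', ord($char));  # code point in hex

print "$length character: $codepoint\n";

# Encoding the code point again gives back the original bytes.
my $roundtrip = join ' ',
    map { sprintf('%02X', ord($_)) } split //, encode('UTF-8', "\x{200E}");
print "$roundtrip\n";
```

If the strings arrive as raw bytes (for example from a file opened without an encoding layer), they need to be decoded like this before any character-level regex can see U+200E as one character.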

this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.

Unicode does not include LRM in the 26 characters it considers whitespace, and that list is what decides whether a character like this matches \s. Your challenge is going to be to pin down the category of characters that you want to strip off the end of your strings.
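A quick way to confirm this on your own perl is to test the character directly. This is only a sketch with illustrative variable names; it checks both \s and the Unicode White_Space property.

```perl
use strict;
use warnings;

my $lrm   = "\x{200E}";  # LEFT-TO-RIGHT MARK
my $space = " ";         # ordinary space, for comparison

my $space_matches      = $space =~ /\s/              ? 1 : 0;
my $lrm_matches_s      = $lrm   =~ /\s/              ? 1 : 0;
my $lrm_is_white_space = $lrm   =~ /\p{White_Space}/ ? 1 : 0;

print "space matches \\s:        $space_matches\n";
print "LRM matches \\s:          $lrm_matches_s\n";
print "LRM is \\p{White_Space}:  $lrm_is_white_space\n";
```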

The Unicode FORMAT category (invisible formatting indicators) does encompass the LRM character:

    use strict;
    use warnings;
    use 5.012;

    say hex("200E");  #8206

    my $str = "hello\N{LRM}";

    if ($str =~ /
            hello
            (               #Start of $1
                \p{FORMAT}  #One char in Unicode FORMAT category
            )               #End of $1
        /xms) {             #Standard flags
        say ord($1);  #8206
    }

Here's a list of the 139 characters in the FORMAT category.

Replies are listed 'Best First'.
Re^2: Remove unicode "whitespace"
by HYanWong (Acolyte) on Feb 28, 2013 at 11:20 UTC

    Great. Thanks for the tip about \p{FORMAT}, and the correction about Unicode terminology. I'll try stripping my strings using s/[\s\p{FORMAT}]*$//g then.
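The stripping idea can be sketched as follows. The helper name strip_trailing_invisibles and the sample string are hypothetical, and the sketch uses + with \z rather than * with $ and /g, which is one possible variation on the substitution, not the only one.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace and Format (Cf) characters.
sub strip_trailing_invisibles {
    my ($text) = @_;
    # \z anchors at the very end of the string, so one substitution is enough.
    $text =~ s/[\s\p{Format}]+\z//;
    return $text;
}

# Sample input: "hello", a space, LRM (U+200E), and a newline.
my $raw     = "hello \x{200E}\n";
my $cleaned = strip_trailing_invisibles($raw);

printf "%d -> %d characters\n", length($raw), length($cleaned);
```

Because the character class mixes \s and \p{Format}, whitespace and formatting characters are removed in any order at the end of the string. The input must already be a decoded character string for \p{Format} to see U+200E.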

Re^2: Remove unicode "whitespace"
by Ratazong (Monsignor) on Feb 28, 2013 at 07:49 UTC

    Unicode does not include LRM in the 26 characters it considers whitespace

    Just for completeness: the list of what is considered whitespace can be found here (subsection "white space").
