Remove unicode "whitespace"

by HYanWong (Acolyte)
on Feb 27, 2013 at 22:56 UTC (#1020973=perlquestion)
HYanWong has asked for the wisdom of the Perl Monks concerning the following question:

I have a number of strings which terminate in the unicode character E2 80 8E. A bit of searching tells me this is the left-to-right mark (LRM), and it's not uncommon to find this in user-inputted data. These have been converted from user-inputted links in wikimedia commons, such as

I'm trying to trim "whitespace" from the end of these strings, and this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE. I guess there are a number of other unicode control characters that are basically pointless when at the end of a string. Are there any perl modules that will trim strings, taking these unicode characters into account? Or do I have to look them all up myself :(

Replies are listed 'Best First'.
Re: Remove unicode "whitespace"
by 7stud (Deacon) on Feb 28, 2013 at 03:36 UTC

    I have a number of strings which terminate in the unicode character E2 80 8E.

    There's no such thing as unicode character E2 80 8E. That is a UTF-8 encoding, i.e. a secret code, for some *unicode integer* (where the unicode integer represents some character in some language). The unicode integer is actually U+200E, which represents the character LRM.
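    The distinction between the three UTF-8 bytes and the single code point can be checked directly. The following is a minimal sketch using the core Encode module; the variable names are only illustrative and are not from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes from the question, as a raw byte string.
my $bytes = "\xE2\x80\x8E";

# Decoding UTF-8 turns the three bytes into one character (code point).
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "code point: U+%04X\n", ord($chars);

# Encoding goes the other way: one code point back to three bytes.
my $round_trip = encode('UTF-8', "\x{200E}");
print "round trip matches\n" if $round_trip eq $bytes;
```

    If the strings are still raw bytes when the regex runs, \p{FORMAT} will not see U+200E at all, so decoding first is probably the step to check.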

    this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.

    Unicode does not include LRM in the 26 characters it considers whitespace, so that is the final word on what will match \s. Your challenge is going to be to elucidate the category of characters that you want to strip off the end of your strings.

    The unicode FORMAT category (invisible formatting indicators) does encompass the LRM character:

    use strict;
    use warnings;
    use 5.012;

    say hex("200E");    #8206

    my $str = "hello\N{LRM}";

    if ($str =~ /
        hello
        (               #Start of $1
          \p{FORMAT}    #One char in Unicode FORMAT category
        )               #End of $1
    /xms) {             #Standard flags
        say ord($1);    #8206
    }

    Here's a list of the 139 characters in the FORMAT category.

      Great. Thanks for the tip about \p{FORMAT}, and the correction about Unicode terminology. I'll try stripping my strings using s/[\s\p{FORMAT}]+$// then.
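      The trailing-strip idea above could be wrapped in a small helper. This is only a sketch: the name trim_trailing_invisible is hypothetical, and it assumes the input is already a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace and Format (Cf)
# characters, leaving any such characters inside the string alone.
sub trim_trailing_invisible {
    my ($text) = @_;
    $text =~ s/[\s\p{Format}]+\z//;
    return $text;
}

my $raw     = "File:Atelerix_algirus.jpg\x{200E} \x{200E}\n";
my $trimmed = trim_trailing_invisible($raw);

printf "before: %d characters, after: %d characters\n",
    length($raw), length($trimmed);
```

      The sketch anchors with \z rather than $; since \s already covers a trailing newline, the two should behave the same here, and \z just makes "very end of string" explicit.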

      Unicode does not include LRM in the 26 characters it considers whitespace

      Just for completeness: the list of what is considered whitespace can be found here (sub-section white space).
Re: Remove unicode "whitespace"
by Khen1950fx (Canon) on Feb 28, 2013 at 05:23 UTC
    Are you sure that LRM is just "whitespace"? I did some googling, and I'm getting a different take on it. As I understand it, LRM is a zero-width bidirectional control character that the Bidi algorithm uses to determine the direction of mixed-direction text. If that's true (I could be wrong), then you don't want to trim the LRMs from the links. ikegami could probably explain it better:-).

      You're right that it is the LRM character, and so shouldn't be stripped in general (so it's sensible that it doesn't match \s). But it is useless at the end of a string, hence my suggestion that it should be considered something like whitespace in that context. I hoped there might be a function to trim the end of strings for this specific purpose. Or if not, something generic I could add to a RE to strip unicode characters of this nature.

        Give URI::Encode a try.
        #!/usr/bin/perl -l
        use strict;
        use warnings;
        use URI::Encode qw(uri_decode);

        my $encoded = ' /wiki/File:Atelerix_algirus.jpg%E2%80%8E';
        print uri_decode($encoded);
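        Decoding the percent-escapes does not by itself remove the LRM, so a trim step is still needed afterwards. Below is a core-only sketch that does the percent-decoding by hand instead of using URI::Encode; the helper name clean_link is hypothetical, and it assumes the escaped bytes are UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: percent-decode a link, interpret the resulting
# bytes as UTF-8, then drop trailing whitespace and Format (Cf)
# characters such as LRM.
sub clean_link {
    my ($link) = @_;
    (my $bytes = $link) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/[\s\p{Format}]+\z//;
    return $text;
}

my $encoded = '/wiki/File:Atelerix_algirus.jpg%E2%80%8E';
print clean_link($encoded), "\n";
```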
Re: Remove unicode "whitespace"
by ikegami (Pope) on Mar 01, 2013 at 10:25 UTC
    $ unichars -au '\s'
     ---- U+00009 CHARACTER TABULATION
     ---- U+0000A LINE FEED (LF)
     ---- U+0000C FORM FEED (FF)
     ---- U+0000D CARRIAGE RETURN (CR)
     ---- U+00020 SPACE
     ---- U+00085 NEXT LINE (NEL)
     ---- U+000A0 NO-BREAK SPACE
     ---- U+01680 OGHAM SPACE MARK
     ---- U+0180E MONGOLIAN VOWEL SEPARATOR
     ---- U+02000 EN QUAD
     ---- U+02001 EM QUAD
     ---- U+02002 EN SPACE
     ---- U+02003 EM SPACE
     ---- U+02004 THREE-PER-EM SPACE
     ---- U+02005 FOUR-PER-EM SPACE
     ---- U+02006 SIX-PER-EM SPACE
     ---- U+02007 FIGURE SPACE
     ---- U+02008 PUNCTUATION SPACE
     ---- U+02009 THIN SPACE
     ---- U+0200A HAIR SPACE
     ---- U+02028 LINE SEPARATOR
     ---- U+02029 PARAGRAPH SEPARATOR
     ---- U+0202F NARROW NO-BREAK SPACE
     ---- U+0205F MEDIUM MATHEMATICAL SPACE
     ---- U+03000 IDEOGRAPHIC SPACE

    $ uniprops -a U+200E
    U+200E ‹U+200E› \N{LEFT-TO-RIGHT MARK}
        \pC \p{Cf}
        All Any Assigned Bidi_C Bidi_Control BidiC InGeneralPunctuation C Other
        Case_Ignorable CI Cf Format Changes_When_NFKC_Casefolded CWKCF Common Zyyy
        Default_Ignorable_Code_Point DI General_Punctuation Graph Pat_WS
        Pattern_White_Space PatWS Print X_POSIX_Graph X_POSIX_Print
        Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
        Block=General_Punctuation Canonical_Combining_Class=0
        Canonical_Combining_Class=Not_Reordered CCC=NR
        Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None
        East_Asian_Width=Neutral Grapheme_Cluster_Break=CN
        Grapheme_Cluster_Break=Control GCB=CN Hangul_Syllable_Type=NA
        Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T
        Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN
        Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1
        Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2
        Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
        Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
        SC=Zyyy Script=Zyyy Sentence_Break=FO Sentence_Break=Format SB=FO
        Word_Break=FO Word_Break=Format WB=FO _Case_Ignorable
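    A few of the properties listed above can be tested from Perl itself. This is a rough sketch, with the choice of which properties to check being mine rather than ikegami's; it reports whether U+200E matches \s and three of the properties from the uniprops output.

```perl
use strict;
use warnings;

my $lrm = "\x{200E}";

# Each entry pairs a label with a compiled pattern to test against LRM.
my @checks = (
    [ '\s'                      => qr/\s/ ],
    [ '\p{Cf}'                  => qr/\p{Cf}/ ],
    [ '\p{Bidi_Control}'        => qr/\p{Bidi_Control}/ ],
    [ '\p{Pattern_White_Space}' => qr/\p{Pattern_White_Space}/ ],
);

for my $check (@checks) {
    my ($label, $re) = @$check;
    printf "%-26s %s\n", $label, ($lrm =~ $re ? 'matches' : 'does not match');
}
```

    Which property to strip on is a judgment call: \p{Cf} is broad, while \p{Bidi_Control} would target only the directional marks.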

Node Type: perlquestion [id://1020973]
Approved by Athanasius