
Re: Remove unicode "whitespace"

by Khen1950fx (Canon)
on Feb 28, 2013 at 05:23 UTC

in reply to Remove unicode "whitespace"

Are you sure that LRM is just "whitespace"? I did some googling, and I'm getting a different take on it. As I understand it, LRM (LEFT-TO-RIGHT MARK, U+200E) is a zero-width bidirectional control character that the Bidi algorithm uses to determine text direction in mixed-direction data. If that's true (and I could be wrong), then you don't want to trim the LRMs from the links. ikegami could probably explain it better :-).

Re^2: Remove unicode "whitespace"
by HYanWong (Acolyte) on Feb 28, 2013 at 11:16 UTC

    You're right that it is the LRM character, and so it shouldn't be stripped in general (so it's sensible that it doesn't match \s). But it is useless at the end of a string, hence my suggestion that it should be treated as something like whitespace in that context. I hoped there might be a function that trims the end of a string for this specific purpose or, failing that, something generic I could add to a regex to strip Unicode characters of this nature.
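One possible shape for the "something generic" asked about here is a substitution anchored at the end of the string that removes whitespace together with Unicode format characters (general category Cf, which includes U+200E LRM and U+200F RLM). This is only a sketch: the helper name `trim_trailing_invisible` is made up for illustration, and treating every Cf character as trimmable is an assumption that may be too broad for some data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): remove trailing whitespace
# and Unicode "format" characters (general category Cf, which includes
# U+200E LRM and U+200F RLM). Characters inside the string are untouched.
sub trim_trailing_invisible {
    my ($str) = @_;
    $str =~ s/[\s\p{Cf}]+\z//;
    return $str;
}

# Example link with a trailing LRM, as in the thread.
my $link  = "/wiki/File:Atelerix_algirus.jpg\x{200E}";
my $clean = trim_trailing_invisible($link);
printf "%d -> %d characters\n", length($link), length($clean);
```

Because the pattern is anchored with `\z`, an LRM in the middle of mixed-direction text is left alone, which matches the point that the character should not be stripped in general.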

      Give URI::Encode a try.
      #!/usr/bin/perl -l
      use strict;
      use warnings;
      use URI::Encode qw(uri_decode);

      # Percent-encoded link ending in %E2%80%8E (the UTF-8 bytes of U+200E, LRM)
      my $encoded = ' /wiki/File:Atelerix_algirus.jpg%E2%80%8E';
      print uri_decode($encoded);

        Yes, I've done that. It converts the %E2%80%8E sequence to the Unicode LRM character, which isn't printed but is still embedded in the string, causing problems when accessing the URL again. Thanks, though.
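Since decoding only turns the escape into an invisible character, one alternative is to drop the mark while the link is still percent-encoded. The sketch below assumes the only unwanted trailing marks are LRM (`%E2%80%8E`) and RLM (`%E2%80%8F`); the helper name `strip_encoded_trailing_marks` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove percent-encoded LRM (%E2%80%8E) or
# RLM (%E2%80%8F) sequences from the end of a still-encoded URL path.
# The /i flag accepts lower-case hex digits as well.
sub strip_encoded_trailing_marks {
    my ($encoded) = @_;
    $encoded =~ s/(?:%E2%80%8[EF])+\z//i;
    return $encoded;
}

my $encoded = '/wiki/File:Atelerix_algirus.jpg%E2%80%8E';
print strip_encoded_trailing_marks($encoded), "\n";
```

The result can then be decoded or requested as usual, with no invisible character left at the end.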
