Re^2: Remove unicode "whitespace"

by HYanWong (Acolyte)
on Feb 28, 2013 at 11:16 UTC

in reply to Re: Remove unicode "whitespace"
in thread Remove unicode "whitespace"

You're right that it is the LRM character, and so shouldn't be stripped in general (so it's sensible that it doesn't match \s). But it is useless at the end of a string, hence my suggestion that it should be treated as something like whitespace in that context. I hoped there might be a function that trims the end of a string for this specific purpose or, failing that, something generic I could add to a regex to strip Unicode characters of this nature.
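One generic option, sketched here as an assumption rather than an established idiom: LRM (U+200E) belongs to the Unicode general category Cf ("format"), so adding \p{Cf} to the character class trims it together with ordinary trailing whitespace. The helper name trim_trailing_invisible is made up for illustration, and the string must already be decoded into characters, not raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): removes trailing whitespace
# and Unicode "format" characters (general category Cf, which includes
# U+200E LRM and U+200F RLM). Characters inside the string are untouched.
sub trim_trailing_invisible {
    my ($text) = @_;
    $text =~ s/[\s\p{Cf}]+\z//;
    return $text;
}

my $name    = "Atelerix_algirus.jpg\x{200E}";
my $trimmed = trim_trailing_invisible($name);

# Print only the lengths, so no wide characters reach STDOUT.
printf "%d -> %d characters\n", length($name), length($trimmed);
```

Cf also covers characters such as the zero-width joiner, which can be meaningful in the middle of some scripts; anchoring the match with \z restricts the trim to the very end of the string.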

Re^3: Remove unicode "whitespace"
by Khen1950fx (Canon) on Feb 28, 2013 at 16:11 UTC
    Give URI::Encode a try.
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use URI::Encode qw(uri_decode);
    my $encoded = ' /wiki/File:Atelerix_algirus.jpg%E2%80%8E';
    print uri_decode($encoded);

      Yes, I've done that. It converts the %E2%80%8E string to the Unicode LRM character, which isn't printed but is still embedded in the string, causing problems when accessing the URL again. Thanks, though.
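A minimal sketch of one way around this, using core modules only: percent-decode the path, interpret the bytes as UTF-8, then trim trailing invisible characters before the string is used again. The helper name clean_path is hypothetical, and the manual percent-decoding stands in for uri_decode so that the example has no CPAN dependency. Re-escaping the result for use in a new request is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: returns the path as UTF-8 bytes with trailing
# whitespace and format characters (such as U+200E LRM) removed.
sub clean_path {
    my ($encoded) = @_;

    # Percent-decode into raw bytes, then interpret those bytes as UTF-8.
    (my $bytes = $encoded) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    my $chars = decode('UTF-8', $bytes);

    # Drop trailing whitespace and Unicode format (Cf) characters.
    $chars =~ s/[\s\p{Cf}]+\z//;

    return encode('UTF-8', $chars);
}

print clean_path('/wiki/File:Atelerix_algirus.jpg%E2%80%8E'), "\n";
```

The trim has to happen after decoding: while the path is still percent-encoded, the LRM is just the nine ASCII characters %E2%80%8E, which neither \s nor \p{Cf} will match.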

