http://www.perlmonks.org?node_id=964623


in reply to UTF8 URI Escaping

I then run the usual on it ... It should be simple!

There is a widespread assumption that this stuff is so easy you can bypass the standard libraries. That assumption only holds if you know the dozens of related RFCs inside and out, and if you do, you are going to lean on someone else's implementation anyway, because it will be functionally almost identical to anything you'd write.

Using either of the standard modules (CGI, URI::Escape) for this kind of thing would have saved you all that lost time and plenty more in the future.

perl -MURI::Escape -le 'print uri_unescape("%C2%A3")'
£

perl -MCGI=param -le 'print param("q")' "q=%C2%A3"
£

Update

Eliya rightly points out that I was missing the point. So, here's a bit more of an answer instead of a knee-jerk "use the CPAN." I am assuming the output is meant for the web, though this isn’t actually stated in the OP.

Plack is needed to run these examples, but it makes it super easy to try stuff, so:

Plain uri_unescape, and therefore the original code snippet, is fine if what you send as output is bytes that are already UTF-8 encoded, not decoded Perl strings. The response is fine because the body is undecoded bytes.

plackup -e 'use URI::Escape; sub { [200, ["Content-Type" => "text/html; charset=utf-8"], [ uri_unescape("%C2%A3") ]]}'
HTTP::Server::PSGI: Accepting connections at http://0:5000/
--
£

Now with decoding to a Perl character string. It doesn’t work because the output needs to be encoded back to bytes, and you’ll generally get errors or warnings to that effect.

plackup -e 'use Encode; use URI::Escape; sub { [200, ["Content-Type" => "text/html; charset=utf-8"], [ decode("UTF-8", uri_unescape("%C2%A3")) ]]}'
HTTP::Server::PSGI: Accepting connections at http://0:5000/
--
Error: Body must be bytes and should not contain wide characters (UTF-8 strings) at /usr/local/lib/perl5/site_perl/5.14.0/Plack/Middleware/Lint.pm line 153
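One way around that error is to decode on the way in, work on characters, and encode again just before the body is returned. The following is only a sketch under assumptions: the HTML fragment is made up for illustration, and the app is called directly as a plain code reference rather than through plackup.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use URI::Escape qw(uri_unescape);

# A PSGI application is just a code reference, so it can be called
# directly here without starting a server.
my $app = sub {
    my $env = shift;

    # Octets -> Perl character string (one character, U+00A3).
    my $text = decode("UTF-8", uri_unescape("%C2%A3"));

    # Work on characters here (regexes, length, and so on).
    # The markup is hypothetical, purely for illustration.
    my $html = "<p>Price: " . $text . "5</p>";

    # Characters -> octets again, because a PSGI body must be bytes.
    return [
        200,
        [ "Content-Type" => "text/html; charset=utf-8" ],
        [ encode("UTF-8", $html) ],
    ];
};

my $res  = $app->({});
my $body = $res->[2][0];
printf "status %d, body is %d bytes\n", $res->[0], length($body);
```

The decode/encode pair looks redundant for a pass-through like this, but it is what lets the code in between treat the pound sign as a single character.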

Now double encoded, just to see, because it seems to crop up a lot when mixing approaches.

plackup -e 'use Encode; use URI::Escape; sub { [200, ["Content-Type" => "text/html; charset=utf-8"], [ encode("UTF-8", uri_unescape("%C2%A3")) ]]}'
HTTP::Server::PSGI: Accepting connections at http://0:5000/
--
£
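To see what double encoding does at the byte level, it can help to dump the octets rather than trust a browser or terminal. This is a small sketch using only the core Encode module; hex_bytes is a helper invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Helper for this sketch only: render a byte string as space-separated hex.
sub hex_bytes { join " ", map { sprintf("%02x", ord($_)) } split(//, $_[0]) }

my $bytes = "\xC2\xA3";    # octets that are already the UTF-8 encoding of U+00A3

# Decode first, then encode: the round trip returns the same two octets.
my $once = encode("UTF-8", decode("UTF-8", $bytes));

# Encode without decoding first: each octet is treated as a Latin-1
# character and encoded separately, which is the double encoding.
my $twice = encode("UTF-8", $bytes);

print "original:      ", hex_bytes($bytes), "\n";   # c2 a3
print "decode+encode: ", hex_bytes($once),  "\n";   # c2 a3
print "encode only:   ", hex_bytes($twice), "\n";   # c3 82 c2 a3
```

The four octets c3 82 c2 a3 are what a UTF-8 client renders as "Â£", the usual symptom of this mistake.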

And improved/corrected versions of the CGI example. With the -utf8 argument, CGI will automatically decode parameters for you. This is what you want so you can deal with content correctly in regular expressions and such. It’s your responsibility to make sure the output handle is UTF-8 or that you encode to bytes. The character is right in Perl here but wrong for the output layer.

perl -MCGI=param,-utf8 -le 'print param("q")' "q=%C2%A3"
?

Using -CO to get a UTF-8 output layer, it works fine.

perl -CO -MCGI=param,-utf8 -le 'print param("q")' "q=%C2%A3"
£
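The effect of the output layer can be shown without a terminal by printing the same character to two in-memory handles, one with an encoding layer and one without. This is a sketch, not a description of what -CO does internally; in a script, binmode(STDOUT, ":encoding(UTF-8)") plays roughly the same role.

```perl
use strict;
use warnings;

my $pound = "\x{A3}";    # one character, U+00A3, as -utf8 decoding would produce

# No encoding layer: the handle writes the single byte 0xA3,
# which is not valid UTF-8 on its own.
open(my $raw, ">", \my $raw_out) or die "open: $!";
print {$raw} $pound;
close($raw);

# With an encoding layer the character is encoded to the two
# bytes 0xC2 0xA3 on the way out.
open(my $enc, ">:encoding(UTF-8)", \my $enc_out) or die "open: $!";
print {$enc} $pound;
close($enc);

printf "raw layer:   %s\n", unpack("H*", $raw_out);   # a3
printf "utf-8 layer: %s\n", unpack("H*", $enc_out);   # c2a3
```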

Or, encoding the decoded characters to UTF-8 bytes.

perl -MEncode -MCGI=param,-utf8 -le 'print encode("UTF-8", param("q"))' "q=%C2%A3"
£

Anyway, the first answers in the thread were, taken together, all quite thorough. This was just to have a little to play with and recant my grumpy and erroneous first stab.

Re^2: UTF8 URI Escaping
by Eliya (Vicar) on Apr 12, 2012 at 03:31 UTC
    ...would have saved you all that lost time and plenty more in the future.

    Don't overestimate what some modules do :)

    URI::Escape::uri_unescape() does exactly the same substitution the OP posted, i.e. it also has exactly the same issues.

    For one, it doesn't decode the UTF-8 encoded string into a single Unicode character. Rather, it just returns the two octets \xc2 \xa3 (which the OP seems to have some problem with...).  In other words, your sample would only work with a UTF-8 capable terminal, which is rendering the glyph '£' when it receives the two bytes c2 a3.

    $ perl -MURI::Escape -le 'print uri_unescape("%C2%A3")' | od -tx1
    0000000 c2 a3 0a
    0000003

    And, as Devel::Peek::Dump shows, the returned string isn't decoded (that is, it isn't a Perl Unicode string; note the absence of the UTF8 flag):

    $ perl -MURI::Escape -MDevel::Peek -e 'Dump uri_unescape("%C2%A3")'
    SV = PVMG(0x1762450) at 0x16ceff0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK)
      IV = 0
      NV = 0
      PV = 0x16e5c90 "\302\243"\0
      CUR = 2
      LEN = 8

    This is exactly the same result the OP had achieved with his original code.

      Yes, yes. Quite right. I was too glib there, though if the output layer had been right it would have been fine. I would also have the -utf8 flag in CGI, with proper encodings on the layers and in the HTTP headers, or decode the input manually, or put Unicode::Encoding in the Catalyst plugin list, or… The real point is that rolling your own can bite seasoned devs; it *will* bite neophytes and create maintenance nightmares for, well, me, because I seem to inherit an endless stream of code written in this style.

Re^2: UTF8 URI Escaping
by snoopy20 (Novice) on Apr 12, 2012 at 07:07 UTC
    I appreciate the responses and I'm trying my best to understand them. I also tried $v = uri_unescape($v) which still converts to: \xc2\xa3. So, can anyone give me a command like $v=xxx($v) using ANY module that will convert the above or the %C2%A3 to a pound sign?
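For what it's worth, a minimal sketch of the kind of one-liner being asked for, assuming $v holds the percent-escaped form and that the output handle is given a UTF-8 layer: unescape to octets, then decode the octets to a character.

```perl
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_unescape);

my $v = "%C2%A3";

# uri_unescape yields the two octets c2 a3; decode turns them
# into a single Perl character, U+00A3 (the pound sign).
$v = decode("UTF-8", uri_unescape($v));

printf "length=%d codepoint=U+%04X\n", length($v), ord($v);

# Encode characters on the way out so the terminal receives valid UTF-8.
binmode(STDOUT, ":encoding(UTF-8)");
print "$v\n";
```

Whether the pound sign then displays correctly still depends on the terminal or browser expecting UTF-8.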