
UTF8 URI Escaping

by snoopy20 (Novice)
on Apr 11, 2012 at 11:06 UTC
snoopy20 has asked for the wisdom of the Perl Monks concerning the following question:

This is driving me nuts. I have a form with special characters like a pound or euro sign. It's URI-encoded and Perl receives it into a variable, so for the pound sign:

    %C2%A3

Note that %C2 is actually not needed; I don't know why it's included. Anyway, I then run the usual on it:

    $f =~ s/%([a-fA-F0-9]{2})/pack('C', hex($1))/eg;

Which gives octets:

    \xc2\xa3

But how do I get this back to the pound sign? I have tried everything, including two hours of Google searching. It should be simple!

Replies are listed 'Best First'.
Re: UTF8 URI Escaping
by Eliya (Vicar) on Apr 11, 2012 at 11:24 UTC

    The two octet sequence c2 a3 is the UTF-8 encoding of the pound character, so for Perl to treat it as one single character, you need to decode it:

    use Encode;

    my $f = "%C2%A3";
    $f =~ s/%([a-fA-F0-9]{2})/pack('C', hex($1))/eg;
    my $decoded = decode("UTF-8", $f);

    And then, depending on what you want to do with the decoded string on the output side, you might want to encode it again — usually done via setting the appropriate PerlIO encoding layer for the respective file handle.
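The decode-on-input, encode-on-output round trip described above can be sketched as follows. This is a minimal illustration that uses only core modules; the helper name `percent_decode_utf8` is made up for this example and is not part of any library.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn %XX escapes into raw bytes, then decode
# those bytes as UTF-8 so Perl sees characters instead of octets.
sub percent_decode_utf8 {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return decode('UTF-8', $bytes);
}

my $pound = percent_decode_utf8('%C2%A3');

# Encode again on the way out by putting a PerlIO layer on the handle.
binmode STDOUT, ':encoding(UTF-8)';
printf "%s is %d character(s), code point U+%04X\n",
    $pound, length $pound, ord $pound;
```

Inside the program the value is a single character; the two bytes exist again only once the output layer encodes it.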

Re: UTF8 URI Escaping
by moritz (Cardinal) on Apr 11, 2012 at 11:26 UTC
      Ok, so anyone have a working example? For example if I add: my $v = decode("UTF-8", $v); In an attempt to decode the aforementioned then I _still_ get: \xc2\xa3
        ... I _still_ get: \xc2\xa3

        How do you tell?  Did you print it to somewhere, etc.?  What exactly do you finally want to achieve?

        In case the idea is to convert the pound sign to Latin-1 encoding, you'd need to encode the decoded string again:

        $v = encode('ISO-8859-1', $v);

        Here's a complete example (note the differences in the PV representations in the 3 dumps):

        #!/usr/bin/perl -w
        use strict;
        use Encode;
        use Devel::Peek;

        my $v = "%C2%A3";
        $v =~ s/%([a-fA-F0-9]{2})/pack('C', hex($1))/eg;
        Dump $v;

        my $decoded = decode("UTF-8", $v);
        Dump $decoded;

        my $encoded = encode("ISO-8859-1", $decoded);
        Dump $encoded;

        __END__
        SV = PVMG(0xf98470) at 0xf15d08
          REFCNT = 1
          FLAGS = (PADMY,SMG,POK,pPOK)
          IV = 0
          NV = 0
          PV = 0xf46af0 "\302\243"\0
          CUR = 2
          LEN = 8
          MAGIC = 0xf0f350
            MG_VIRTUAL = &PL_vtbl_mglob
            MG_TYPE = PERL_MAGIC_regex_global(g)
            MG_LEN = -1
        SV = PV(0xeee388) at 0xf16080
          REFCNT = 1
          FLAGS = (PADMY,POK,pPOK,UTF8)
          PV = 0xfe40c0 "\302\243"\0 [UTF8 "\x{a3}"]
          CUR = 2
          LEN = 8
        SV = PV(0xeee448) at 0xf16110
          REFCNT = 1
          FLAGS = (PADMY,POK,pPOK)
          PV = 0xfd44c0 "\243"\0
          CUR = 1
          LEN = 8
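If Devel::Peek output is hard to read, the same three states can be inspected at the Perl level with `length` and `ord`. A small sketch, assuming only the core Encode module; `hex_dump` is a hypothetical helper written for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each element of a string as a hex value.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $string;
}

my $raw     = "\xC2\xA3";                      # what the s///e substitution yields
my $decoded = decode('UTF-8', $raw);           # one character, U+00A3
my $latin1  = encode('ISO-8859-1', $decoded);  # one byte, 0xA3
my $utf8    = encode('UTF-8', $decoded);       # two bytes again

printf "raw:     %-5s (length %d)\n", hex_dump($raw),     length $raw;
printf "decoded: %-5s (length %d)\n", hex_dump($decoded), length $decoded;
printf "latin1:  %-5s (length %d)\n", hex_dump($latin1),  length $latin1;
printf "utf8:    %-5s (length %d)\n", hex_dump($utf8),    length $utf8;
```

Note that `$decoded` and `$latin1` look identical from Perl code (both are one element with value 0xA3, and they compare equal with `eq`). The difference that Dump shows is only in the internal representation.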
Re: UTF8 URI Escaping
by Corion (Pope) on Apr 11, 2012 at 11:16 UTC

    You will have to learn about encodings, and unicode. See perluniintro.

Re: UTF8 URI Escaping
by Your Mother (Bishop) on Apr 11, 2012 at 20:52 UTC
    I then run the usual on it ... It should be simple!

    There is a big assumption out there that this stuff is so easy that you can bypass the standard libraries. This assumption only holds if you know the dozens of related RFCs inside and out, and if you do, you are going to lean on someone else's implementation anyway, because it will be functionally almost identical to anything you'd write.

    Using either of the standards (CGI, URI::Escape) for this kind of thing would have saved you all that lost time and plenty more in the future.

    perl -MURI::Escape -le 'print uri_unescape("%C2%A3")'
    £
    perl -MCGI=param -le 'print param("q")' "q=%C2%A3"
    £


    Eliya rightly points out that I was missing the point. So, here's a bit more of an answer instead of a knee-jerk "use the CPAN." I am assuming the output is meant for the web, though this isn't actually stated in the OP.

    Plack is necessary for this but makes it super easy to try stuff, so…

    Plain uri_unescape, and therefore the original code snippet, is fine if what you send as output is bytes that are UTF-8, not Perl decoded strings. The response is fine because it's undecoded bytes.

    plackup -e 'use URI::Escape; sub { [200, ["Content-Type" => "text/html; charset=utf-8"], [ uri_unescape("%C2%A3") ]]}'
    HTTP::Server::PSGI: Accepting connections at http://0:5000/
    --
    £

    Now with decoding to Perl's utf-8. It doesn't work because the output needs to be encoded to bytes, and you'll generally get errors or warnings to that effect.

    plackup -e 'use Encode; use URI::Escape; sub { [200, ["Content-Type" => "text/html; charset=utf-8"], [ decode("UTF-8", uri_unescape("%C2%A3")) ]]}'
    HTTP::Server::PSGI: Accepting connections at http://0:5000/
    --
    Error: Body must be bytes and should not contain wide characters (UTF-8 strings) at /usr/local/lib/perl5/site_perl/5.14.0/Plack/Middleware/ line 153

    Now double encoded just to see because it seems to crop up a lot when mixing approaches.

    plackup -e 'use Encode; use URI::Escape; sub { [200, ["Content-Type" => "text/html; charset=utf-8"], [ encode("UTF-8", uri_unescape("%C2%A3")) ]]}'
    HTTP::Server::PSGI: Accepting connections at http://0:5000/
    --
    Â£
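    To see what double encoding does at the byte level without starting a server, something like the following can be run. It is a sketch using only the core Encode module; the variable names are chosen for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC2\xA3";   # UTF-8 bytes for the pound sign, as percent-decoding returns them

# Encoding bytes that are already UTF-8 treats each byte as a separate
# Latin-1 character (U+00C2, U+00A3) and encodes both of them again.
my $double = encode('UTF-8', $bytes);

# Decoding first and encoding afterwards is a clean round trip.
my $single = encode('UTF-8', decode('UTF-8', $bytes));

printf "double encoded: %s\n", unpack 'H*', $double;   # c382c2a3
printf "round trip:     %s\n", unpack 'H*', $single;   # c2a3
```

    The four bytes c3 82 c2 a3 are what a UTF-8 client renders as "Â£", which is the usual symptom of an extra encode step somewhere in the chain.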

    And improved/corrected versions of the CGI example. With the -utf8 arg, CGI will automatically decode things for you. This is what you want so you can deal with content correctly in regular expressions and such. It's your responsibility to make sure the output handle is UTF-8 or that you encode to bytes. The character is right in Perl here but wrong for the output layer.

    perl -MCGI=param,-utf8 -le 'print param("q")' "q=%C2%A3"
    ?

    Using -CO to get utf-8 on the output layer it works fine.

    perl -CO -MCGI=param,-utf8 -le 'print param("q")' "q=%C2%A3"
    £

    Or, encoding the utf-8 to bytes.

    perl -MEncode -MCGI=param,-utf8 -le 'print encode("UTF-8", param("q"))' "q=%C2%A3"
    £
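    The three CGI one-liners differ only in what reaches the output handle. A sketch that captures the bytes in an in-memory file instead of a terminal makes this visible; `bytes_written` is a hypothetical helper, and the decoded pound sign is written directly rather than taken from CGI.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: print a string to an in-memory file opened with
# the given PerlIO layer and return the bytes that were written, as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return unpack 'H*', $buffer;
}

my $pound = "\x{A3}";   # a decoded, one-character string

# No layer, no encode: the lone byte a3 is not valid UTF-8,
# so a UTF-8 terminal shows "?" or a replacement glyph.
print "plain handle:    ", bytes_written(':raw', $pound), "\n";

# Output layer does the encoding (roughly what -CO does for STDOUT).
print "encoding layer:  ", bytes_written(':encoding(UTF-8)', $pound), "\n";

# Explicit encode to bytes before printing to a plain handle.
print "explicit encode: ", bytes_written(':raw', encode('UTF-8', $pound)), "\n";
```

    Either the layer or the explicit encode is enough. Doing both would double encode.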

    Anyway, the first answers in the thread were, taken together, all quite thorough. This was just to have a little to play with and recant my grumpy and erroneous first stab.

      ...would have saved you all that lost time and plenty more in the future.

      Don't overestimate what some modules do :)

      URI::Escape::uri_unescape() does exactly the same substitution the OP posted, i.e. it also has exactly the same issues.

      For one, it doesn't decode the UTF-8 encoded string into a single Unicode character. Rather, it just returns the two octets \xc2 \xa3 (which the OP seems to have some problem with...).  In other words, your sample would only work with a UTF-8 capable terminal, which is rendering the glyph '£' when it receives the two bytes c2 a3.

      $ perl -MURI::Escape -le 'print uri_unescape("%C2%A3")' | od -tx1
      0000000 c2 a3 0a
      0000003

      And, as Devel::Peek::Dump shows, the returned string isn't decoded (a Perl Unicode string):

      $ perl -MURI::Escape -MDevel::Peek -e 'Dump uri_unescape("%C2%A3")'
      SV = PVMG(0x1762450) at 0x16ceff0
        REFCNT = 1
        FLAGS = (TEMP,POK,pPOK)
        IV = 0
        NV = 0
        PV = 0x16e5c90 "\302\243"\0
        CUR = 2
        LEN = 8

      This is exactly the same result the OP had achieved with his original code.

        Yes, yes. Quite right. Too glossy there, though if the output layer was right, it would have been fine, and I would also have the -utf8 flag in the CGI, with proper encodings on the layers and the HTTP headers, or decoding the input manually, or Encoding::Unicode in the Catalyst plugin list, or… The real point being: rolling your own can bite seasoned devs; it *will* bite neophytes and create maintenance nightmares for, well, me, because I seem to inherit an endless stream of code written in this style.

      I appreciate the responses and I'm trying my best to understand them. I also tried $v = uri_unescape($v) which still converts to: \xc2\xa3. So, can anyone give me a command like $v=xxx($v) using ANY module that will convert the above or the %C2%A3 to a pound sign?
Re: UTF8 URI Escaping
by Anonymous Monk on Apr 11, 2012 at 11:25 UTC
Re: UTF8 URI Escaping
by Khen1950fx (Canon) on Apr 11, 2012 at 18:08 UTC
    You could use URI::Escape:
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use URI::Escape;

    my $f = '%C2%A3';
    my $safe = uri_escape($f);
    my $str = uri_unescape($safe);
    print uri_unescape($str);
    Update: Added a test for utf8.
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use Encode;
    use URI::Escape;
    require Encode::Detect;

    my $f = '%C2%A3';
    my $safe = uri_escape($f);
    my $str = uri_unescape($safe);
    print my $data = uri_unescape($str);
    my $utf8 = decode( "Detect", $data );
    binmode STDOUT, ":encoding(utf8)";
    print "$utf8: Looks like utf8 to me";
Re: UTF8 URI Escaping
by ajinkyagadewar (Novice) on Apr 12, 2012 at 08:48 UTC
    When you talk about form, do you want to do the form submission using Mechanize or any module and unable to do that due to currency encoding?
      Not sure if you have tried URI::Escape..
      use URI::Escape;
      $str = uri_unescape("%c2 %a2");
      print "$str";
        I send that to the log with die and it says... \xc2 \xa2 If I send that to DBI::Mysql it saves it as: £
      I simply want to parse $ENV{QUERY_STRING} and convert any % encodings into the correct £, € or whatever (in Unicode). I would rather not rely on external libraries where possible, but right now I'll take anything that works. The closest I've got is the \x encodings, which is not enough. Whoops, I meant STDIN, but it's encoded the same anyway (/^application\/x-www-form-urlencoded/).
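      For the specific request above (parse form data without external libraries and end up with real characters), a core-only sketch might look like this. `parse_form` is a hypothetical name; it assumes the browser sent UTF-8, keeps only the last value for a repeated field, and leaves out the error handling a module such as CGI provides.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: parse an application/x-www-form-urlencoded
# string into a hash of decoded character strings.
sub parse_form {
    my ($query) = @_;
    my %param;
    for my $pair (split /[&;]/, $query) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        for ($name, $value) {
            $_ = '' unless defined $_;
            tr/+/ /;                              # "+" means space in form data
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;  # %XX escapes to raw bytes
            $_ = decode('UTF-8', $_);             # UTF-8 bytes to characters
        }
        $param{$name} = $value;
    }
    return \%param;
}

my $form = parse_form('price=%C2%A3+5&currency=%E2%82%AC');

binmode STDOUT, ':encoding(UTF-8)';
print "$_ = $form->{$_}\n" for sort keys %$form;
```

      The same sub works for a POST body read from STDIN, since that is encoded the same way. The values are character strings, so they still need an encoding layer or an explicit encode when they are printed, logged or stored.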
Re: UTF8 URI Escaping
by ajinkyagadewar (Novice) on Apr 12, 2012 at 13:33 UTC
    To remove that accented character you can use Text::Unaccent
      That does not solve the problem. For example the EURO sign is displayed entirely differently as three encodings (%xx) and I might later need languages with accents. I do not understand why this is so hard to figure out, or why nothing works. (further head banging to follow)...
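      The differing number of %xx groups follows from UTF-8 itself: each character is first encoded to one or more bytes, and every byte is then escaped on its own. Going in the encoding direction shows this. A sketch with a hypothetical `percent_encode_utf8` helper, using only the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encode a character string, then escape
# every byte that is not an unreserved URI character.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/eg;
    return $bytes;
}

for my $char ("\x{A3}", "\x{20AC}", "\x{E9}") {   # pound, euro, e-acute
    printf "U+%04X -> %s\n", ord $char, percent_encode_utf8($char);
}
```

      The pound sign (U+00A3) needs two bytes in UTF-8, hence %C2%A3, while the euro sign (U+20AC) needs three, hence %E2%82%AC. Decoding the bytes as UTF-8 after unescaping handles both cases the same way.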
        I forgot to close the question, I've sorted it. It was a mix of some settings in Perl, ensuring MySQL is in UTF mode and ensuring 'use utf8' is in ALL perl files. Writeup is here.

Node Type: perlquestion [id://964503]
Approved by marto
Front-paged by MidLifeXis