gibus has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks
I've stumbled on a strange behaviour with hash keys, which happens on every Perl version I could test, from 5.16 to 5.26.
It was asked some years ago on Stack Overflow, but without any answer as to whether it is an optimization bug or expected behaviour.
The issue is that if you initialize a hash with a key containing non-ASCII characters (e.g. ones that fit in ISO-8859-1), the key is properly encoded in UTF8 (with the UTF8 flag on). But if you then assign a value to the hash element corresponding to this key, the key is downgraded (probably encoded in ISO-8859-1). You can imagine the consequences if you have to do some processing on this key, expecting it to be UTF8 encoded…
Here's a script showing the issue:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Devel::Peek;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;

    my %hash = (
        'clé' => 0,
    );
    my $key = (keys %hash)[0];
    Dump($key);
    print Dumper($key);

    $hash{'clé'} = 1;
    $key = (keys %hash)[0];
    Dump($key);
    print Dumper($key);

    utf8::upgrade($key);
    Dump($key);
    print Dumper($key);
with the following output:
    SV = PV(0x555ed17dfe60) at 0x555ed1809710
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x555ed1993ed0 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
      CUR = 4
      LEN = 5
    $VAR1 = "cl\x{e9}";
    SV = PV(0x555ed17dfe60) at 0x555ed1809710
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x555ed1909b10 "cl\351"
      CUR = 3
      LEN = 0
    $VAR1 = "cl\351";
    SV = PV(0x555ed17dfe60) at 0x555ed1809710
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x555ed1825350 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
      CUR = 4
      LEN = 10
    $VAR1 = "cl\x{e9}";
As this code shows, the issue can be solved by upgrading the key to UTF8. But I would never have thought of doing that before stumbling on this issue, and I've never read anything in perldoc explaining this behaviour. Do you think it's expected for some reason? Thanks!
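For anyone wanting to probe this further, here is a minimal sketch that checks the key without Devel::Peek. It relies only on `utf8::is_utf8`, `utf8::upgrade` and `Encode::encode`; the variable names are made up for illustration. It assumes the source file is saved as UTF-8, and it does not settle whether the flag change is intended. It only shows that the key compares equal character for character either way, and that an explicit `encode` yields the same UTF-8 bytes whatever the internal flag says.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # the literals below are UTF-8 in the source file
use Encode qw(encode);

my %hash = ( 'clé' => 0 );
$hash{'clé'} = 1;              # the assignment after which the flag was seen to change

my $key = (keys %hash)[0];

# String comparison and length work on characters, not on internal storage.
my $same_chars = ( $key eq 'clé' && length($key) == 3 ) ? 1 : 0;

# Internal storage flag of the returned key; no particular value is assumed here.
my $flag_before = utf8::is_utf8($key) ? 1 : 0;

# Explicit encoding gives UTF-8 bytes regardless of the flag.
my $bytes    = encode( 'UTF-8', $key );
my $bytes_ok = ( $bytes eq "cl\xC3\xA9" ) ? 1 : 0;

# Upgrading a copy changes the storage, not the string's value.
my $copy = $key;
utf8::upgrade($copy);
my $still_equal = ( $copy eq $key ) ? 1 : 0;
my $flag_after  = utf8::is_utf8($copy) ? 1 : 0;

printf "same_chars=%d flag_before=%d bytes_ok=%d still_equal=%d flag_after=%d\n",
    $same_chars, $flag_before, $bytes_ok, $still_equal, $flag_after;
```

If the processing downstream needs UTF-8 bytes (writing to a file or socket, say), encoding explicitly at that point is probably less fragile than depending on the flag of whatever `keys` returns.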
Re: UTF8 hash key downgraded when assigned
by ikegami (Patriarch) on Dec 01, 2018 at 04:11 UTC
Re: UTF8 hash key downgraded when assigned
by syphilis (Archbishop) on Dec 01, 2018 at 00:57 UTC
by choroba (Cardinal) on Dec 01, 2018 at 01:07 UTC
by syphilis (Archbishop) on Dec 01, 2018 at 01:44 UTC
by ikegami (Patriarch) on Dec 01, 2018 at 04:01 UTC
by syphilis (Archbishop) on Dec 01, 2018 at 05:18 UTC
Re: UTF8 hash key downgraded when assigned ( use utf8; )
by beech (Parson) on Dec 01, 2018 at 02:33 UTC
Re: UTF8 hash key downgraded when assigned
by 1nickt (Canon) on Dec 01, 2018 at 02:42 UTC
Re: UTF8 hash key downgraded when assigned
by gibus (Acolyte) on Dec 01, 2018 at 10:47 UTC