http://www.perlmonks.org?node_id=1226566

gibus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I've stumbled on a strange behaviour with hash keys, happening on every Perl version I could test from 5.16 to 5.26

It was asked some years ago on Stack Overflow, but without any answer as to whether it is an optimization bug or expected behaviour.

The issue is that if you initialize a hash with a key containing non-ASCII characters (e.g. characters from ISO-8859-1), the key is properly encoded in UTF8 (with the UTF8 flag on). But if you then assign a value to the hash element corresponding to this key, the key is downgraded (probably encoded in ISO-8859-1). You can imagine the consequences if you have to do some processing on this key, expecting it to be UTF8 encoded…

Here's a script showing the issue:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Devel::Peek;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my %hash = ( 'clé' => 0, );
my $key = (keys %hash)[0];
Dump($key);
print Dumper($key);

$hash{'clé'} = 1;
$key = (keys %hash)[0];
Dump($key);
print Dumper($key);

utf8::upgrade($key);
Dump($key);
print Dumper($key);

with the following output:

SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x555ed1993ed0 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
  CUR = 4
  LEN = 5
$VAR1 = "cl\x{e9}";
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x555ed1909b10 "cl\351"
  CUR = 3
  LEN = 0
$VAR1 = "cl\351";
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x555ed1825350 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
  CUR = 4
  LEN = 10
$VAR1 = "cl\x{e9}";

As this code shows, the issue can be solved by upgrading the key to UTF8. But I would never have thought I should do that before stumbling on this issue. I've never read anything in perldoc explaining this behaviour. Do you think it's expected for some reason? Thanks!

Replies are listed 'Best First'.
Re: UTF8 hash key downgraded when assigned
by ikegami (Patriarch) on Dec 01, 2018 at 04:11 UTC

    Just like Perl is free to store numbers in the format of its choice (IV, UV, NV, PV), Perl is free to store strings in the format of its choice (UTF8=0, UTF8=1). It's also free to change the storage format used at any time, but that's not what's happening here. Consider the following snippet:

    use utf8;
    utf8::upgrade( my $key_u = "clé" );
    utf8::downgrade( my $key_d = "clé" );
    my %hash;
    ++$hash{$key_u};
    ++$hash{$key_d};
    my ($key) = keys(%hash);

    As this shows, keys is perfectly justified in returning the string in any format. If your code assigns semantics to the UTF8 flag, it's buggy. (It is said "to suffer from The Unicode Bug".)
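A minimal sketch of the same point, assuming only core Perl and the core Encode module: the two variables differ in internal storage, yet compare equal, collapse into one hash key, and encode to identical UTF-8 bytes. The names `$bytes_u` and `$bytes_d` are illustrative, and the literal is written as `"cl\x{e9}"` so the example does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same three characters, stored two different ways internally.
my $key_u = "cl\x{e9}";
utf8::upgrade($key_u);      # internal UTF8 flag on
my $key_d = "cl\x{e9}";
utf8::downgrade($key_d);    # internal UTF8 flag off

# Both strings address the same hash element.
my %hash;
++$hash{$key_u};
++$hash{$key_d};

my $same_string = $key_u eq $key_d ? 1 : 0;
my $key_count   = scalar keys %hash;

# Encoding explicitly yields the same UTF-8 bytes either way.
my $bytes_u = encode('UTF-8', $key_u);
my $bytes_d = encode('UTF-8', $key_d);
my $same_bytes = $bytes_u eq $bytes_d ? 1 : 0;

printf "eq: %d, keys: %d, same bytes: %d, byte length: %d\n",
    $same_string, $key_count, $same_bytes, length $bytes_u;
# prints "eq: 1, keys: 1, same bytes: 1, byte length: 4"
```

If code needs UTF-8 bytes (for output, hashing, or a wire format), encoding explicitly like this avoids relying on whichever storage format keys happens to return.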

Re: UTF8 hash key downgraded when assigned
by syphilis (Archbishop) on Dec 01, 2018 at 00:57 UTC
    Hi gibus,
    The code you posted won't even compile for me (on Windows):
    Malformed UTF-8 character: \xe9\x27\x20 (unexpected non-continuation byte 0x27,immediately after start byte 0xe9; need 3 bytes, got 1) at try.pl line 11.
    Malformed UTF-8 character (fatal) at try.pl line 11.
    Seems to work fine for me if I rewrite your script as:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Devel::Peek;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;

    my $s;
    {
        no utf8;
        $s = 'clé';
    }
    utf8::upgrade($s);

    my %hash = ( $s => 0, );
    my $key = (keys %hash)[0];
    Dump($key);
    print Dumper($key);

    $hash{$s} = 1;
    $key = (keys %hash)[0];
    Dump($key);
    print Dumper($key);

    utf8::upgrade($key); # does nothing
    Dump($key);
    print Dumper($key);
    UPDATE: When I initially posted this rewritten version, my utf8::upgrade($s); was done inside the no utf8{} block - which is rather counter-intuitive, to say the least.
    So I've subsequently moved it outside the no utf8{} block.

    UPDATE 2: The output of my modified script:
    SV = PV(0x84c2a8) at 0x373100
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x247ede8 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
      CUR = 4
      LEN = 5
    $VAR1 = "cl\x{e9}";
    SV = PV(0x84c2a8) at 0x373100
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x247f4d8 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
      CUR = 4
      LEN = 5
    $VAR1 = "cl\x{e9}";
    SV = PV(0x84c2a8) at 0x373100
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x247f4d8 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
      CUR = 4
      LEN = 5
    $VAR1 = "cl\x{e9}";

    HTH.

    Cheers,
    Rob
      Have you saved it as UTF-8? \xe9\x27\x20 seems to be cp1252 for é'.
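A rough way to check this reading of the error message, assuming the core Encode module: the three bytes quoted in the message are not well-formed UTF-8, but decoded as cp1252 they are é, an apostrophe and a space. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xe9\x27\x20";   # the bytes quoted in the error message

# utf8::decode returns false when its argument is not well-formed UTF-8.
my $copy = $bytes;
my $is_valid_utf8 = utf8::decode($copy) ? 1 : 0;

# Read as cp1252 instead, the same bytes are three ordinary characters.
my $chars = decode('cp1252', $bytes);
my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $chars;

print "valid UTF-8: $is_valid_utf8\n";   # prints "valid UTF-8: 0"
print "as cp1252: @codepoints\n";        # prints "as cp1252: U+00E9 U+0027 U+0020"
```

This is consistent with a source file saved in a single-byte encoding being read under use utf8.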

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        Have you saved it as UTF-8?

        No ... and can't immediately find a way of doing so on this Windows machine.
        Is that the reason the script, as posted by the OP, failed to compile for me ?

        I thought that my script might have been relevant, since its output matched the output the OP expected.
        But if it's not relevant then please let me know (and I'll mark it so).

        Cheers,
        Rob
Re: UTF8 hash key downgraded when assigned ( use utf8; )
by beech (Parson) on Dec 01, 2018 at 02:33 UTC

    Hi

    I don't understand the problem. Why are you looking at the UTF8 flag?

    It appears to be something to do with use utf8;

    No UTF8 flag for \xE9, not before nor after

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump qw/ dd /;
    use Devel::Peek qw/ Dump /;

    my %f = ( qq{cl\xE9}, qq{cl\xE9} );
    Dump($_) for keys %f;

    $f{qq{cl\xE9}} = qq{cl\xE9};
    $f{qq{\x{2665} \N{U+1F42A}}} = qq{\x{2665} \N{U+1F42A}};
    dd \%f;
    dd(keys %f);
    Dump($_) for keys %f;
    __END__
    SV = PV(0x3f7a8c) at 0x3f9b74
      REFCNT = 2
      FLAGS = (POK,FAKE,READONLY,pPOK)
      PV = 0x9a8ad0 "cl\351"
      CUR = 3
      LEN = 0
    { "cl\xE9" => "cl\xE9", "\x{2665} \x{1F42A}" => "\x{2665} \x{1F42A}" }
    ("\x{2665} \x{1F42A}", "cl\xE9")
    SV = PV(0xb98cc4) at 0x3f9b84
      REFCNT = 2
      FLAGS = (POK,FAKE,READONLY,pPOK,UTF8)
      PV = 0xa78390 "\342\231\245 \360\237\220\252" [UTF8 "\x{2665} \x{1f42a}"]
      CUR = 8
      LEN = 0
    SV = PV(0x3f7ac4) at 0x99b9c4
      REFCNT = 2
      FLAGS = (POK,FAKE,READONLY,pPOK)
      PV = 0x9a8ad0 "cl\351"
      CUR = 3
      LEN = 0

    But utf8.pm doesn't like it

    $ perl -le "use Data::Dump; use utf8; dd qq{é}"
    Malformed UTF-8 character (1 byte, need 3, after start byte 0xe9) at -e line 1.
    "\0"
    $ perl -le "use Data::Dump; dd qq{é}"
    "\xE9"
Re: UTF8 hash key downgraded when assigned
by 1nickt (Canon) on Dec 01, 2018 at 02:42 UTC

    Hi, I am not seeing the same behaviour as you. (Also, keep in mind that almost never does one have to muck around with or even think about Perl's internal flag, which has little to do with usage of the string.) I'm dumping as you did, but note the existing test in Test::utf8.


    use Test::Most tests => 6;
    use Test::utf8;
    use utf8;
    binmode(STDOUT, ':utf8');
    use Devel::Peek;
    
    for my $str ('clé', '键') {
        is_flagged_utf8($str);
        Dump $str;
    
        my %hash    = ($str => 0);
        $hash{$str} = 1;
        (my $key)   = keys %hash;
    
        is_flagged_utf8($key);
        Dump $key;
    
        $key =~ s/(?:clé|键)/ключ/;
    
        is_flagged_utf8($key);
        Dump $key;
    
        print "$key\n";
    }
    
    __END__
    

    Output (square brackets turned into links, but all the better to highlight the relevant lines in the dumps):
    $ prove -lrv 1226566.pl
    1226566.pl .. 
    1..6
    ok 1 - flagged as utf8
    ok 2 - flagged as utf8
    ok 3 - flagged as utf8
    ключ
    ok 4 - flagged as utf8
    ok 5 - flagged as utf8
    ok 6 - flagged as utf8
    ключ
    SV = PV(0x556a75a9e160) at 0x556a75ac3fe8
      REFCNT = 2
      FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK,UTF8)
      PV = 0x556a7667b370 "cl\303\251"\0 UTF8 "cl\x{e9}"
      CUR = 4
      LEN = 10
      COW_REFCNT = 0
    SV = PV(0x556a765f31d0) at 0x556a7652d1a8
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x556a76698180 "cl\303\251"\0 UTF8 "cl\x{e9}"
      CUR = 4
      LEN = 5
    SV = PV(0x556a765f31d0) at 0x556a7652d1a8
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x556a764c8b30 "\320\272\320\273\321\216\321\207"\0 UTF8 "\x{43a}\x{43b}\x{44e}\x{447}"
      CUR = 8
      LEN = 16
    SV = PV(0x556a763289f0) at 0x556a75ac3f28
      REFCNT = 2
      FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK,UTF8)
      PV = 0x556a765197e0 "\351\224\256"\0 UTF8 "\x{952e}"
      CUR = 3
      LEN = 10
      COW_REFCNT = 0
    SV = PV(0x556a765f31d0) at 0x556a7652d1a8
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556a75ac7a80 "\351\224\256" UTF8 "\x{952e}"
      CUR = 3
      LEN = 0
    SV = PV(0x556a765f31d0) at 0x556a7652d1a8
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x556a7666eec0 "\320\272\320\273\321\216\321\207"\0 UTF8 "\x{43a}\x{43b}\x{44e}\x{447}"
      CUR = 8
      LEN = 16
    ok
    All tests successful.
    Files=1, Tests=6,  0 wallclock secs ( 0.01 usr  0.00 sys +  0.06 cusr  0.00 csys =  0.07 CPU)
    Result: PASS
    

    Hope this helps!


    The way forward always starts with a minimal test.
Re: UTF8 hash key downgraded when assigned
by gibus (Acolyte) on Dec 01, 2018 at 10:47 UTC

    Thanks everybody for your replies. I should add that the issue arises when using a constant key, i.e. $hash{'clé'}, not when the hash key is stored in a variable, i.e. $hash{$key}.

    Actually, a fellow French Monger has pointed me to this monks' post which confirms it is a bug in the optimisation of constant hash keys.
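A small sketch for comparing the two cases side by side, assuming the file is saved as UTF-8; the names %by_constant, %by_variable and $name are illustrative. The flag values it prints may differ between Perl versions, so no expected output is shown. Whatever the flags turn out to be, the keys should still compare equal as strings.

```perl
use strict;
use warnings;
use utf8;    # this source file must be saved as UTF-8

# Case 1: initialise with a constant key, then assign through the same constant.
my %by_constant = ( 'clé' => 0 );
my $flag_before = utf8::is_utf8( (keys %by_constant)[0] ) ? 1 : 0;
$by_constant{'clé'} = 1;
my ($const_key) = keys %by_constant;
my $flag_after = utf8::is_utf8($const_key) ? 1 : 0;

# Case 2: the same steps, but the key lives in a variable.
my $name = 'clé';
my %by_variable = ( $name => 0 );
$by_variable{$name} = 1;
my ($var_key) = keys %by_variable;
my $flag_var = utf8::is_utf8($var_key) ? 1 : 0;

printf "constant key: UTF8 flag %d before assignment, %d after\n",
    $flag_before, $flag_after;
printf "variable key: UTF8 flag %d after assignment\n", $flag_var;
printf "keys equal as strings: %s\n", $const_key eq $var_key ? 'yes' : 'no';
```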