Re: Unicode combining characters as hash keys?

by ikegami (Pope)
on Sep 02, 2011 at 21:36 UTC


in reply to Unicode combining characters as hash keys?

Hashes have no problem with combining marks.

$ perl -MData::Dumper -E'
    $Data::Dumper::Useqq = 1;
    $_ = "e\x{301}";
    $h{$_} = 1;
    say $h{$_} || 0;
'
1

The problem is probably that the character appears both in composed and decomposed form.

$ perl -E'
    $Data::Dumper::Useqq = 1;
    $a = "e\x{301}";
    $b = "\xE9";
    $h{$a} = 1;
    say $h{$b} || 0;
'
0

You can use Unicode::Normalize's NFC or NFD to normalize the form. (Doesn't matter which, as long as you're consistent.)

$ perl -MUnicode::Normalize=NFC -E'
    $Data::Dumper::Useqq = 1;
    $a = NFC("e\x{301}");
    $b = NFC("\xE9");
    $h{$a} = 1;
    say $h{$b} || 0;
'
1
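
If it helps, the same idea can be wrapped so that every store and every lookup goes through the same normalization. This is only a sketch: `store_key` and `fetch_key` are hypothetical helper names, not anything from your script, and NFC is an arbitrary pick (NFD would do just as well, as long as both sides agree).

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

my %h;

# Hypothetical helpers: normalize the key on the way in and on the way out,
# so composed and decomposed spellings land on the same hash slot.
sub store_key {
    my ($key, $value) = @_;
    $h{ NFC($key) } = $value;
    return;
}

sub fetch_key {
    my ($key) = @_;
    return $h{ NFC($key) };
}

store_key("e\x{301}", 1);             # decomposed: e + COMBINING ACUTE ACCENT
my $found = fetch_key("\xE9") // 0;   # composed: LATIN SMALL LETTER E WITH ACUTE
print "$found\n";
```

The point of funnelling everything through two subs is that there is no code path left that touches the hash with an unnormalized key.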

I'm kinda guessing here since you used 20 lines of code to describe a problem that could be described with 2.


Re^2: Unicode combining characters as hash keys?
by Anonymous Monk on Sep 03, 2011 at 08:11 UTC

    > Hashes have no problem with combining marks.

    Thanks for the confirmation, I thought that was how I understood the documentation, but I wasn't sure.

    > The problem is probably that the character appears both in composed and decomposed form.

    If they do, I don't understand where it comes from. I entered them both in composed form. Note the two "a"-like things are different, but you only see it if you use a font that makes a difference between "small alpha" and "a".

    > You can use Unicode::Normalize's NFC or NFD to normalize the form.

    Thanks. I read up on normalization and tried replacing
    my $key = $1;
    with
    my $key = NFD($1); # or NFC, or even NFKD/NFKC
    as well as replacing the last for-loop with
    foreach my $letter (@letters) {
        my $norm_letter = NFD($letter);
        my @features = @{ $hash{$norm_letter} };
        print join " ", @features;
        print "\n";
    }

    It still doesn't work though. The error message complains about not finding $nfd_letter in the hash, although the for-loop I commented out in the original script definitely shows it was added:
    Can't use an undefined value as an ARRAY reference at ./test.plx line 31, <INPUTFILE> line 2.

    The script is thus now (including the changes suggested by Jim, too):

    use Unicode::Normalize;

    binmode(STDOUT, ':encoding(UTF-8)');

    open HASH, '<:encoding(UTF-8)', 'test_hash.txt';
    my %hash = ();
    while (my $line=<HASH>) {
        chomp $line;
        $line =~ s/^(.*?)\t//;
        my $key = NFD($1);
        my @line = split /\s+/, $line;
        $hash{$key} = \@line;
    }

    # foreach my $phoneme (keys %hash) {
    #     print $phoneme . ":";
    #     my @line = @{ $hash{$phoneme} };
    #     print join ",", @line;
    #     print "\n";
    # }

    open INPUTFILE, '<:encoding(UTF-8)', 'test_input.txt';
    while (my $entry = <INPUTFILE>) {
        chomp $entry;
        print $entry . "\n";
        my @letters = $entry =~ /(\X)/g;
        foreach my $letter (@letters) {
            my $norm_letter = NFD($letter);
            my @features = @{ $hash{$norm_letter} };
            print join " ", @features;
            print "\n";
        }
    }
      > It still doesn't work though. The error message complains about not finding $nfd_letter in the hash…

      Hmm. Very odd. There is no variable named $nfd_letter in your script.

      This complete, self-contained, pared-down script…

      #!perl

      use strict;
      use warnings;

      use charnames qw( :full );
      use Unicode::Normalize;

      binmode STDOUT, ':encoding(UTF-8)';

      # Lookup table of IPA properties by IPA symbol
      my %properties_by = (
          d => [ qw( 0 0 ) ],
          a => [ qw( 0 1 ) ],
          NFD("\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}") => [ qw( 1 0 ) ],
          s => [ qw( 1 1 ) ],
      );

      # List of IPA phonemes
      my @phonemes = (
          "das",
          "d\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}s"
      );

      # For each IPA phoneme...
      for my $phoneme (map { NFD($_) } @phonemes) {

          # ...examine each IPA symbol in it...
          for my $symbol ($phoneme =~ m/(\X)/g) {

              # ...and look up each symbol's IPA properties...
              my $properties = join ", ", @{ $properties_by{$symbol} };

              print "$phoneme\t$symbol\t$properties\n";
          }
      }

      exit 0;

      …produces this output…

      das	d	0, 0
      das	a	0, 1
      das	s	1, 1
      dɑ̃s	d	0, 0
      dɑ̃s	ɑ̃	1, 0
      dɑ̃s	s	1, 1
      

      By the way, there is no precomposed Unicode character that is the Latin small letter alpha with a tilde. So all the Unicode normalizations of "\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}" are exactly the same—and exactly that.
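
A quick way to see this for yourself is to compare the normal forms directly. The following is just an illustrative sketch (the variable names are mine); it contrasts alpha plus tilde, which has no precomposed form, with e plus acute, which does.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $alpha_tilde = "\x{251}\x{303}";  # LATIN SMALL LETTER ALPHA + COMBINING TILDE
my $e_acute     = "e\x{301}";        # e + COMBINING ACUTE ACCENT

# No precomposed character exists for alpha+tilde, so NFC cannot merge the pair.
printf "alpha+tilde: NFC eq NFD? %s\n",
    NFC($alpha_tilde) eq NFD($alpha_tilde) ? "yes" : "no";

# e+acute composes to U+00E9 under NFC, so the two forms differ.
printf "e+acute:     NFC eq NFD? %s\n",
    NFC($e_acute) eq NFD($e_acute) ? "yes" : "no";

printf "alpha+tilde length after NFC: %d\n", length NFC($alpha_tilde);
```

So for this particular symbol, normalization alone cannot be what makes the lookup succeed or fail.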

      > It still doesn't work though.

      Ok, so start by finding out what the difference is between the two keys. Perhaps using Data::Dumper. (Set $Data::Dumper::Useqq = 1; first.)

      Once you find the difference, we can discuss how to address the difference.
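
Something along these lines would show the difference. It is a sketch with made-up stand-in values: `$stored_key` and `$lookup_key` are hypothetical, and in your script they would be a key from `keys %hash` and the `$norm_letter` that fails. The plain "a" versus alpha mix-up here is only an assumed example of what might turn up.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show non-ASCII characters as escapes
$Data::Dumper::Terse = 1;   # print the bare value without "$VAR1 = "

# Hypothetical stand-ins for a key stored in the hash and a key being looked up.
my $stored_key = "\x{251}\x{303}";  # LATIN SMALL LETTER ALPHA + COMBINING TILDE
my $lookup_key = "a\x{303}";        # plain "a" + COMBINING TILDE

print "stored: ", Dumper($stored_key);
print "lookup: ", Dumper($lookup_key);
print "same?   ", ($stored_key eq $lookup_key ? "yes" : "no"), "\n";
```

Two strings that look identical on screen will show different escapes here if they really are different code points.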
