PerlMonks  

Unicode combining characters as hash keys?

by Anonymous Monk
on Sep 02, 2011 at 20:56 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been reading a lot these days about unicode because I need to work with characters of the IPA (International Phonetic Alphabet), including diacritics.

Figuring things out in Perl has been quite easy thanks to the good documentation (unlike getting my xterm to display unicode... *gr*), but I'm stuck on one last detail.

In the following minimal example, I want to look things up in a hash, and in some cases the hash key is one of those "combining characters" (a "small alpha" with a tilde on top of it). The script works fine for all other keys, but breaks when a combining character is used as the key. I don't know what to do here. Can anyone help me?

file "test_hash.txt":

d	0 0
a	0 1
ɑ̃	1 0
s	1 1

file "test_input.txt":

das
dɑ̃s

the script itself:

#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");

open HASH, '<:utf8', "test_hash.txt";
my %hash = ();
while (my $line=<HASH>) {
    chomp $line;
    $line =~ s/^(.*?)\t//;
    my $key = $1;
    my @line = split /\s+/, $line;
    $hash{$key} = \@line;
}

# foreach my $phoneme (keys %hash) {
#     print $phoneme . ":";
#     my @line = @{ $hash{$phoneme} };
#     print join ",", @line;
#     print "\n";
# }

open INPUTFILE, '<:utf8', "test_input.txt";
while (my $entry = <INPUTFILE>) {
    chomp $entry;
    print $entry . "\n";
    my @word = $entry =~ /(\X)/g;
    for (my $j=0; $j<@word; $j++) {
        my $letter = $word[$j];
        my $input_features = $hash{$letter};
        my @input_features = @{ $input_features };
        print join " ", @input_features;
        print "\n";
    }
}

Re: Unicode combining characters as hash keys?
by ikegami (Pope) on Sep 02, 2011 at 21:36 UTC

    Hashes have no problem with combining marks.

    $ perl -MData::Dumper -E'
        $Data::Dumper::Useqq = 1;
        $_ = "e\x{301}";
        $h{$_} = 1;
        say $h{$_} || 0;
    '
    1

    The problem is probably that the character appears both in composed and decomposed form.

    $ perl -E'
        $Data::Dumper::Useqq = 1;
        $a = "e\x{301}";
        $b = "\xE9";
        $h{$a} = 1;
        say $h{$b} || 0;
    '
    0

    You can use Unicode::Normalize's NFC or NFD to normalize the form. (Doesn't matter which, as long as you're consistent.)

    $ perl -MUnicode::Normalize=NFC -E'
        $Data::Dumper::Useqq = 1;
        $a = NFC("e\x{301}");
        $b = NFC("\xE9");
        $h{$a} = 1;
        say $h{$b} || 0;
    '
    1

    I'm kinda guessing here since you used 20 lines of code to describe a problem that could be described with 2.
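
To make the "be consistent" advice concrete, here is a minimal sketch that normalizes every key both when storing and when looking up. The helper names `store_features` and `fetch_features` are hypothetical (they do not appear anywhere in the thread), and NFC is an arbitrary choice; NFD would work the same way.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helpers (not from the thread): normalize each key on the
# way in and on the way out, so both spellings land on the same entry.
my %features;

sub store_features {
    my ($key, $values) = @_;
    $features{ NFC($key) } = $values;
    return;
}

sub fetch_features {
    my ($key) = @_;
    return $features{ NFC($key) };
}

store_features("e\x{301}", [1, 0]);    # decomposed: e + COMBINING ACUTE ACCENT
my $found = fetch_features("\xE9");    # precomposed: LATIN SMALL LETTER E WITH ACUTE

print defined $found ? "found: @$found\n" : "not found\n";
```

Funnelling all hash access through two small subs keeps the normalization in one place, so a stray un-normalized lookup is harder to introduce later.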

      > Hashes have no problem with combining marks.

      Thanks for the confirmation, I thought that was how I understood the documentation, but I wasn't sure.

      > The problem is probably that the character appears both in composed and decomposed form.

      If they do, I don't understand where it comes from. I entered them both in composed form. Note the two "a"-like things are different, but you only see it if you use a font that makes a difference between "small alpha" and "a".

      > You can use Unicode::Normalize's NFC or NFD to normalize the form.

      Thanks. I read up on normalization and tried replacing
      my $key = $1;
      with
      my $key = NFD($1); # or NFC, or even NFKD/NFKC
      as well as replacing the last for-loop with
      foreach my $letter (@letters) {
          my $norm_letter = NFD($letter);
          my @features = @{ $hash{$norm_letter} };
          print join " ", @features;
          print "\n";
      }

      It still doesn't work though. The error message complains about not finding $nfd_letter in the hash, although the for-loop I commented out in the original script definitely shows it was added:
      Can't use an undefined value as an ARRAY reference at ./test.plx line 31, <INPUTFILE> line 2.

      The script is thus now (including the changes suggested by Jim, too):

      use Unicode::Normalize;

      binmode(STDOUT, ':encoding(UTF-8)');

      open HASH, '<:encoding(UTF-8)', 'test_hash.txt';
      my %hash = ();
      while (my $line=<HASH>) {
          chomp $line;
          $line =~ s/^(.*?)\t//;
          my $key = NFD($1);
          my @line = split /\s+/, $line;
          $hash{$key} = \@line;
      }

      # foreach my $phoneme (keys %hash) {
      #     print $phoneme . ":";
      #     my @line = @{ $hash{$phoneme} };
      #     print join ",", @line;
      #     print "\n";
      # }

      open INPUTFILE, '<:encoding(UTF-8)', 'test_input.txt';
      while (my $entry = <INPUTFILE>) {
          chomp $entry;
          print $entry . "\n";
          my @letters = $entry =~ /(\X)/g;
          foreach my $letter (@letters) {
              my $norm_letter = NFD($letter);
              my @features = @{ $hash{$norm_letter} };
              print join " ", @features;
              print "\n";
          }
      }

        > It still doesn't work though. The error message complains about not finding $nfd_letter in the hash

        Hmm. Very odd. There is no variable named $nfd_letter in your script.

        This complete, self-contained, pared-down script

        #!perl
        use strict;
        use warnings;
        use charnames qw( :full );
        use Unicode::Normalize;

        binmode STDOUT, ':encoding(UTF-8)';

        # Lookup table of IPA properties by IPA symbol
        my %properties_by = (
            d => [ qw( 0 0 ) ],
            a => [ qw( 0 1 ) ],
            NFD("\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}") => [ qw( 1 0 ) ],
            s => [ qw( 1 1 ) ],
        );

        # List of IPA phonemes
        my @phonemes = (
            "das",
            "d\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}s"
        );

        # For each IPA phoneme...
        for my $phoneme (map { NFD($_) } @phonemes) {

            # ...examine each IPA symbol in it...
            for my $symbol ($phoneme =~ m/(\X)/g) {

                # ...and look up each symbol's IPA properties...
                my $properties = join ", ", @{ $properties_by{$symbol} };

                print "$phoneme\t$symbol\t$properties\n";
            }
        }

        exit 0;

        produces this output

        das	d	0, 0
        das	a	0, 1
        das	s	1, 1
        dɑ̃s	d	0, 0
        dɑ̃s	ɑ̃	1, 0
        dɑ̃s	s	1, 1
        

        By the way, there is no precomposed Unicode character that is the Latin small letter alpha with a tilde. So all the Unicode normalizations of "\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}" are exactly the same, and exactly that.
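
The claim about the normalization forms is easy to check. The following sketch prints the code points each form produces for alpha plus tilde, next to "e" plus acute accent for contrast (which does have a precomposed form). The `code_points` helper is a hypothetical name introduced only for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

# Hypothetical helper: render a string as "U+XXXX U+XXXX ...".
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $alpha_tilde = "\x{251}\x{303}";   # LATIN SMALL LETTER ALPHA + COMBINING TILDE
my $e_acute     = "e\x{301}";         # e + COMBINING ACUTE ACCENT, for contrast

my @forms = ( [ NFC => \&NFC ], [ NFD => \&NFD ], [ NFKC => \&NFKC ], [ NFKD => \&NFKD ] );

for my $form (@forms) {
    my ($name, $normalize) = @$form;
    printf "%-4s  alpha+tilde: %-13s  e+acute: %s\n",
        $name,
        code_points( $normalize->($alpha_tilde) ),
        code_points( $normalize->($e_acute) );
}
```

The alpha column should read U+0251 U+0303 in every row, while the "e" column collapses to U+00E9 under NFC and NFKC. So if the alpha lookup fails, a composed/decomposed mismatch is unlikely to be the cause.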

        > It still doesn't work though.

        Ok, so start by finding out what the difference is between the two keys. Perhaps using Data::Dumper. (Set $Data::Dumper::Useqq = 1; first.)

        Once you find the difference, we can discuss how to address the difference.
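
One way to do that comparison is sketched below. The data is invented for illustration: the stored key uses LATIN SMALL LETTER ALPHA, and the looked-up string is assumed to use GREEK SMALL LETTER ALPHA instead, which looks nearly identical on screen. Nothing in the thread confirms that this is the actual mismatch; substitute the real keys and the real `$letter`.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # print non-ASCII characters as escapes

# Invented stand-ins for the real data (assumption, see the note above).
my %hash   = ( "\x{251}\x{303}" => [ 1, 0 ] );   # LATIN SMALL LETTER ALPHA + COMBINING TILDE
my $letter = "\x{3B1}\x{303}";                   # GREEK SMALL LETTER ALPHA + COMBINING TILDE

# Hypothetical helper: render a string as "U+XXXX U+XXXX ...".
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Dump the stored keys and the lookup string with escapes visible.
print Dumper( [ keys %hash ], $letter );

# Print the same information one code point at a time.
print "key:    ", code_points($_), "\n" for keys %hash;
print "lookup: ", code_points($letter), "\n";
print exists $hash{$letter} ? "match\n" : "no match\n";
```

Checking with `exists` before dereferencing also avoids the "Can't use an undefined value as an ARRAY reference" crash, so the script can report which symbol is missing.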

Re: Unicode combining characters as hash keys?
by Jim (Curate) on Sep 02, 2011 at 22:32 UTC

    Use

    binmode(STDOUT, ':encoding(UTF-8)');

    instead of

    binmode(STDOUT, ":utf8");

    Likewise, use

    open HASH, '<:encoding(UTF-8)', 'test_hash.txt';
    ...
    open INPUTFILE, '<:encoding(UTF-8)', 'test_input.txt';

    instead of

    open HASH, '<:utf8', "test_hash.txt";
    ...
    open INPUTFILE, '<:utf8', "test_input.txt";
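
The post does not give a reason for this advice. As far as the PerlIO documentation goes, `:encoding(UTF-8)` validates the input while `:utf8` does not. The sketch below only shows what the decoding layer does to a line of input. The byte string is an assumption standing in for one line of test_input.txt, read through an in-memory filehandle so the example needs no files; `read_first_line` is a hypothetical helper.

```perl
use strict;
use warnings;

# Assumption: these bytes stand in for one line of test_input.txt, i.e. the
# UTF-8 encoding of d, U+0251, U+0303, s, followed by a newline.
my $bytes = "d\xC9\x91\xCC\x83s\n";

# Hypothetical helper: read the first line of $data through the given layer.
sub read_first_line {
    my ($layer, $data) = @_;
    open my $fh, "<$layer", \$data
        or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $raw     = read_first_line(':raw',             $bytes);
my $decoded = read_first_line(':encoding(UTF-8)', $bytes);

printf "bytes: %d, characters: %d\n", length($raw), length($decoded);
```

Read raw, the line is six bytes long; decoded, it is four characters, which is what the `\X` match and the hash lookup need to see.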

    Also, this looks wrong to me:

    my @word = $entry =~ /(\X)/g;

    Shouldn't that be

    my @word = $entry =~ /(\X+)/g;

    instead?

    UPDATE: Upon reexamination, it looks right to me. :-/ Using the variable name @letters instead of @word would be an improvement, though. Then

    for my $letter (@letters) {
        my @input_features = @{ $hash{$letter} };
        print join(" ", @input_features) . "\n";
    }

      Thanks, Jim.

      I didn't understand the part you crossed out either at first. I didn't know using /.../g would return an array. But I found it in the Perl Cookbook :-)
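
For anyone else puzzled by that line: in list context, a match with /g returns every captured substring, and `\X` matches one grapheme cluster (a base character plus any combining marks). A minimal sketch, using the second word from test_input.txt written with escapes:

```perl
use strict;
use warnings;

my $entry = "d\x{251}\x{303}s";   # d, alpha + combining tilde, s

my @letters     = $entry =~ /(\X)/g;   # one element per grapheme cluster
my @code_points = split //, $entry;    # one element per code point

printf "%d grapheme clusters, %d code points\n",
    scalar @letters, scalar @code_points;
```

This prints "3 grapheme clusters, 4 code points": the alpha and its tilde stay together in `$letters[1]`, which is why the combined string, not the bare alpha, has to be the hash key.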
