PerlMonks  

Re^2: Unicode combining characters as hash keys?

by Anonymous Monk
on Sep 03, 2011 at 08:11 UTC (#923970)


in reply to Re: Unicode combining characters as hash keys?
in thread Unicode combining characters as hash keys?

> Hashes have no problem with combining marks.

Thanks for the confirmation. That was how I understood the documentation, but I wasn't sure.

> The problem is probably that the character appears both in composed and decomposed form.

If it does, I don't understand where the difference comes from. I entered both in composed form. Note that the two "a"-like characters are different, but you only see it in a font that distinguishes "small alpha" from "a".

> You can use Unicode::Normalize's NFC or NFD to normalize the form.

Thanks. I read up on normalization and tried replacing
my $key = $1;
with
my $key = NFD($1); # or NFC, or even NFKD/NFKC
as well as replacing the last for-loop with
foreach my $letter (@letters) {
    my $norm_letter = NFD($letter);
    my @features = @{ $hash{$norm_letter} };
    print join " ", @features;
    print "\n";
}

It still doesn't work though. The error message complains about not finding $nfd_letter in the hash, although the for-loop I commented out in the original script definitely shows it was added:
Can't use an undefined value as an ARRAY reference at ./test.plx line 31, <INPUTFILE> line 2.

The script now reads as follows (including the changes Jim suggested):

use Unicode::Normalize;

binmode(STDOUT, ':encoding(UTF-8)');

open HASH, '<:encoding(UTF-8)', 'test_hash.txt';
my %hash = ();
while (my $line=<HASH>) {
    chomp $line;
    $line =~ s/^(.*?)\t//;
    my $key = NFD($1);
    my @line = split /\s+/, $line;
    $hash{$key} = \@line;
}

# foreach my $phoneme (keys %hash) {
#     print $phoneme . ":";
#     my @line = @{ $hash{$phoneme} };
#     print join ",", @line;
#     print "\n";
# }

open INPUTFILE, '<:encoding(UTF-8)', 'test_input.txt';
while (my $entry = <INPUTFILE>) {
    chomp $entry;
    print $entry . "\n";
    my @letters = $entry =~ /(\X)/g;
    foreach my $letter (@letters) {
        my $norm_letter = NFD($letter);
        my @features = @{ $hash{$norm_letter} };
        print join " ", @features;
        print "\n";
    }
}
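The "undefined value as an ARRAY reference" error above is what Perl reports when a hash lookup returns undef and the result is dereferenced anyway. One way to see which key is missing, rather than dying, might be a guarded lookup like the rough sketch below. The `%hash` contents and the `describe` helper are hypothetical stand-ins, not part of the script above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD );

# Hypothetical lookup table standing in for the contents of test_hash.txt.
my %hash = (
    NFD("d")              => [ 0, 0 ],
    NFD("\x{251}\x{303}") => [ 1, 0 ],   # small alpha + combining tilde
);

# Return the features for a grapheme, or a diagnostic string that
# lists the code points of the key when no entry exists.
sub describe {
    my ($letter) = @_;
    my $key = NFD($letter);
    return join " ", @{ $hash{$key} } if exists $hash{$key};
    my $code_points = join " ", map { sprintf "U+%04X", ord } split //, $key;
    return "no entry for [$code_points]";
}

# The last grapheme (a bare small alpha) is deliberately absent.
print describe($_), "\n" for "d", "\x{251}\x{303}", "\x{251}";
```

Printing code points instead of the characters themselves sidesteps fonts and terminal encoding, so an unexpected character in the input shows up plainly.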


Re^3: Unicode combining characters as hash keys?
by ikegami (Pope) on Sep 05, 2011 at 06:58 UTC

    > It still doesn't work though.

    OK, so start by finding out how the two keys differ, perhaps using Data::Dumper. (Set $Data::Dumper::Useqq = 1; first.)

    Once you find the difference, we can discuss how to address the difference.
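As a rough illustration of that suggestion, the sketch below dumps two keys with `$Data::Dumper::Useqq` set, so that non-ASCII and control characters come out as escapes rather than glyphs. The two strings are made-up stand-ins for a key stored in the hash and a key used for lookup; the trailing carriage return is chosen only for illustration and is not a diagnosis of the script in this thread.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show unprintable and non-ASCII characters as escapes
$Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix

# Hypothetical keys: they render identically in most fonts and terminals.
my $stored_key = "\x{251}\x{303}";      # small alpha + combining tilde
my $lookup_key = "\x{251}\x{303}\r";    # the same, plus a stray carriage return

my $stored_dump = Dumper($stored_key);
my $lookup_dump = Dumper($lookup_key);

print "stored: $stored_dump";
print "lookup: $lookup_dump";
print "keys differ\n" if $stored_key ne $lookup_key;
```

Because the dump is plain ASCII, any difference between the two keys is visible even when the characters themselves look the same on screen.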

Re^3: Unicode combining characters as hash keys?
by Jim (Curate) on Sep 03, 2011 at 21:52 UTC
    > It still doesn't work though. The error message complains about not finding $nfd_letter in the hash…

    Hmm. Very odd. There is no variable named $nfd_letter in your script.

    This complete, self-contained, pared-down script…

    #!perl

    use strict;
    use warnings;
    use charnames qw( :full );
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';

    # Lookup table of IPA properties by IPA symbol
    my %properties_by = (
        d => [ qw( 0 0 ) ],
        a => [ qw( 0 1 ) ],
        NFD("\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}") => [ qw( 1 0 ) ],
        s => [ qw( 1 1 ) ],
    );

    # List of IPA phonemes
    my @phonemes = (
        "das",
        "d\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}s"
    );

    # For each IPA phoneme...
    for my $phoneme (map { NFD($_) } @phonemes) {

        # ...examine each IPA symbol in it...
        for my $symbol ($phoneme =~ m/(\X)/g) {

            # ...and look up each symbol's IPA properties...
            my $properties = join ", ", @{ $properties_by{$symbol} };

            print "$phoneme\t$symbol\t$properties\n";
        }
    }

    exit 0;

    …produces this output…

    das	d	0, 0
    das	a	0, 1
    das	s	1, 1
    dɑ̃s	d	0, 0
    dɑ̃s	ɑ̃	1, 0
    dɑ̃s	s	1, 1
    

    By the way, there is no precomposed Unicode character that is the Latin small letter alpha with a tilde. So all the Unicode normalizations of "\N{LATIN SMALL LETTER ALPHA}\N{COMBINING TILDE}" are exactly the same—and exactly that.
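That last point can be checked directly. The sketch below, assuming only the standard Unicode::Normalize exports, counts how many of the four normalization forms leave the alpha-plus-tilde sequence unchanged, and contrasts it with plain "a" plus tilde, which does have a precomposed form (U+00E3). The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD NFKC NFKD );

my $alpha_tilde = "\x{251}\x{303}";   # LATIN SMALL LETTER ALPHA + COMBINING TILDE
my $a_tilde     = "a\x{303}";         # LATIN SMALL LETTER A + COMBINING TILDE

# Count the normalization forms that return alpha+tilde untouched.
my $unchanged = 0;
for my $form (\&NFC, \&NFD, \&NFKC, \&NFKD) {
    $unchanged++ if $form->($alpha_tilde) eq $alpha_tilde;
}
printf "forms leaving alpha+tilde unchanged: %d of 4\n", $unchanged;

# For comparison: a+tilde composes to one code point under NFC.
printf "NFC(a+tilde) length: %d\n", length NFC($a_tilde);
printf "NFD(a+tilde) length: %d\n", length NFD($a_tilde);
```

So for this particular symbol the choice of NFC versus NFD should make no difference to the hash key; it only matters for characters that have a precomposed equivalent.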
