I've been reading a lot lately about Unicode because I need to work with characters of the IPA (International Phonetic Alphabet), including diacritics.
Figuring things out in Perl has been quite easy thanks to the good documentation (unlike getting my xterm to display Unicode... *gr*), but I'm stuck on one last detail.
In the following minimal example, I want to look things up in a hash, and in some cases the hash key contains one of those "combining characters" (a "small alpha" with a tilde on top of it). The script works OK for all other keys, but breaks when a combining character is used as the key. I don't know what to do here. Can anyone help me?
file "test_hash.txt"
d 0 0
a 0 1
ɑ̃ 1 0
s 1 1
file "test_input.txt"
das
dɑ̃s
the script itself:
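#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");

# Build the lookup table: the first tab-separated column is the key,
# the remaining columns are its features.
open my $hash_fh, '<:utf8', "test_hash.txt"
    or die "Cannot open test_hash.txt: $!";
my %hash = ();
while (my $line = <$hash_fh>) {
    chomp $line;
    $line =~ s/^(.*?)\t// or next;    # skip lines without a tab after the key
    my $key = $1;
    my @line = split /\s+/, $line;
    $hash{$key} = \@line;
}
close $hash_fh;

# Debugging aid: dump the whole table.
# foreach my $phoneme (keys %hash) {
#     print $phoneme . ":";
#     my @line = @{ $hash{$phoneme} };
#     print join ",", @line;
#     print "\n";
# }

open my $input_fh, '<:utf8', "test_input.txt"
    or die "Cannot open test_input.txt: $!";
while (my $entry = <$input_fh>) {
    chomp $entry;
    print $entry . "\n";
    # \X matches one extended grapheme cluster (a base character plus
    # any combining marks that follow it).
    my @word = $entry =~ /(\X)/g;
    for my $letter (@word) {
        my $input_features = $hash{$letter};
        unless (defined $input_features) {
            # Report the missing key as code points instead of dying on
            # an undefined array reference.
            warn "No entry for: ",
                join(" ", map { sprintf "U+%04X", ord($_) } split //, $letter),
                "\n";
            next;
        }
        my @input_features = @{ $input_features };
        print join " ", @input_features;
        print "\n";
    }
}
close $input_fh;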
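One thing that might be worth checking (this is a guess, nothing in the files above confirms it) is whether the key in test_hash.txt and the cluster in test_input.txt are really the same sequence of code points. Below is a minimal sketch that prints code points and normalizes both sides with Unicode::Normalize before comparing. The helpers `codepoints` and `normalize_key` are names made up for this illustration. It uses "a" plus a combining tilde rather than the IPA alpha, because that pair has a precomposed form (U+00E3) and so shows the mismatch clearly. As far as I know, U+0251 plus U+0303 has no precomposed form, so for that exact key normalization may turn out not to be the culprit, and the code point dump is the more useful part.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: show a string as a list of code points,
# so two strings that look identical on screen can be compared.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# Hypothetical helper: one canonical form (NFD here, an arbitrary choice)
# applied both when storing keys and when looking them up.
sub normalize_key {
    my ($str) = @_;
    return NFD($str);
}

my $decomposed  = "a\x{0303}";   # LATIN SMALL LETTER A + COMBINING TILDE
my $precomposed = "\x{00E3}";    # LATIN SMALL LETTER A WITH TILDE

# Store the key in normalized form.
my %hash = ( normalize_key($decomposed) => [ 1, 0 ] );

print "decomposed:  ", codepoints($decomposed),  "\n";
print "precomposed: ", codepoints($precomposed), "\n";

# A raw lookup with the other form misses; a normalized lookup hits.
print "raw lookup:        ",
    ( exists $hash{$precomposed} ? "found" : "missing" ), "\n";
print "normalized lookup: ",
    ( exists $hash{ normalize_key($precomposed) } ? "found" : "missing" ), "\n";
```

If the dumps for the hash key and the input cluster differ in the real files, applying the same normalization in both loops of the script would be the next thing to try. If they are identical, the cause lies elsewhere (for example in how the key column is separated from the features).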