Note: take what I say here with a grain of salt since I know no Vietnamese.
Here's the Vietnamese alphabet sort order. And here's how to read that chart:
- First column (darkest colour) has the letter in question
- The other columns have the glyphs that sort under that letter
- Therefore, ấ and Ầ and ậ sort under â (will be found in the dictionary under the heading 'â')
- When two words are identical apart from their diacritics, break the tie using the left-to-right order given in the chart.
Here's how I handled Japanese sorting (hiragana only) based on a similar chart for Japanese:
use strict;
use warnings;
use utf8;                              # the source contains literal kana
binmode STDOUT, ':encoding(UTF-8)';    # so the output prints correctly

# Map voiced, semi-voiced and small kana onto their plain counterparts.
sub transliterate {
    my $str = shift;
    $str =~
        tr(がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽっゃゅょ)
          (かきくけこさしすせそたちつてとはひふへほはひふへほつやゆよ);
    return $str;
}

# Compare on the stripped key first; fall back to the real reading on a tie.
sub gozyuuon {
    $a->{'sort'} cmp $b->{'sort'} ||
    $a->{'reading'} cmp $b->{'reading'};
}

my @rows = (
    { word => '同時', reading => 'どうじ' },
    { word => '当日', reading => 'とうじつ' },
    { word => '同士', reading => 'どうし' },
    { word => '投資', reading => 'とうし' },
    { word => '当時', reading => 'とうじ' },
    { word => '同室', reading => 'どうしつ' },
);

# create a sort key with the dakuten/handakuten stripped and small kana enlarged
for (@rows) {
    $_->{'sort'} = transliterate($_->{reading});
}

for my $row (sort gozyuuon @rows) {
    printf "%s・%s\n", $row->{reading}, $row->{word};
}
Japanese is a bit easier since the Unicode codepoints for hiragana are already in the correct order; I only needed to handle the characters that share a sort position.
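Carrying the same two-level idea over to Vietnamese might look something like the sketch below. Treat it as a guess rather than a tested solution: the letter order in @letters, the tone order in %tone_rank, and the helper name vi_keys are all my own assumptions, so check both orders against the chart before relying on it. It also assumes the input has already been decoded into Perl character strings.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

binmode STDOUT, ':encoding(UTF-8)';

# ASSUMPTION: letter order of the Vietnamese alphabet (verify against the chart).
my @letters = qw(a ă â b c d đ e ê g h i k l m n o ô ơ p q r s t u ư v x y);
my %letter_rank;
@letter_rank{@letters} = (1 .. scalar @letters);

# ASSUMPTION: tie-break order of the tone marks (verify against the chart):
# none, grave, hook above, tilde, acute, dot below.
my %tone_rank = (
    "\x{300}" => 1,   # combining grave
    "\x{309}" => 2,   # combining hook above
    "\x{303}" => 3,   # combining tilde
    "\x{301}" => 4,   # combining acute
    "\x{323}" => 5,   # combining dot below
);

# Return two keys for a word: one built from the base letters
# (tone marks removed, but â/ă/ê/ô/ơ/ư/đ kept distinct) and one
# built from the tone marks, used only to break ties.
sub vi_keys {
    my $word = NFC(lc shift);
    my (@primary, @tones);
    for my $ch (split //, $word) {
        my $d    = NFD($ch);
        my $tone = 0;
        $tone = $tone_rank{$1}
            if $d =~ s/([\x{300}\x{301}\x{303}\x{309}\x{323}])//;
        my $base = NFC($d);                       # e.g. ấ becomes â
        push @primary, $letter_rank{$base} // 0;  # unknown characters sort first
        push @tones,   $tone;
    }
    return (pack('C*', @primary), pack('C*', @tones));
}

my @words = qw(mâ mạ má ma mã mà mả);

my @sorted =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] || $a->[2] cmp $b->[2] }
    map  { [ $_, vi_keys($_) ] } @words;

print "$_\n" for @sorted;
```

As with the Japanese version, the first key decides which heading a word falls under, and the second key only matters when the first keys are equal.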