note
farang
<p><blockquote>the perldoc entry for length (which I checked
beforehand to make sure it wouldn't count bytes -- hence my
confusion)</blockquote>
It "normally deals in logical characters", but those are Unicode
code points, so its logic doesn't cover all the intricacies of
Unicode.</p>
<p><blockquote>Do you have any specific languages or complex data in mind with which it might fail?</blockquote>
Yes, Thai is the main one I'm involved with. The modified
script below shows that <tt>length</tt> counts diacritics in
Thai as separate characters, which may or may not be what is wanted.
That is inconsistent with the results for the Latin diacritics in
your dataset, which <tt>length</tt> isn't counting separately,
since those appear there as precomposed single code points. I'm using
pre tags so that the Thai will display correctly, and I've shortened the lines to facilitate copy/paste.
<pre>
#!/usr/bin/env perl
use warnings;
use v5.14;   # also enables strict and say
use Unicode::Normalize qw/NFD/;
binmode STDOUT, ':encoding(UTF-8)';
binmode DATA, ':encoding(UTF-8)';
while (<DATA>) {
    chomp;
    print $_, ': ';
    # Drop ASCII letters; keep the rest.
    s/[A-Za-z]//g;
    my $alphacount = () = /\p{Alpha}/g;
    say "non-(A-Za-z) symbols <$_>",
        " contain $alphacount",
        " alphabetic characters and ",
        getdia($_), " diacritical chars.";
    say "length() thinks there are ",
        length, " characters\n";
}
# Count diacritics after decomposing
# precomposed characters with NFD.
sub getdia {
    my $normalized = NFD($_[0]);
    my $diacount = () =
        $normalized =~ /\p{Dia}/g;
    return $diacount;
}
__DATA__
เป็น
ผู้หญิง
เมื่อวันก่อน
æðaber
æðahnútur
æðakölkun
</pre></p>
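<p>If what is wanted is a count of user-perceived characters rather
than code points, one option is to count extended grapheme clusters
with <tt>\X</tt>. This is only a minimal sketch, assuming Perl 5.12
or later; <tt>count_graphemes</tt> is a hypothetical helper name of
mine, not part of the script above, and the exact cluster boundaries
depend on the Unicode version your perl ships with.
<pre>
#!/usr/bin/env perl
use warnings;
use v5.14;   # also enables strict and say
use utf8;    # the literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: count extended
# grapheme clusters, so a base character
# plus its combining marks counts once.
sub count_graphemes {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

for my $word ('เป็น', 'ผู้หญิง', 'æðakölkun') {
    say "$word: length() says ",
        length($word), ", \\X says ",
        count_graphemes($word);
}
</pre>
For the Thai words the <tt>\X</tt> count should come out lower than
<tt>length</tt>, because the combining vowels and tone marks attach to
their base consonant, while for precomposed Latin letters the two
counts should agree.</p>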
1084035
1084101