PerlMonks

### Re: Counting text with ligatures

by haukex (Abbot) on Sep 13, 2017 at 13:46 UTC (#1199304)

in reply to Counting text with ligatures

I assume what you want to count is "Graphemes" (see also perluniintro). You should use Perl v5.12 or better; here are a couple of ways (see \X and \b{gcb}, as well as my post here):

```
my $string = "k\x{0301}u\x{032D}o\x{0304}\x{0301}n";
print "length: ", length($string), "\n"; # wrong way: counts code points

# count the matches of \X (each match is one grapheme cluster)
my $len = () = $string =~ /\X/g;
print "len: $len\n";

# split after every grapheme cluster that is followed by another one
my @graphs  = split /\X\K(?=\X)/, $string;
print "graphs: ", 0+@graphs, "\n";

# in Perl v5.22+:
my @graphs2 = split /\b{gcb}/, $string;
print "graphs2: ", 0+@graphs2, "\n";

__END__

length: 8
len: 4
graphs: 4
graphs2: 4
```
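The `\X` idiom above only gives sensible counts on a decoded character string. As a minimal sketch, assuming the text arrives as raw UTF-8 bytes (for example, read from a file without an encoding layer), it can be decoded with the core `Encode` module first. The helper name `count_graphemes` is made up for illustration and is not from the post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count grapheme clusters in a character string
# using the "() = ... =~ /\X/g" idiom shown above.
sub count_graphemes {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

# "e" followed by COMBINING ACUTE ACCENT, as raw UTF-8 bytes
my $bytes = "e\xCC\x81";
my $chars = decode('UTF-8', $bytes);

print "bytes: ",     length($bytes),          "\n"; # 3
print "chars: ",     length($chars),          "\n"; # 2
print "graphemes: ", count_graphemes($chars), "\n"; # 1
```

Counting on the undecoded bytes would treat each byte as its own character, so the decode step matters whenever the input is not already a Perl character string.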

### Re^2: Counting text with ligatures

by Corion (Pope) on Sep 13, 2017 at 13:51 UTC

At least on Perl 5.14 and Perl 5.20, this doesn't work (and I don't understand why):

```
use strict;
use charnames ":full";

my $string = "\N{LATIN SMALL LIGATURE FFI}";
print "length: ", length($string), "\n"; # wrong way

my $len = () = $string =~ /\X/g;
print "len: $len\n";

my @graphs  = split /\X\K(?=\X)/, $string;
print "graphs: ", 0+@graphs, "\n";

__END__

length: 1
len: 1
graphs: 1
```

Is our understanding of graphemes perhaps different from what the OP wants to count, namely the separate letters of the ligature?

My initial understanding of the OP's question was that it has to do with Unicode being able to represent the same user-visible character in multiple different ways, for example with combining characters. That is, the two strings "\N{LATIN SMALL LETTER E WITH ACUTE}" and "e\N{COMBINING ACUTE ACCENT}" report different lengths (1 and 2, respectively), even though on the screen they both look like "é" (one "grapheme"), so users would expect the "length" of each string to be reported as 1. I may have misunderstood the OP's question though: if you have the strings "ffi" vs. "ﬃ", and you want to know whether they have the same length and/or are equal, then perhaps what the OP is looking for is Unicode equivalence (normalization).

```
use Unicode::Normalize;
use Data::Dump;
dd NFD("\N{LATIN SMALL LETTER E WITH ACUTE}"),
   NFD("e\N{COMBINING ACUTE ACCENT}");
dd NFC("\N{LATIN SMALL LETTER E WITH ACUTE}"),
   NFC("e\N{COMBINING ACUTE ACCENT}");
dd NFKD("\N{LATIN SMALL LIGATURE FFI}");
__END__
("e\x{301}", "e\x{301}")
("\xE9", "\xE9")
"ffi"
```

Updated example code to include the "é" examples.
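If the goal really is to count the separate letters of a ligature, the two ideas from this thread can be combined: apply compatibility decomposition (NFKD) first, then count `\X` matches. The following is only a sketch under that assumption; the helper name `count_graphemes_nfkd` is hypothetical, and whether NFKD is the right normalization form depends on what the OP actually needs.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: decompose compatibility characters such as
# ligatures, then count grapheme clusters in the result.
sub count_graphemes_nfkd {
    my ($str) = @_;
    my $decomposed = NFKD($str);
    my $count = () = $decomposed =~ /\X/g;
    return $count;
}

my $ligature = "\x{FB03}"; # LATIN SMALL LIGATURE FFI
my $plain    = "ffi";

print "ligature: ", count_graphemes_nfkd($ligature), "\n"; # 3
print "plain: ",    count_graphemes_nfkd($plain),    "\n"; # 3
print "equal after NFKD: ",
    (NFKD($ligature) eq NFKD($plain) ? "yes" : "no"), "\n"; # yes
```

Combining characters still count as part of their base letter here, so "e\x{301}" is reported as 1 both before and after decomposition.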
