Re^2: Counting text with ligatures

by Corion (Pope)
At least on Perl 5.14 and Perl 5.20, this doesn't work (and I don't understand why):

use strict;
use charnames ":full";
my $string = "\N{LATIN SMALL LIGATURE FFI}";
print "length: ",length($string),"\n"; # wrong way
my $len = () = $string=~/\X/g;
print "len: $len\n";
my @graphs = split /\X\K(?=\X)/, $string;
print "graphs: ", 0+@graphs, "\n";
__END__
length: 1
len: 1
graphs: 1

Is our understanding of graphemes perhaps different from the separate letters of a ligature?

Re^3: Counting text with ligatures
by haukex (Abbot) on Sep 13, 2017 at 14:02 UTC

    My initial understanding of the OP's question was that it has to do with Unicode being able to represent the same user-visible character in multiple different ways, for example with combining characters. That is, the two strings "\N{LATIN SMALL LETTER E WITH ACUTE}" and "e\N{COMBINING ACUTE ACCENT}" report different lengths (1 and 2, respectively), even though on the screen they both look like "é" (one "grapheme"), so users would expect the "length" of each string to be reported as 1. I may have misunderstood the OP's question, though: if you have the strings "ﬃ" vs. "ffi" and you want to know whether they have the same length and/or are equal, then perhaps what the OP is looking for is Unicode equivalence (normalization).

    use Unicode::Normalize;
    use Data::Dump;
    dd NFD("\N{LATIN SMALL LETTER E WITH ACUTE}"), NFD("e\N{COMBINING ACUTE ACCENT}");
    dd NFC("\N{LATIN SMALL LETTER E WITH ACUTE}"), NFC("e\N{COMBINING ACUTE ACCENT}");
    dd NFKD("\N{LATIN SMALL LIGATURE FFI}");
    __END__
    ("e\x{301}", "e\x{301}")
    ("\xE9", "\xE9")
    "ffi"

    Updated example code to include the "é" examples.
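
The two ideas in this thread (counting graphemes with \X, and compatibility normalization with NFKD) can be combined. What follows is only a rough sketch, assuming that what the OP wants is "number of letters a reader would count", with ligatures expanded. The helper names grapheme_count and letter_count are made up for illustration and are not part of any module. Note that NFKD also rewrites other compatibility characters (superscripts, fullwidth forms and so on), which may or may not be wanted.

```perl
use strict;
use warnings;
use charnames ":full";
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: count extended grapheme clusters.
# Each successful \X match is one cluster.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

# Hypothetical helper: expand compatibility characters such as
# ligatures with NFKD first, then count grapheme clusters.
sub letter_count {
    my ($str) = @_;
    return grapheme_count( NFKD($str) );
}

for my $case (
    [ "ligature",   "\N{LATIN SMALL LIGATURE FFI}" ],
    [ "composed",   "\N{LATIN SMALL LETTER E WITH ACUTE}" ],
    [ "decomposed", "e\N{COMBINING ACUTE ACCENT}" ],
) {
    my ($label, $str) = @$case;
    printf "%-10s length=%d graphemes=%d letters=%d\n",
        $label, length($str), grapheme_count($str), letter_count($str);
}

# Prints:
# ligature   length=1 graphemes=1 letters=3
# composed   length=1 graphemes=1 letters=1
# decomposed length=2 graphemes=1 letters=1
```

The ligature is a single code point and a single grapheme cluster, which is why both length and \X report 1 for it; only the compatibility decomposition turns it into the three letters "ffi".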

