Counting text with ligatures

albert has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file which contains typographic ligatures such as "ﬀ" and "ﬃ". How do I get a length of 1 for each ligature, rather than 2 for "ﬀ" and 3 for "ﬃ", etc.?

$char = "&#64259;";
$len = length($char);

In preview, I see that my "ﬃ" is getting encoded as & #64259, so I can't properly make the example. Here is my sample code formatted as tt.

$char = "ﬃ"; $len = length($char);

How do I get $len as 1 in this example?


Replies are listed 'Best First'.
Re: Counting text with ligatures by haukex (Archbishop) on Sep 13, 2017 at 13:46 UTC
I assume what you want to count is "Graphemes" (see also perluniintro). You should use Perl v5.12 or better; here are a couple of ways (see `\X` and `\b{gcb}`, as well as my post here):

    my $string = "k\x{0301}u\x{032D}o\x{0304}\x{0301}n";
    print "length: ",length($string),"\n"; # wrong way
    my $len = () = $string=~/\X/g;
    print "len: $len\n";
    my @graphs = split /\X\K(?=\X)/, $string;
    print "graphs: ", 0+@graphs, "\n";
    # in Perl v5.22+:
    my @graphs2 = split /\b{gcb}/, $string;
    print "graphs2: ", 0+@graphs2, "\n";
    __END__
    length: 8
    len: 4
    graphs: 4
    graphs2: 4

Re^2: Counting text with ligatures by Corion (Patriarch) on Sep 13, 2017 at 13:51 UTC
At least on Perl 5.14 and Perl 5.20, this doesn't work (and I don't understand why):

    use strict;
    use charnames ":full";
    my $string = "\N{LATIN SMALL LIGATURE FFI}";
    print "length: ",length($string),"\n"; # wrong way
    my $len = () = $string=~/\X/g;
    print "len: $len\n";
    my @graphs = split /\X\K(?=\X)/, $string;
    print "graphs: ", 0+@graphs, "\n";
    __END__
    length: 1
    len: 1
    graphs: 1

Is maybe our understanding of graphemes different from the separate letters of the ligatures?
Re^3: Counting text with ligatures by haukex (Archbishop) on Sep 13, 2017 at 14:02 UTC
My initial understanding of the OP's question was that it has to do with Unicode being able to represent the same user-visible character in multiple different ways, like with combining characters. That is, the two strings `"\N{LATIN SMALL LETTER E WITH ACUTE}"` and `"e\N{COMBINING ACUTE ACCENT}"` report different lengths (1 and 2, respectively), even though on the screen they both look like "é" (one "grapheme"), and so users would expect the "length" of each string to be reported as 1.

I may have misunderstood the OP's question though - if you have the strings `"ffi"` vs. `"ﬃ"`, and you want to know if they have the same length and/or are equal, then perhaps what the OP is looking for is Unicode equivalence (normalization).

    use Unicode::Normalize;
    use Data::Dump;
    dd NFD("\N{LATIN SMALL LETTER E WITH ACUTE}"), NFD("e\N{COMBINING ACUTE ACCENT}");
    dd NFC("\N{LATIN SMALL LETTER E WITH ACUTE}"), NFC("e\N{COMBINING ACUTE ACCENT}");
    dd NFKD("\N{LATIN SMALL LIGATURE FFI}");
    __END__
    ("e\x{301}", "e\x{301}")
    ("\xE9", "\xE9")
    "ffi"

Updated example code to include the "é" examples.
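To make the distinction in this sub-thread concrete, here is a minimal sketch that puts three notions of "length" side by side for a ligature and for a combining sequence: code points (`length`), grapheme clusters (`\X`), and code points after compatibility decomposition (`NFKD`). The helper names `grapheme_count` and `compat_length` are made up for illustration and are not from any module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: count extended grapheme clusters
# (user-perceived characters) by matching \X repeatedly.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

# Hypothetical helper: count code points after compatibility
# decomposition, which splits a ligature into its letters.
sub compat_length {
    my ($str) = @_;
    return length NFKD($str);
}

my %samples = (
    'ffi ligature'    => "\x{FB03}",   # LATIN SMALL LIGATURE FFI
    'e + combining'   => "e\x{0301}",  # e followed by COMBINING ACUTE ACCENT
);

for my $name (sort keys %samples) {
    my $str = $samples{$name};
    printf "%-14s length=%d graphemes=%d nfkd=%d\n",
        $name, length($str), grapheme_count($str), compat_length($str);
}
```

In this sketch the ligature is a single code point and a single grapheme cluster, so only the NFKD count sees its three letters, while the combining sequence is two code points but one grapheme cluster.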
Re: Counting text with ligatures by hippo (Bishop) on Sep 13, 2017 at 14:01 UTC
How do I get $len as 1 in this example?

Works for me:

    $ cat lig.t
    use strict;
    use warnings;
    use utf8;
    use Test::More tests => 1;
    my $char = "ﬃ";
    my $len = length($char);
    is ($len, 1);
    $ perl lig.t
    1..1
    ok 1
    $ perl -v

    This is perl 5, version 20, subversion 3 (v5.20.3) built for x86_64-linux-thread-multi
    (with 16 registered patches, see perl -V for more detail)

If you are working with these sorts of characters you could do a lot worse than go through the "length() miscounting UTF8 characters?" thread.
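One possible reason for seeing a length greater than 1 (an assumption here, since the OP's actual input handling is not shown) is that the string still holds undecoded UTF-8 bytes, for example when the source lacks `use utf8` or a file is read without an encoding layer. A minimal sketch of the difference, using the core `Encode` module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+FB03 (LATIN SMALL LIGATURE FFI) is three bytes.
my $bytes = "\xEF\xAC\x83";

# Undecoded, Perl sees three separate one-byte characters.
my $byte_len = length $bytes;

# After decoding, the string holds a single code point.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

print "bytes: $byte_len\n";   # bytes: 3
print "chars: $char_len\n";   # chars: 1
```

When reading from a file, opening it with an encoding layer, as in `open my $fh, '<:encoding(UTF-8)', $path` (with `$path` standing in for your file name), performs the same decoding on input.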
Re^2: Counting text with ligatures by albert (Monk) on Sep 13, 2017 at 14:04 UTC
Thanks for pointing me to this thread. I knew there would be something similar, but I didn't find the right search terms.
Re: Counting text with ligatures by albert (Monk) on Sep 13, 2017 at 15:58 UTC
The answers to my question only got me part of the way to where I really wanted to be, as I want to find substrings of fixed-width text with graphemes. See "Finding substrings of fixed width text with graphemes" for the follow-up.
