http://www.perlmonks.org?node_id=1199301

albert has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file which contains typographic ligatures such as "ﬀ" (U+FB00) and "ﬃ" (U+FB03). How do I get the length as "1" for each ligature, rather than 2 for "ff" and 3 for "ffi", etc.?
$char = "ﬃ"; $len = length($char);
In preview, I see that my "ﬃ" is getting encoded as & #64259, so I can't properly show the example. Here is my sample code formatted as tt.

$char = "ﬃ";
$len = length($char);

How do I get $len as 1 in this example?

Replies are listed 'Best First'.
Re: Counting text with ligatures
by haukex (Archbishop) on Sep 13, 2017 at 13:46 UTC

    I assume what you want to count is "Graphemes" (see also perluniintro). You should use Perl v5.12 or better; here are a couple of ways (see \X and \b{gcb}, as well as my post here):

    my $string = "k\x{0301}u\x{032D}o\x{0304}\x{0301}n";
    print "length: ",length($string),"\n"; # wrong way
    my $len = () = $string=~/\X/g;
    print "len: $len\n";
    my @graphs = split /\X\K(?=\X)/, $string;
    print "graphs: ", 0+@graphs, "\n";
    # in Perl v5.22+:
    my @graphs2 = split /\b{gcb}/, $string;
    print "graphs2: ", 0+@graphs2, "\n";
    __END__
    length: 8
    len: 4
    graphs: 4
    graphs2: 4

      At least on Perl 5.14 and Perl 5.20, this doesn't work (and I don't understand why):

      use strict;
      use charnames ":full";
      my $string = "\N{LATIN SMALL LIGATURE FFI}";
      print "length: ",length($string),"\n"; # wrong way
      my $len = () = $string=~/\X/g;
      print "len: $len\n";
      my @graphs = split /\X\K(?=\X)/, $string;
      print "graphs: ", 0+@graphs, "\n";
      __END__
      length: 1
      len: 1
      graphs: 1

      Perhaps graphemes, as we understand them, are not the same thing as the separate letters of a ligature?

        My initial understanding of the OP's question was that it has to do with Unicode being able to represent the same user-visible character in multiple different ways, like with combining characters. That is, the two strings "\N{LATIN SMALL LETTER E WITH ACUTE}" and "e\N{COMBINING ACUTE ACCENT}" report different lengths (1 and 2, respectively), even though on the screen they both look like "é" (one "grapheme"), so users would expect the "length" of each string to be reported as 1. I may have misunderstood the OP's question though: if you have the strings "ﬃ" (the single ligature character) vs. "ffi" (three separate letters), and you want to know whether they have the same length and/or are equal, then perhaps what the OP is looking for is Unicode equivalence (normalization).

        use Unicode::Normalize;
        use Data::Dump;
        dd NFD("\N{LATIN SMALL LETTER E WITH ACUTE}"), NFD("e\N{COMBINING ACUTE ACCENT}");
        dd NFC("\N{LATIN SMALL LETTER E WITH ACUTE}"), NFC("e\N{COMBINING ACUTE ACCENT}");
        dd NFKD("\N{LATIN SMALL LIGATURE FFI}");
        __END__
        ("e\x{301}", "e\x{301}")
        ("\xE9", "\xE9")
        "ffi"

        Updated example code to include the "é" examples.
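To put the two ideas from this sub-thread side by side, here is a minimal sketch (not from any of the posters) that counts grapheme clusters with \X before and after compatibility decomposition. The helper name grapheme_count is made up for illustration; NFKD comes from Unicode::Normalize as used above, and \X is assumed to behave as on Perl v5.12 or later.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: number of grapheme clusters in a character string.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;   # list assignment in scalar context counts the matches
    return $count;
}

my $ligature = "\x{FB03}";     # LATIN SMALL LIGATURE FFI, one code point
my $accented = "e\x{0301}";    # "e" followed by COMBINING ACUTE ACCENT

printf "ligature as stored:   %d\n", grapheme_count($ligature);
printf "ligature after NFKD:  %d\n", grapheme_count(NFKD($ligature));
printf "e + combining accent: %d\n", grapheme_count($accented);
```

This should report 1, 3 and 1: the ligature is already a single grapheme, and it only turns into three once NFKD splits it into "f", "f", "i". So if the goal is a length of 1 per ligature, the string should probably not be compatibility-normalized before counting.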

Re: Counting text with ligatures
by hippo (Bishop) on Sep 13, 2017 at 14:01 UTC
    How do I get $len as 1 in this example?

    Works for me:

    $ cat lig.t
    use strict;
    use warnings;
    use utf8;
    
    use Test::More tests => 1;
    my $char = "ﬃ";  # U+FB03 LATIN SMALL LIGATURE FFI, a single character
    my $len = length($char);
    is ($len, 1);
    $ perl lig.t
    1..1
    ok 1
    $ perl -v
    
    This is perl 5, version 20, subversion 3 (v5.20.3) built for x86_64-linux-thread-multi
    (with 16 registered patches, see perl -V for more detail)
    

    If you are working with these sorts of characters you could do a lot worse than go through the "length() miscounting UTF8 characters?" thread.
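The test above relies on "use utf8", which only affects literals in the source code. Since the question mentions a text file, the following is a hedged sketch of the same effect for data that arrives as bytes: it assumes the input is UTF-8 encoded and uses decode from the core Encode module. The byte string is simply the UTF-8 encoding of U+FB03, written out by hand for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+FB03 (LATIN SMALL LIGATURE FFI) is three bytes.
my $bytes = "\xEF\xAC\x83";

# Assumption: the data really is UTF-8; decode() turns the bytes into characters.
my $chars = decode('UTF-8', $bytes);

print "undecoded: ", length($bytes), "\n";   # length() counts bytes here
print "decoded:   ", length($chars), "\n";   # length() counts characters here
```

This should print 3 for the undecoded bytes and 1 for the decoded string. When reading from a file, opening the handle with an encoding layer, for example open(my $fh, '<:encoding(UTF-8)', $path) with $path standing in for your own file name, does the decoding as the lines are read.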

      Thanks for pointing to this thread. I knew there would be something similar, but I didn't hit on the right search terms.
Re: Counting text with ligatures
by albert (Monk) on Sep 13, 2017 at 15:58 UTC