http://www.perlmonks.org?node_id=1199324

albert has asked for the wisdom of the Perl Monks concerning the following question:

This is a follow-up to my related question Counting text with ligatures.

I have two columns of text which might contain ligature graphemes (such as ﬀ or ﬃ). How can I find a substring at a fixed column position, counting each grapheme as 1? In this example, I was hoping to print '0123456789' for each line.

Note: I inserted the code as pre-formatted text, since the code tag encodes the graphemes. I tried to fix this using substr from Unicode::GCString, but that didn't work as I hoped.

use strict;
use warnings;

use Unicode::GCString;

while (my $s = <DATA>){
	chomp($s);
	my $right = substr $s, 10;
	print $right, "\n";
	
#	my $gcs = Unicode::GCString->new($s);
#	my $right2 =  $gcs->substr(10);
#	print $right2, "\n";
}

__DATA__
01234567  0123456789
0123456ﬀ  0123456789
0123456ﬃ  0123456789
012ﬀ4ﬃ67  0123456789

Re: Finding substrings of fixed width text with graphemes
by haukex (Archbishop) on Sep 13, 2017 at 16:18 UTC

    For grapheme support built into Perl (\X, which matches an "extended grapheme cluster" to be exact), I would suggest you use a regex; for example, my ($right) = $s =~ /\A\X{10}(.*)\z/s; works.
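
    As a minimal sketch of the regex approach, the snippet below wraps it in a helper. The name after_graphemes and the sample line are made up for illustration. The sample uses "e" followed by a combining accent (two code points, one cluster), because that is a case where counting clusters and counting code points actually differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return everything
# after the first $n extended grapheme clusters of $s.
sub after_graphemes {
    my ($s, $n) = @_;
    my ($right) = $s =~ /\A\X{$n}(.*)\z/s;
    return $right;    # undef if $s has fewer than $n clusters
}

# Illustrative data: "e" + COMBINING ACUTE ACCENT is one cluster but two code points.
my $line = "0123456e\x{301}  0123456789";

print after_graphemes($line, 10), "\n";    # counts grapheme clusters
print substr($line, 10), "\n";             # counts code points, so it starts one character early
```

    Here the regex version returns the right-hand column, while plain substr still includes one of the separating spaces.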

    However, your substr example also works for me, although I did take your example data, store it in a file, making sure to use UTF-8, and then I opened it via open my $fh, '<:encoding(UTF-8)', 'input.txt' or die $!;. Your ligatures - at least the way you've posted them here - appear to be stored in one Unicode character each ("\x{FB00}" and "\x{FB03}").

    This leads me to suspect your problem might be occurring earlier, i.e. that your file is stored with a different encoding than the one you expect, or you are not opening it with the proper encoding in Perl. You might want to read the following recent threads for some general advice on dealing with encodings as well as specific advice on how to find out what encoding was used to store the file: Converting UTF8 to ANSI, Parsing issue (null bytes?), Parsing a Latin-1 Charset Data File - basically: 1. Be certain what encoding the data is stored with (looking at a hex dump of the file if necessary), 2. Open it with the proper encoding, as I showed above, and 3. Inspect the data once you've gotten it into Perl to make sure it was read properly (e.g. using Data::Dump). Only then can you properly use the facilities Perl provides to deal with Unicode.
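
    The three steps above can be sketched roughly as follows, using only core Perl. The codepoints helper is a hypothetical stand-in for a dumper such as Data::Dump, and the file input.txt is assumed to be stored as UTF-8.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: show each character of a string as U+XXXX,
# to check that the data was decoded the way you expect.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

# Assumption: 'input.txt' holds the sample data and is stored as UTF-8.
my $file = 'input.txt';
if (open my $fh, '<:encoding(UTF-8)', $file) {
    while (my $line = <$fh>) {
        chomp $line;
        # A correctly decoded ligature appears as one code point, e.g. U+FB00.
        print codepoints($line), "\n";
        print substr($line, 10), "\n" if length($line) > 10;
    }
    close $fh;
}
else {
    warn "Cannot open $file: $!\n";
}
```

    If the same file is read without a decoding layer, the ligature U+FB00 shows up as its three UTF-8 bytes (U+00EF U+00AC U+0080), which would shift every fixed column position after it by two.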

    Update: Clarified wording a bit.

      I was indeed seeing an encoding issue. When I make sure to use UTF-8, I get exactly the desired behavior. Thanks.