Finding substrings of fixed width text with graphemes

albert has asked for the wisdom of the Perl Monks concerning the following question:

This is a follow-up to my related question Counting text with ligatures.

I have two columns of text which might contain a grapheme (such as ﬀ, ﬃ). How can I find a substring at a fixed column position, counting the graphemes as '1' each? In this example, I was hoping to print '0123456789' for each line.

Note: I inserted the code as pre-formatted, since the code-tag encodes the graphemes. I tried to fix using substr from Unicode::GCString, but that didn't work as I hoped.

use strict;
use warnings;

use Unicode::GCString;

while (my $s = <DATA>){
	chomp($s);
	my $right = substr $s, 10;
	print $right, "\n";
	
#	my $gcs = Unicode::GCString->new($s);
#	my $right2 =  $gcs->substr(10);
#	print $right2, "\n";
}

__DATA__
01234567  0123456789
0123456ﬀ  0123456789
0123456ﬃ  0123456789
012ﬀ4ﬃ67  0123456789

Re: Finding substrings of fixed width text with graphemes by haukex (Archbishop) on Sep 13, 2017 at 16:18 UTC
For grapheme support ("extended grapheme cluster" to be exact), which is built into Perl regexes as `\X`, I would suggest you use a regex; for example, `my ($right) = $s =~ /\A\X{10}(.*)\z/s;` works.

However, your substr example also works for me, although I did take your example data, store it in a file, making sure to use UTF-8, and then open it via `open my $fh, '<:encoding(UTF-8)', 'input.txt' or die $!;`. Your ligatures - at least the way you've posted them here - appear to be stored in one Unicode character each (`"\x{FB00}"` and `"\x{FB03}"`). This leads me to suspect your problem might be occurring earlier, i.e. that your file is stored with a different encoding than the one you expect, or you are not opening it with the proper encoding in Perl.

You might want to read the following recent threads for some general advice on dealing with encodings, as well as specific advice on how to find out what encoding was used to store the file: Converting UTF8 to ANSI, Parsing issue (null bytes?), Parsing a Latin-1 Charset Data File. Basically:

1. Be certain what encoding the data is stored with (looking at a hex dump of the file if necessary).
2. Open it with the proper encoding, as I showed above.
3. Inspect the data once you've gotten it into Perl to make sure it was read properly (e.g. using Data::Dump).

Only then can you properly use the facilities Perl provides to deal with Unicode.

Update: Clarified wording a bit.
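A minimal sketch of both approaches described above, under a few assumptions: the sample lines are held in an in-memory string (`$bytes`, a hypothetical stand-in for `input.txt`) that is encoded as UTF-8, and the regex capture group is written as `(.*)` so that it takes the whole remainder of the line. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for input.txt: the sample lines as UTF-8 bytes.
my @lines = (
    "01234567  0123456789",
    "0123456\x{FB00}  0123456789",
    "0123456\x{FB03}  0123456789",
    "012\x{FB00}4\x{FB03}67  0123456789",
);
my $bytes = encode('UTF-8', join("\n", @lines) . "\n");

# Read through an :encoding(UTF-8) layer so each ligature arrives
# in Perl as a single character rather than three bytes.
open my $fh, '<:encoding(UTF-8)', \$bytes or die $!;

my @rights;
while (my $s = <$fh>) {
    chomp $s;
    my $by_substr  = substr $s, 10;              # offset counted in characters
    my ($by_regex) = $s =~ /\A\X{10}(.*)\z/s;    # offset counted in grapheme clusters
    push @rights, $by_substr, $by_regex;
    print "$by_substr | $by_regex\n";
}
close $fh;
```

For this data the two results agree, because each ligature is a single code point. They would differ for text containing combining sequences (a base letter followed by a combining accent, say), where `\X` counts the whole cluster as one and `substr` counts each code point.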
Re^2: Finding substrings of fixed width text with graphemes by albert (Monk) on Sep 13, 2017 at 16:35 UTC
I was indeed seeing an encoding issue. When I make sure to use UTF-8, I get exactly the desired behavior. Thanks.
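For anyone hitting the same symptom, here is a rough sketch of what such an encoding issue looks like, assuming the data is UTF-8 on disk but reaches Perl undecoded. It uses one of the sample lines; `$raw`, `$wrong` and `$right` are names made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One sample line as raw UTF-8 bytes, i.e. what Perl holds
# when the input was read without a decoding layer.
my $raw = encode('UTF-8', "0123456\x{FB00}  0123456789");

# Undecoded: U+FB00 occupies three bytes (EF AC 80), so substr
# counts bytes and the offset lands two positions too early.
my $wrong = substr $raw, 10;

# Decoded: the ligature is one character again and column 10 lines up.
my $chars = decode('UTF-8', $raw);
my $right = substr $chars, 10;

printf "raw length %d, decoded length %d\n", length($raw), length($chars);
printf "undecoded: '%s'\n", $wrong;
printf "decoded:   '%s'\n", $right;

# List the code points around the ligature, a quick check that
# the data was read as intended.
print join(' ', map { sprintf('U+%04X', ord($_)) } split //, substr($chars, 6, 3)), "\n";
```

The undecoded substring starts with the two padding spaces, since the ligature was counted as three positions instead of one.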
