http://www.perlmonks.org?node_id=1199328


in reply to Finding substrings of fixed width text with graphemes

Perl has built-in support for graphemes ("extended grapheme clusters" to be exact) via \X in regular expressions, so I would suggest you use a regex. For example, my ($right) = $s =~ /\A\X{10}(.*)\z/s; works.
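As a minimal sketch of how \X differs from counting code points, consider a string that contains a combining character. The sample string below is made up for illustration and is not taken from your data; it only shows that substr can split a grapheme cluster while \X keeps it whole.

```perl
use strict;
use warnings;

# Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT is
# two code points but a single extended grapheme cluster.
my $s = "caf" . "e\x{301}" . " au lait";

# substr counts code points, so offset 4 lands between "e" and its accent.
my $by_codepoint = substr $s, 4;

# \X matches one whole grapheme cluster, so \X{4} consumes "cafe\x{301}".
my ($by_grapheme) = $s =~ /\A\X{4}(.*)\z/s;

# Count code points and grapheme clusters in the whole string.
my $n_codepoints = length $s;
my $n_graphemes  = () = $s =~ /\X/g;

printf "code points: %d, graphemes: %d\n", $n_codepoints, $n_graphemes;
printf "substr remainder starts with U+%04X\n", ord $by_codepoint;
printf "\\X remainder starts with U+%04X\n",     ord $by_grapheme;
```

Here the substr remainder begins with the orphaned combining accent (U+0301), while the \X remainder begins with the space after the accented letter.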

However, your substr example also works for me. To test it, I took your example data, stored it in a file, making sure to use UTF-8, and then opened it via open my $fh, '<:encoding(UTF-8)', 'input.txt' or die $!;. Your ligatures - at least the way you've posted them here - appear to be stored as a single Unicode character each ("\x{FB00}" and "\x{FB03}"), so counting characters with substr gives the same result as counting graphemes.

This leads me to suspect that your problem occurs earlier: either your file is stored in a different encoding than the one you expect, or you are not opening it with the proper encoding in Perl. You might want to read the following recent threads for some general advice on dealing with encodings, as well as specific advice on how to find out what encoding was used to store the file: Converting UTF8 to ANSI, Parsing issue (null bytes?), Parsing a Latin-1 Charset Data File. Basically: 1. Be certain what encoding the data is stored in (looking at a hex dump of the file if necessary), 2. Open it with the proper encoding, as I showed above, and 3. Inspect the data once you've gotten it into Perl to make sure it was read properly (e.g. using Data::Dump). Only then can you properly use the facilities Perl provides to deal with Unicode.
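Steps 2 and 3 can be sketched roughly as follows. This is only an illustration under a few assumptions: it writes its own small UTF-8 temporary file as a stand-in for your input, and it uses a hypothetical helper, codepoints(), built on core functions instead of Data::Dump so that it runs without extra modules.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file as UTF-8 (a stand-in for the real input file).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "o\x{FB03}ce\n";    # contains U+FB03 LATIN SMALL LIGATURE FFI
close $out or die $!;

# Step 2: open the file with the encoding it is actually stored in.
open my $fh, '<:encoding(UTF-8)', $path or die $!;
my $decoded = <$fh>;
close $fh;
chomp $decoded;

# For contrast: read the same file as raw bytes, without any decoding.
open my $raw_fh, '<:raw', $path or die $!;
my $raw = <$raw_fh>;
close $raw_fh;
chomp $raw;

# Step 3: inspect what actually arrived in Perl, one character at a time.
# codepoints() is a hypothetical helper defined only for this sketch.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $str;
}

print "decoded: ", codepoints($decoded), "\n";
print "raw:     ", codepoints($raw), "\n";
```

The decoded string has four characters, with the ligature as the single code point U+FB03. The raw string has six, because the ligature shows up as its three UTF-8 bytes (EF AC 83). That kind of difference is what throws off substr offsets when the encoding layer is missing or wrong.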

Update: Clarified wording a bit.

Re^2: Finding substrings of fixed width text with graphemes
by albert (Monk) on Sep 13, 2017 at 16:35 UTC
    I was indeed seeing an encoding issue. When I make sure to use UTF-8, I get exactly the desired behavior. Thanks.