Re: Finding substrings of fixed width text with graphemes

by haukex (Abbot)
on Sep 13, 2017 at 16:18 UTC (#1199328=note)

in reply to Finding substrings of fixed width text with graphemes

Perl has built-in support for graphemes ("extended grapheme clusters", to be exact) via \X in regexes, so I would suggest using a regex. For example, my ($right) = $s =~ /\A\X{10}(.*)\z/s; works: it skips the first ten graphemes and captures the rest.
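To illustrate the difference between counting graphemes with \X and counting code points with substr, here is a minimal sketch. The sample string and the grapheme_substr helper are made up for illustration and are not from the original question; the sample uses a combining accent because that is a case where one grapheme spans two code points.

```perl
use strict;
use warnings;

# Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT
# is two code points but a single extended grapheme cluster.
my $s = "cafe\x{301} au lait";

# Skip the first 5 graphemes ("c", "a", "f", "e+accent", " ") and keep the rest.
my ($right) = $s =~ /\A\X{5}(.*)\z/s;

# substr counts code points, so offset 5 lands on the space instead.
my $by_codepoint = substr $s, 5;

# Hypothetical helper: a substr-like function that counts graphemes.
sub grapheme_substr {
    my ($str, $offset, $length) = @_;
    my @clusters = $str =~ /\X/g;    # one list element per grapheme
    $length = @clusters - $offset unless defined $length;
    return join '', splice(@clusters, $offset, $length);
}

my $first_four = grapheme_substr($s, 0, 4);    # "cafe" plus the accent
```

Here $right holds "au lait", while $by_codepoint still starts with the space, because the accent was counted as a character of its own.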

However, your substr example also works for me. I took your example data, stored it in a file (making sure to save it as UTF-8), and then opened it via open my $fh, '<:encoding(UTF-8)', 'input.txt' or die $!;. Your ligatures, at least the way you've posted them here, appear to be stored as one Unicode character each ("\x{FB00}" and "\x{FB03}"), so substr counts each of them as a single character once the data has been decoded.
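A small sketch of why the encoding layer matters for substr. The sample line and the temporary file are stand-ins I made up (your real data lives in input.txt); the point is only to compare a decoded read with a raw read of the same bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample line with U+FB03 (ffi ligature) and U+FB00 (ff ligature).
my $line = "o\x{FB03}ce sta\x{FB00}";

# Write the sample as UTF-8 to a temporary file standing in for input.txt.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-8)';
print {$tmp_fh} "$line\n";
close $tmp_fh or die $!;

# With the decoding layer, each ligature arrives as one character.
open my $fh, '<:encoding(UTF-8)', $path or die $!;
chomp(my $decoded = <$fh>);
close $fh;

# Without it, Perl sees raw bytes, and each ligature is three of them.
open my $raw_fh, '<:raw', $path or die $!;
chomp(my $raw = <$raw_fh>);
close $raw_fh;

my $decoded_len = length $decoded;    # 9 characters
my $raw_len     = length $raw;        # 13 bytes

# The same substr offset therefore lands in different places.
my $tail_decoded = substr $decoded, 5;    # "sta" plus the ff ligature
my $tail_raw     = substr $raw, 5;        # starts at "e", four bytes too early
```

If your fixed-width columns come out shifted by a couple of positions per ligature, that is the symptom the raw read shows here.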

This leads me to suspect your problem might be occurring earlier, i.e. that your file is stored with a different encoding than the one you expect, or you are not opening it with the proper encoding in Perl. You might want to read the following recent threads for some general advice on dealing with encodings as well as specific advice on how to find out what encoding was used to store the file: Converting UTF8 to ANSI, Parsing issue (null bytes?), Parsing a Latin-1 Charset Data File - basically: 1. Be certain what encoding the data is stored with (looking at a hex dump of the file if necessary), 2. Open it with the proper encoding, as I showed above, and 3. Inspect the data once you've gotten it into Perl to make sure it was read properly (e.g. using Data::Dump). Only then can you properly use the facilities Perl provides to deal with Unicode.
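The three steps above could look roughly like this. It is a sketch under assumptions: $bytes stands in for data read from your file with a '<:raw' layer, and I use the core module Data::Dumper with Useqq instead of Data::Dump so that it runs without installing anything.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Data::Dumper;

# Stand-in for bytes read from the file with '<:raw' (hypothetical sample).
my $bytes = encode('UTF-8', "sta\x{FB00}");

# Step 1: look at the stored bytes in hex. U+FB00 is EF AC 80 in UTF-8.
my $hex = uc join ' ', unpack '(H2)*', $bytes;    # "73 74 61 EF AC 80"

# Step 2: decode with the encoding you believe the file uses.
# FB_CROAK makes decode die on malformed input instead of guessing;
# LEAVE_SRC keeps $bytes intact.
my $text = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Step 3: inspect what Perl actually holds. Useqq escapes non-ASCII
# characters, so the ligature shows up as a single \x{...} escape.
local $Data::Dumper::Useqq = 1;
local $Data::Dumper::Terse = 1;
my $dump = Dumper($text);
```

If step 2 dies, or step 3 shows several escapes where you expected one character, the file is most likely not in the encoding you assumed.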

Update: Clarified wording a bit.

Replies are listed 'Best First'.
Re^2: Finding substrings of fixed width text with graphemes
by albert (Monk) on Sep 13, 2017 at 16:35 UTC
    I was indeed seeing an encoding issue. When I make sure to use UTF-8, I get exactly the desired behavior. Thanks.
