Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?


by moritz (Cardinal)

on Apr 17, 2010 at 07:36 UTC [id://835241]

⭐ in reply to How do I normalize (e.g. strip) diacritical märks from a Unicode string?

The trick is to split each letter that carries a diacritical mark into its base letter and a separate combining mark, which Unicode::Normalize does with the NFD function. The regex /\pM/ then matches the combining marks (general category M, see perlunicode), so they can be deleted.

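use strict;
use warnings;

use utf8;    # the source contains literal non-ASCII characters

use Unicode::Normalize;

my $s = "söme stüff\n";
$s = NFD($s);    # decompose: "ö" becomes "o" + U+0308

$s =~ s/\pM//g;  # delete every combining mark
print $s;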

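Depending on the application, NFKD may or may not be more appropriate than NFD: NFKD also applies compatibility decompositions (ligatures, superscripts and the like), whereas NFD applies only canonical ones.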
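To illustrate the difference, here is a small sketch. The sample string is my own choice, not something from the original question. It starts with the ligature U+FB01, which has only a compatibility decomposition, so NFD leaves it alone while NFKD turns it into a plain "fi".

```perl
use strict;
use warnings;
use Unicode::Normalize;

# U+FB01 LATIN SMALL LIGATURE FI, then "anc", then U+00E9 (e with acute)
my $s = "\x{FB01}anc\x{E9}";

# Decompose with each form, then delete all combining marks
(my $nfd  = NFD($s))  =~ s/\pM//g;
(my $nfkd = NFKD($s)) =~ s/\pM//g;

# %vX prints the code points in hex, joined by dots, so nothing
# depends on the terminal's encoding
printf "NFD:  %vX\n", $nfd;
printf "NFKD: %vX\n", $nfkd;
```

The NFD line still begins with FB01, while the NFKD line begins with 66.69 (the letters "f" and "i"). Which one you want depends on whether such compatibility characters should survive.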

The code snippet above removes all combining marks, not just one particular diacritic. You can change that by removing only \x{308} (COMBINING DIAERESIS). The following code strips the diaeresis but leaves the other accents, and recomposes the remaining marks with NFC afterwards:

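use strict;
use warnings;

use utf8;    # the source contains literal non-ASCII characters

use Unicode::Normalize;
binmode STDOUT, ':utf8';    # the output still contains accented letters

my $s = "söme stüff with áccènts\n";
$s = NFD($s);        # decompose into base letters + combining marks

$s =~ s/\x{308}//g;  # delete only U+0308 COMBINING DIAERESIS
$s = NFC($s);        # recompose the accents that are left
print $s;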
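If you need to remove more than one specific mark, the same idea can be wrapped in a small helper. This is only a sketch: strip_marks is a hypothetical name of mine, not part of Unicode::Normalize, and it assumes you pass the combining marks as numeric code points.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;

# Hypothetical helper: remove only the listed combining marks from $text.
sub strip_marks {
    my ($text, @codepoints) = @_;
    return $text unless @codepoints;    # nothing to strip

    # Build the inside of a character class, e.g. \x{308}\x{301}
    my $class = join '', map { sprintf '\x{%X}', $_ } @codepoints;

    my $decomposed = NFD($text);        # base letters + combining marks
    $decomposed =~ s/[$class]//g;       # delete just the requested marks
    return NFC($decomposed);            # recompose whatever is left
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("söme stüff with áccènts\n", 0x308);         # diaeresis only
print strip_marks("söme stüff with áccènts\n", 0x308, 0x301);  # diaeresis and acute
```

The first call leaves both accents in "áccènts" alone; the second also removes the acute, so only the grave accent remains.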
