http://www.perlmonks.org?node_id=889135


in reply to Re: Unable to lc upper case accented characters
in thread Unable to lc upper case accented characters

wind++. Excellent recommendation!

D:\>chcp
Active code page: 1252

D:\>type 889023.pl
#!perl
use strict;
use warnings;
use Lingua::EN::NameCase;

binmode DATA, ':encoding(ISO-8859-1)';
binmode STDOUT, ':encoding(Windows-1252)';

while (my $original_name = <DATA>) {
    chomp $original_name;
    my $normalized_name = nc($original_name);
    printf "%30s %s\n", $original_name, $normalized_name;
}

__DATA__
MARILYN MCCORD ADAMS
D'ALEMBERT, JEAN
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
DESCARTES, RENÉ
durkheim, émile
FREUD, SIGMUND
GÖDEL, KURT
þorsteinn gylfason
OLIVER WENDELL HOLMES, JR.
JUNG, CARL
KANT, IMMANUEL
MACHIAVELLI, NICCOLÒ
MARX, KARL
NIETZSCHE, FRIEDRICH
ROUSSEAU, JEAN-JACQUES
SARTRE, JEAN-PAUL
SCHOPENHAUER, ARTHUR
ANNE LOUISE GERMAINE DE STAËL

D:\>perl 889023.pl
          MARILYN MCCORD ADAMS Marilyn McCord Adams
              D'ALEMBERT, JEAN D'Alembert, Jean
          ÉTIENNE DE LA BOÉTIE Étienne de la Boétie
            ÉMILIE DU CHÂTELET Émilie du Châtelet
                 HÉLÈNE CIXOUS Hélène Cixous
               DESCARTES, RENÉ Descartes, René
               durkheim, émile Durkheim, Émile
                FREUD, SIGMUND Freud, Sigmund
                   GÖDEL, KURT Gödel, Kurt
            þorsteinn gylfason Þorsteinn Gylfason
    OLIVER WENDELL HOLMES, JR. Oliver Wendell Holmes, Jr.
                    JUNG, CARL Jung, Carl
                KANT, IMMANUEL Kant, Immanuel
          MACHIAVELLI, NICCOLÒ Machiavelli, Niccolò
                    MARX, KARL Marx, Karl
          NIETZSCHE, FRIEDRICH Nietzsche, Friedrich
        ROUSSEAU, JEAN-JACQUES Rousseau, Jean-Jacques
             SARTRE, JEAN-PAUL Sartre, Jean-Paul
          SCHOPENHAUER, ARTHUR Schopenhauer, Arthur
 ANNE LOUISE GERMAINE DE STAËL Anne Louise Germaine de Staël

D:\>

When I remove the two calls to binmode, the script produces the same output. That's because Lingua::EN::NameCase has use locale in effect. So whereas wind wrote, "It's not going to help you with your special character issue," the truth is that, at least on a Microsoft Windows computer with the right code page and regional (i.e., locale) settings, the module does take care of the character encoding for you. Still, it's better and safer to be explicit about the character encodings in your Perl script.
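Here is one way to make the encodings explicit without leaning on the locale at all. It's a minimal sketch using the core Encode module, not what Lingua::EN::NameCase does internally. The helper name lc_latin1_to_cp1252 is made up for illustration, and I'm assuming the input arrives as ISO-8859-1 bytes and the console expects Windows-1252, as in the session above.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Perl 5.12+: lc uses Unicode rules for every string
use Encode qw(decode encode);

# Hypothetical helper, for illustration only: takes ISO-8859-1 bytes,
# lowercases them as characters, and returns Windows-1252 bytes.
sub lc_latin1_to_cp1252 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> Perl characters
    return encode('cp1252', lc $chars);         # characters -> console bytes
}

# "ÉMILIE DU CHÂTELET" spelled out as ISO-8859-1 bytes, so this source
# file stays pure ASCII and needs no "use utf8".
my $bytes = "\xC9MILIE DU CH\xC2TELET";

print lc_latin1_to_cp1252($bytes), "\n";
```

Once the bytes are decoded into characters, lc no longer depends on the machine's code page or regional settings, so the same script should behave the same way on a computer configured differently.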

The module converts MCCORD to McCord, but it cleverly does not convert MACHIAVELLI to MacHiavelli. Perché no? Machiavelli ends with an i, so the module rightly surmises it's an Italian name. Nice.
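To show the kind of rule at work, here is a simplified sketch of a Mac/Mc heuristic. It is my own approximation, not the module's actual code: the function name fix_mac_prefix is hypothetical, it expects a name that has already been title-cased, and the set of "don't touch" final letters (a, c, i, o, z, j) is an assumption on my part that happens to cover the Machiavelli case.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's real implementation.
# Input: an already title-cased name such as "Mccord" or "Machiavelli".
sub fix_mac_prefix {
    my ($name) = @_;

    # "Mc" is always treated as a prefix. "Mac" is treated as a prefix only
    # when at least three letters follow it and the final letter is not one
    # of a, c, i, o, z, j (assumed here to signal an Italian or similar name).
    if ($name =~ /\bMc[a-z]/ || $name =~ /\bMac[a-z]{2,}[^aciozj]\b/) {
        # Capitalize the first letter after the prefix.
        $name =~ s/\b(Ma?c)([a-z]+)/$1\u$2/g;
    }
    return $name;
}

print fix_mac_prefix($_), "\n" for qw(Mccord Macdonald Machiavelli Mack);
```

This prints McCord, MacDonald, Machiavelli, and Mack: the first two get the inner capital, Machiavelli is spared because of its final i, and Mack is too short to qualify.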

My favorite name in the list is Þorsteinn Gylfason, converted from all lowercase letters, þorsteinn gylfason. (See þorn.info.)