binmode STDOUT, ":utf8"; and umlauts

resistance has asked for the wisdom of the Perl Monks concerning the following question:

Re: binmode STDOUT, ":utf8"; and umlauts by moritz (Cardinal) on Jun 22, 2008 at 19:41 UTC
`:utf8` as an output layer treats everything printed to STDOUT as text strings and converts each code point to UTF-8. But the things you print to STDOUT aren't text strings, they are byte strings. The `ü` in your byte string is not one code point with value 252 (hex `fc`), but instead is two bytes, (as hex) `c3 bc`. Now `:utf8` takes `0xc3` and `0xbc` to be code points, converts each of them to UTF-8, and prints that. Not what you want. You should take care to use either only byte strings or only text strings. I wrote an article that explains that plus lots of background (there's also a German version if you happen to like that better, considering that you're using umlauts as examples). If you want more documentation, read the perluniintro and perlunicode manual pages.
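To make the double encoding visible, here is a minimal sketch. It assumes the file name arrived as the UTF-8 bytes `c3 bc`, and it uses an in-memory filehandle as a stand-in for STDOUT so the written bytes can be inspected. The helper `written_with_utf8_layer` is a made-up name for illustration only.

```perl
use strict;
use warnings;

# Bytes as they might arrive from an external command: "ü" encoded as UTF-8.
my $bytes = "men\xc3\xbc";

# Capture what a :utf8 output layer writes, using an in-memory handle
# (hypothetical helper, standing in for printing to STDOUT).
sub written_with_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>:utf8', \$buffer) or die "cannot open in-memory handle: $!";
    print {$out} $string;
    close($out);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

# Byte string printed as-is: each byte is treated as a code point and encoded again.
my $double_encoded = written_with_utf8_layer($bytes);

# Decode first so the string holds the single code point U+00FC.
my $text = $bytes;
utf8::decode($text) or die "input was not valid UTF-8";
my $encoded_once = written_with_utf8_layer($text);

print "byte string through :utf8: $double_encoded\n";
print "text string through :utf8: $encoded_once\n";
```

The first line should show four bytes after `men` (the mangled umlaut), the second only the two bytes of a correctly encoded `ü`.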
Re: binmode STDOUT, ":utf8"; and umlauts by pc88mxer (Vicar) on Jun 22, 2008 at 20:21 UTC
Besides what moritz points out, there are some other issues present here that you should be aware of. String arguments passed to system calls are treated as binary strings. That is, if you provide a string to a system call, Perl will actually pass its internal representation of the string, and that may not be what you want. For instance, this similar-looking code will probably not perform as expected: `my $file = "men".chr(0xfc); open(FIND, "find ... -name $file|");` As code points, `$file` and `"menü"` are exactly the same. However, the internal representation of the two strings could be very different. The work-around is to `use Encode` to explicitly ensure that the correct encoding is used: `use Encode; my $file = ...; open(FIND, "find ... -name ".encode("utf-8", $file)."|");` Whether or not "utf-8" is correct in this case depends on your OS, how you are using your file system and how `find` is going to interpret the argument.
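As a rough illustration of why the choice of encoding matters, the sketch below encodes the same text string two ways and prints the resulting bytes. It does not run `find`; `hex_bytes` is a hypothetical helper, and ISO-8859-1 is only picked as a contrasting example, not as a recommendation.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $octets;
}

# Text string: the characters m, e, n and U+00FC (ü).
my $file = "men" . chr(0xfc);

# The same characters become different byte sequences depending on the
# encoding, so the argument handed to the external command differs too.
my $utf8_hex   = hex_bytes(encode('UTF-8',      $file));
my $latin1_hex = hex_bytes(encode('ISO-8859-1', $file));

print "UTF-8:      $utf8_hex\n";
print "ISO-8859-1: $latin1_hex\n";
```

Whichever of these matches the bytes actually stored in the file name is the one `find` will match against.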
Re: binmode STDOUT, ":utf8"; and umlauts by ikegami (Patriarch) on Jun 22, 2008 at 19:50 UTC
Either `find` is not producing iso-latin-1 (Perl assumes everything is iso-latin-1 unless you tell it otherwise), or whatever is interpreting the output of your program isn't expecting UTF-8. Using `binmode` on `FIND` would address the first problem, and using the appropriate encoding on `STDOUT` would address the second.
Re^2: binmode STDOUT, ":utf8"; and umlauts by massa (Hermit) on Jun 22, 2008 at 20:42 UTC
If this is a typical, reasonably new Linux distro, it's UTF-8 by default, i.e., you are right. He should have done `open FIND, '-|:encoding(utf8)', 'find /dev/sda2 -name menü'`
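The decode-on-input, encode-on-output idea can be sketched without a real pipe. Below, an in-memory handle stands in for the `find` pipe and another for STDOUT; the assumption that `find` emits UTF-8 is exactly that, an assumption, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Stand-in for the bytes `find` would write to the pipe (UTF-8 assumed).
my $pipe_bytes = "men\xc3\xbc\n";

# Decode while reading, as an :encoding layer on the pipe would.
open(my $in, '<:encoding(UTF-8)', \$pipe_bytes) or die "cannot open input: $!";
my $name = <$in>;
close($in);
chomp $name;

# length() now counts characters, not bytes.
my $chars = length $name;

# Encode while writing, as an :encoding layer on STDOUT would.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "cannot open output: $!";
print {$out} $name;
close($out);

my $round_trip_ok = $out_bytes eq "men\xc3\xbc" ? 1 : 0;
print "characters read: $chars, round trip intact: $round_trip_ok\n";
```

With both layers in place the umlaut is decoded once and encoded once, so the bytes come out the way they went in.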
Re^2: binmode STDOUT, ":utf8"; and umlauts by pc88mxer (Vicar) on Jun 22, 2008 at 22:19 UTC
`find` is simply producing whatever was put there. In Linux, file names are just byte strings which may be interpreted however the user wants. Thus, it is incumbent on the user to decide on a convention for encoding file names and then to stick to that convention. This demonstrates what's going on:

    #!/usr/bin/perl
    system("/bin/rm abc*");
    system("/bin/ls"); # no files begin with "abc"
    my $name = "abc".chr(128);
    open(FOO, ">", $name); close(FOO);
    my $name2 = $name.chr(256); chop $name2;
    print "\$name and \$name2 are ", ($name eq $name2 ? "" : "not "), "equal as perl strings\n";
    open(BAR, ">", $name2); close(BAR);
    system("/bin/ls"); # shows two files beginning with "abc"

Perl is evidently passing its internal representation of `$name` and `$name2` to the operating system's `open()` routine, and the OS is simply using that sequence of bytes as the file name.
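The difference in internal representation can also be inspected without touching the file system. This is only a sketch: it uses `utf8::upgrade` in place of the append-and-chop trick above (an assumption that both have the same effect on storage), and it makes no claim about what any particular Perl version hands to the OS.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name  = "abc" . chr(128);
my $name2 = $name;
utf8::upgrade($name2);    # same characters, but stored internally as UTF-8

# The two strings compare equal as Perl strings ...
my $equal = $name eq $name2 ? 1 : 0;

# ... yet their internal storage differs, as the UTF8 flag reveals.
my $flag1 = utf8::is_utf8($name)  ? 1 : 0;
my $flag2 = utf8::is_utf8($name2) ? 1 : 0;

# Encoding explicitly yields identical bytes regardless of internal storage.
my $same_bytes = encode('UTF-8', $name) eq encode('UTF-8', $name2) ? 1 : 0;

print "equal as strings: $equal\n";
print "UTF8 flag: name=$flag1 name2=$flag2\n";
print "same bytes after explicit encode: $same_bytes\n";
```

That is the argument for encoding file names explicitly before passing them to the OS: the result no longer depends on how Perl happens to store the string.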
Re: binmode STDOUT, ":utf8"; and umlauts by Juerd (Abbot) on Jun 22, 2008 at 19:48 UTC
http://www.perlfoundation.org/perl5/index.cgi?the_utf8_perlio_layer
Re^2: binmode STDOUT, ":utf8"; and umlauts by ikegami (Patriarch) on Jun 23, 2008 at 08:28 UTC
Your own Unicode Advice is a little more strenuous about not using `_utf8_on` and `_utf8_off`, which is what `:utf8` boils down to.
Re^2: binmode STDOUT, ":utf8"; and umlauts by Anonymous Monk on Jun 23, 2008 at 08:03 UTC
That wiki is unreadable without JavaScript.

