binmode STDOUT, ":utf8"; and umlauts

by resistance (Beadle)
on Jun 22, 2008 at 19:33 UTC ( [id://693402]=perlquestion )

resistance has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am using the Unix "find" command to find a file whose name contains an umlaut character.

binmode STDOUT, ":utf8";
open( FIND, "find /media/sda2 -name menü |" );
while(<FIND>){print};


This is what I get with utf8 enabled:
/media/sda2/www/menü

Without utf8, the output is correct.

Can somebody explain this?

Replies are listed 'Best First'.
Re: binmode STDOUT, ":utf8"; and umlauts
by moritz (Cardinal) on Jun 22, 2008 at 19:41 UTC
    :utf8 as an output layer views all output to STDOUT as text strings, converting each codepoint to UTF-8.

    But the things you print to STDOUT aren't text strings, they are byte strings. The ü in your byte string is not one code point with value 252 (0xfc), but instead two bytes, (as hex) c3 bc. Now :utf8 treats those bytes as the codepoints 0xc3 and 0xbc, converts each of them to UTF-8, and prints that (c3 83 c2 bc, which displays as ü). Not what you want.
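To make the double encoding concrete, here is a minimal sketch. It assumes the file system stores names as UTF-8, so that find emits the bytes c3 bc for ü. The helper hex_bytes is a hypothetical name introduced only for this illustration, and Encode's encode stands in for what the :utf8 layer does to each character on output.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a string's characters as two-digit hex values.
sub hex_bytes { join ' ', map { sprintf '%02x', ord($_) } split //, $_[0] }

# The two bytes that find emits for ü, assuming UTF-8 file names.
my $bytes = "\xc3\xbc";

# Encoding an undecoded byte string treats each byte as a code point
# and encodes it again, which is the effect seen in the question.
my $double = encode('UTF-8', $bytes);
print "double-encoded: ", hex_bytes($double), "\n";   # c3 83 c2 bc

# Decoding first yields a text string with one code point, U+00FC.
my $text = decode('UTF-8', $bytes);
printf "characters: %d, code point: U+%04X\n", length($text), ord($text);

# Encoding the text string gives back the original two bytes.
my $single = encode('UTF-8', $text);
print "encoded once:   ", hex_bytes($single), "\n";   # c3 bc
```

The point of the sketch is only that the same two bytes come out right when they are decoded before being printed through an encoding layer, and wrong when they are not.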

    You should take care to use either only byte strings or only text strings. I wrote an article that explains that plus lots of background (there's also a German version if you happen to like that better, considering that you're using umlauts as examples).

    If you want more documentation, read the perluniintro and perlunicode manual pages.

Re: binmode STDOUT, ":utf8"; and umlauts
by pc88mxer (Vicar) on Jun 22, 2008 at 20:21 UTC
    Besides what moritz points out, there are some other issues present here that you should be aware of.

    String arguments passed to system calls are treated as binary strings. That is, if you provide a string to a system call, perl will actually pass its internal representation of the string, and that may not be what you want.

    For instance, this similar-looking code will probably not perform as expected:

    my $file = "men".chr(0xfc);   # U+00FC is ü
    open(FIND, "find ... -name $file|");

    As code points, $file and "menü" are exactly the same. However, the internal representation of the two strings could be very different. The work-around is to use Encode to explicitly ensure that the correct encoding is used:

    use Encode;
    my $file = ...;
    open(FIND, "find ... -name ".encode("utf-8", $file)."|");

    Whether or not "utf-8" is correct in this case depends on your OS, how you are using your file system and how find is going to interpret the argument.
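The difference between "same code points" and "same internal representation" can be sketched as follows. This is an illustration under assumptions, not part of the original reply: the variable names are hypothetical, utf8::upgrade is used to force the UTF-8 internal form, and bytes::length is used only to peek at how many bytes Perl stores internally.

```perl
use strict;
use warnings;
use bytes ();                 # loaded only for bytes::length
use Encode qw(encode);

# Built from a plain literal and chr(0xfc): typically stored as 4 bytes.
my $latin1 = "men" . chr(0xfc);

# Same code points, but forced into Perl's UTF-8 internal form: 5 bytes.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

print "equal as Perl strings\n" if $latin1 eq $upgraded;
printf "internal byte lengths: %d vs %d\n",
    bytes::length($latin1), bytes::length($upgraded);

# Explicit encoding removes the ambiguity: both give the same bytes.
my $arg1 = encode('UTF-8', $latin1);
my $arg2 = encode('UTF-8', $upgraded);
print "encoded arguments match\n" if $arg1 eq $arg2;
```

Passing $arg1 or $arg2 to the external command means it sees the same bytes whichever internal form Perl happened to choose. Whether UTF-8 is the right target encoding still depends on the system, as noted above.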

Re: binmode STDOUT, ":utf8"; and umlauts
by ikegami (Patriarch) on Jun 22, 2008 at 19:50 UTC
    Either find is not producing iso-latin-1 (Perl assumes everything is iso-latin-1 unless you tell it otherwise), or whatever is interpreting the output of your program isn't expecting UTF-8.

    Using binmode on FIND would address the first problem, and using the appropriate encoding on STDOUT would address the second.
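Both fixes together might look like the sketch below. It is written under assumptions: file names are taken to be UTF-8 encoded, a temporary directory stands in for /media/sda2 so the example can run anywhere find is installed, and the list form of the pipe open is a choice made here to avoid shell quoting, not something the replies prescribe.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical scratch directory standing in for /media/sda2.
my $dir = tempdir(CLEANUP => 1);

# The name as a text string, and as UTF-8 bytes for anything sent to the OS.
my $name       = "men\x{fc}";
my $name_bytes = encode('UTF-8', $name);

# Create the file so that find has something to report.
open my $out, '>', "$dir/$name_bytes" or die "cannot create file: $!";
close $out;

# First fix: decode what find prints, so Perl holds text strings.
open my $find, '-|', 'find', $dir, '-name', $name_bytes
    or die "cannot run find: $!";
binmode $find, ':encoding(UTF-8)';
chomp(my @found = <$find>);
close $find;

# Second fix: encode text strings on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @found;
```

With the input decoded, the ü is a single character inside Perl, and the output layer encodes it exactly once.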

      If this is a typical, reasonably new Linux distro, it's UTF-8 by default, i.e., you are right. He should have done:
      open FIND, '-|:encoding(utf8)', 'find /media/sda2 -name menü'
      find is simply producing whatever was put there. In Linux, file names are just byte strings which may be interpreted however the user wants. Thus, it is incumbent on the user to decide on a convention for encoding file names and then to stick to that convention.

      This demonstrates what's going on:

      #!/usr/bin/perl
      system("/bin/rm abc*");
      system("/bin/ls");   # no files begin with "abc"
      my $name = "abc".chr(128);
      open(FOO, ">", $name);
      close(FOO);
      my $name2 = $name.chr(256);
      chop $name2;
      if ($name eq $name2) {
        print "\$name and \$name2 are ", ($name ne $name2 ? "not " : ""), "equal as perl strings\n";
      }
      open(BAR, ">", $name2);
      close(BAR);
      system("/bin/ls");   # shows two files beginning with "abc"
      perl is evidently passing its internal representation of $name and $name2 to the operating system's open() routine, and the OS is simply using that sequence of bytes as the file name.
Re: binmode STDOUT, ":utf8"; and umlauts
by Juerd (Abbot) on Jun 22, 2008 at 19:48 UTC

      Your own Unicode Advice is a little more strenuous about not using _utf8_on and _utf8_off, which is what :utf8 boils down to.

      That wiki is unreadable without javascript

Node Type: perlquestion [id://693402]
Approved by moritz