http://www.perlmonks.org?node_id=862243

eff_i_g has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm working on a Unix (Solaris) system with a file that has non-ASCII characters in its name. When I perform an ls the file name shows as 06_Protection_de_la_tête.xml and my locale is as follows:

LANG=
LC_CTYPE=en_US.ISO8859-1
LC_NUMERIC=en_US.ISO8859-1
LC_TIME=en_US.ISO8859-1
LC_COLLATE=en_US.ISO8859-1
LC_MONETARY=en_US.ISO8859-1
LC_MESSAGES=C
LC_ALL=

When I work with this file in Perl everything is OK until Tk (Tk::ExecuteCommand) enters the scene as shown/described below:

use warnings;
use strict;
use Tk;

my $file = '06_Protection_de_la_tête.xml';
my $cmd = "ls -l $file";

### This works.
print qx($cmd);

### So does this.
open my $F, '<', $file or die $!;
print scalar <$F>;

### This fails.
### An error message states that the file cannot be found and the
### "LATIN SMALL LETTER E WITH CIRCUMFLEX" is shown in UTF-8.
my $mw = MainWindow->new;
my $exec = $mw->ExecuteCommand(
    -command => $cmd,
)->pack;
$exec->execute_command;
$exec->update;
MainLoop;

Here's the erroneous output:

06_Protection_de_la_tête.xml: No such file or directory

I've dug through Tk::ExecuteCommand and I cannot figure out why this error is surfacing.

Any ideas or pointers?

Thanks!

Re: Tk and Non-ASCII File Names
by zentara (Archbishop) on Sep 27, 2010 at 19:53 UTC
    You should try
    use utf8;
    and see if that works.

    But I had that problem before, and graff showed how to decode those pesky filenames like the following.

    #this decode utf8 routine is used so filenames with extended
    # ascii characters (unicode) in filenames, will work properly
    use Encode;
    opendir my $dh, $path or warn "Error: $!";
    my @files = grep !/^\.\.?$/, readdir $dh;
    closedir $dh;
    # @files = map{ "$path/".$_ } sort @files;
    #$_ = decode( 'utf8', $_ ) for ( @files );
    @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;
    or for a single file
    use Encode;
    my $file = decode('utf8', $file);
    from his explanation, that will tell Perl to see it as utf8, even if the filesystem didn't store it as such.

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      zentara,

      use utf8; works with the code given; however, when I incorporate it into the larger program it does not work. I'm mimicking that in the posted script by replacing

      my $file = '06_Protection_de_la_tête.xml';
      with
      use File::Find::Rule;
      my @files = File::Find::Rule->file->name('*.xml')->in('.');
      my $file = shift @files;

      If I follow this with

      use Encode;
      my $file = decode('utf8', $file);
      then all of the non-Tk lines break:
      06_Protection_de_la_t�te.xml: No such file or directory
      Argh!

      Do you suspect this to be Tk-related since the other commands work fine and there is an open bug related to this matter? Tk::ExecuteCommand builds the command by appending to the command I pass:

      $self->{-command} . ' 2>&1 |'

      Could this concatenation be changing the internal encoding of the string no matter what I send to the module? As I noted in my other reply, I can get all of this working if I modify a copy of the module and use utf8::downgrade, but I don't know if it's wise to change the one in production.
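The question about concatenation can be explored outside Tk with a small sketch. It is not the code of Tk::ExecuteCommand; it only assumes that, somewhere along the way, the command string ends up in Perl's upgraded (internally UTF-8) form, and shows what utf8::upgrade, utf8::downgrade, and utf8::is_utf8 do to such a string. The file name and command here are shortened stand-ins.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

# Assumption: the file name is held as Latin-1 bytes (0xEA for the "ê").
my $name = "t\xEAte.xml";

# A copy in Perl's upgraded internal form, as can happen when a string
# passes through code that stores text as UTF-8 internally.
my $upgraded = $name;
utf8::upgrade($upgraded);

# Both strings hold the same characters, so they compare equal.
my $same_chars = $name eq $upgraded;

# Interpolating an upgraded string yields an upgraded result, even though
# the text appended afterwards is plain ASCII.
my $cmd = "ls -l $upgraded" . ' 2>&1 |';

my $chars        = length $cmd;
my $bytes_before = bytes::length($cmd);   # 0xEA is stored as 0xC3 0xAA
my $flag_before  = utf8::is_utf8($cmd);

# utf8::downgrade restores one byte per character. It dies if the string
# contains a character above 0xFF.
utf8::downgrade($cmd);
my $bytes_after = bytes::length($cmd);
my $flag_after  = utf8::is_utf8($cmd);

printf "same characters:   %s\n", $same_chars ? 'yes' : 'no';
printf "before downgrade:  %d chars, %d internal bytes, flag %s\n",
    $chars, $bytes_before, $flag_before ? 'on' : 'off';
printf "after downgrade:   %d chars, %d internal bytes, flag %s\n",
    $chars, $bytes_after, $flag_after ? 'on' : 'off';
```

If the upgraded form is what reaches the shell, the file name would go out as UTF-8 bytes rather than Latin-1 bytes, which would match the "No such file or directory" symptom. Whether that is what happens inside Tk is not confirmed by this sketch.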

        Do you suspect this to be Tk-related since the other commands work fine and there is an open bug related to this matter?

        That sounds plausible, but hopefully a Unicode expert like graff will weigh in. ( Maybe private msg graff and ask him to look at it? ) I'm a provincial American who seldom deals with non-ascii filenames. :-)

        I would first try reading the directories and printing the list to a Tk text box, and see if there are any name changes.


        I don't have any way to test in an environment that matches yours, but based on what you've posted so far, it would appear that your locale settings and non-ascii file names are "consistent" -- both involve a single-byte-per-character encoding for "vanilla" European (8859-1, i.e. Latin-1).

        So your Encode::decode call should specify that the string being passed to it needs to be decoded from that encoding:

        my $file = decode( 'iso-8859-1', $file );
        Try that and see if it helps. The return value should be a valid utf8 string with the accented "e" rendered as intended, because the value being passed in $file is a valid 8859-1 string.

        When you passed 'utf8' as the first arg to decode(), perl was being told to expect utf8 data in $file, but the single non-ascii byte there was not parsable as utf8. What you got in place of it was the unicode "REPLACEMENT CHARACTER" (U+FFFD), which, when rendered as utf8 data, is the three-byte sequence "0xef 0xbf 0xbd". That sequence, when played through a Latin-1 display window, yields the three goofy characters that you got.
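The difference between the two decode calls can be checked in isolation. The sketch below assumes the name arrives as raw ISO-8859-1 bytes, as the locale settings suggest, and uses a hard-coded byte string in place of a real directory read.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the name comes from the filesystem as raw Latin-1 bytes,
# so the "ê" is the single byte 0xEA.
my $raw = "06_Protection_de_la_t\xEAte.xml";

# Decoding with the matching encoding yields U+00EA for the accented "e".
my $latin1 = decode('iso-8859-1', $raw);

# Decoding the same bytes as utf8: 0xEA announces a multi-byte sequence,
# but the bytes after it are not continuation bytes, so Encode (in its
# default, non-strict mode) substitutes U+FFFD.
my $wrong = decode('utf8', $raw);

# U+FFFD written out as UTF-8, shown as hex bytes.
my $fffd_bytes = join ' ',
    map { sprintf '%02x', ord } split //, encode('UTF-8', "\x{FFFD}");

printf "iso-8859-1 decode contains U+00EA: %s\n",
    ($latin1 =~ /\x{EA}/ ? 'yes' : 'no');
printf "utf8 decode contains U+FFFD:       %s\n",
    ($wrong =~ /\x{FFFD}/ ? 'yes' : 'no');
print "U+FFFD as UTF-8 bytes:             $fffd_bytes\n";
```

Passing a CHECK argument such as Encode::FB_CROAK as the third argument to decode() makes it die on malformed input instead of substituting silently, which can help when the source encoding is in doubt.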

Re: Tk and Non-ASCII File Names
by perl-diddler (Chaplain) on Sep 27, 2010 at 17:44 UTC
    The first thing that comes to my mind is that Perl uses UTF-8. Are you converting from your localized character set into UTF-8 (or vice-versa)? I.e., your string in 'my $file' will be taken as a UTF-8 string. If you want a non-UTF-8 string, maybe (untested):
    { use locale; my $file = '06_Protection_de_la_tête.xml'; }
    See the 'perllocale' man page for more info on this topic...
      Diddler,

      This sample is part of a larger program. I'm using File::Find::Rule to retrieve file names and then I incorporate them into commands and pass them through Tk::ExecuteCommand.

      I found this bug, but whatever I do (locale, encode, decode, utf8::upgrade, utf8::downgrade) doesn't work.

      I made a temporary library, copied Tk::ExecuteCommand into it, added utf8::downgrade to the command, and this works! Now I'm wondering if it's possible to get it working without changing the module or if it's wise to change the module itself.