Tk and Non-ASCII File Names

by eff_i_g (Curate)
on Sep 27, 2010 at 16:36 UTC (#862243, perlquestion)
eff_i_g has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm working on a Unix (Solaris) system with a file that has non-ASCII characters in its name. When I perform an ls the file name shows as 06_Protection_de_la_tête.xml and my locale is as follows:

LANG=
LC_CTYPE=en_US.ISO8859-1
LC_NUMERIC=en_US.ISO8859-1
LC_TIME=en_US.ISO8859-1
LC_COLLATE=en_US.ISO8859-1
LC_MONETARY=en_US.ISO8859-1
LC_MESSAGES=C
LC_ALL=

When I work with this file in Perl everything is OK until Tk (Tk::ExecuteCommand) enters the scene as shown/described below:

use warnings;
use strict;
use Tk;

my $file = '06_Protection_de_la_tête.xml';
my $cmd = "ls -l $file";

### This works.
print qx($cmd);

### So does this.
open my $F, '<', $file or die $!;
print scalar <$F>;

### This fails.
### An error message states that the file cannot be found and the
### "LATIN SMALL LETTER E WITH CIRCUMFLEX" is shown in UTF-8.
my $mw = MainWindow->new;
my $exec = $mw->ExecuteCommand(
    -command => $cmd,
)->pack;
$exec->execute_command;
$exec->update;
MainLoop;

Here's the erroneous output:

06_Protection_de_la_tête.xml: No such file or directory

I've dug through Tk::ExecuteCommand and I cannot figure out why this error is surfacing.

Any ideas or pointers?

Thanks!

Re: Tk and Non-ASCII File Names
by perl-diddler (Hermit) on Sep 27, 2010 at 17:44 UTC
    The first thing that comes to my mind is that Perl uses UTF-8. Are you converting from your localized character set into UTF-8 (or vice versa)? I.e., your string in 'my $file' will be taken as a UTF-8 string. If you want a non-UTF-8 string, maybe (untested):
    { use locale; my $file = '06_Protection_de_la_tête.xml'; }
    See the 'perllocale' man page for more info on this topic...
      Diddler,

      This sample is part of a larger program. I'm using File::Find::Rule to retrieve file names and then I incorporate them into commands and pass them through Tk::ExecuteCommand.

      I found this bug, but whatever I do (locale, encode, decode, utf8::upgrade, utf8::downgrade) doesn't work.

      I made a temporary library, copied Tk::ExecuteCommand into it, added utf8::downgrade to the command, and this works! Now I'm wondering if it's possible to get it working without changing the module or if it's wise to change the module itself.
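
      A minimal sketch of what utf8::downgrade changes, using a hypothetical stand-in for the real command string. It does not involve Tk; it only shows that the same characters can be held internally either one byte per character or as UTF-8, and that utf8::downgrade switches back to the former. Perl generally hands the internal bytes to the operating system for commands and file names, which would explain why the downgrade makes a difference, though that link to Tk::ExecuteCommand is an assumption here:

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length() without enabling byte semantics

# Hypothetical command string; \xEA is e-circumflex in ISO-8859-1.
my $cmd = "ls -l 06_Protection_de_la_t\xEAte.xml";

# Same characters, but stored internally as UTF-8 (flag on).
my $upgraded = $cmd;
utf8::upgrade($upgraded);

# Back to one byte per character (flag off). utf8::downgrade() dies
# if the string holds a character above 0xFF.
my $downgraded = $upgraded;
utf8::downgrade($downgraded);

for my $pair ([upgraded => $upgraded], [downgraded => $downgraded]) {
    my ($label, $str) = @$pair;
    printf "%-10s flag=%d chars=%d internal bytes=%d\n",
        $label, (utf8::is_utf8($str) ? 1 : 0), length($str), bytes::length($str);
}
print "equal as Perl strings: ", ($upgraded eq $downgraded ? "yes" : "no"), "\n";
```

      The two strings compare equal inside Perl, yet the upgraded one carries one extra internal byte for the accented character. That extra byte is what an ISO-8859-1 file system would fail to match.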

Re: Tk and Non-ASCII File Names
by zentara (Archbishop) on Sep 27, 2010 at 19:53 UTC
    You should try
    use utf8;
    and see if that works.

    But I had that problem before, and graff showed how to decode those pesky filenames like the following.

    # This decode-utf8 routine is used so that filenames with extended
    # ASCII characters (Unicode) will work properly.
    use Encode;
    opendir my $dh, $path or warn "Error: $!";
    my @files = grep !/^\.\.?$/, readdir $dh;
    closedir $dh;
    # @files = map { "$path/".$_ } sort @files;
    # $_ = decode( 'utf8', $_ ) for ( @files );
    @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;

    or for a single file

    use Encode;
    my $file = decode('utf8', $file);

    From his explanation, that will tell Perl to see it as utf8, even if the filesystem didn't store it as such.

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      zentara,

      use utf8; works with the code given; however, when I incorporate it into the larger program it does not work. I'm mimicking that in the posted script by replacing

      my $file = '06_Protection_de_la_tête.xml';
      with
      use File::Find::Rule;
      my @files = File::Find::Rule->file->name('*.xml')->in('.');
      my $file = shift @files;

      If I follow this with

      use Encode;
      my $file = decode('utf8', $file);
      then all of the non-Tk lines break:
      06_Protection_de_la_t�te.xml: No such file or directory
      Argh!

      Do you suspect this to be Tk-related since the other commands work fine and there is an open bug related to this matter? Tk::ExecuteCommand builds the command by appending to the command I pass:

      $self->{-command} . ' 2>&1 |'

      Could this concatenation be changing the internal encoding of the string no matter what I send to the module? As I noted in my other reply, I can get all of this working if I modify a copy of the module and use utf8::downgrade, but I don't know if it's wise to change the one in production.
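
      A small sketch of how concatenation interacts with Perl's internal encoding, using a hypothetical command string. Joining two plain byte strings leaves the result as bytes, but if either operand is already flagged as UTF-8, the whole result is upgraded. Whether the command or the suffix is flagged by the time Tk::ExecuteCommand joins them is an assumption that this thread does not confirm; the upgrade below only simulates it:

```perl
use strict;
use warnings;

# Hypothetical command as a byte string; \xEA is e-circumflex in ISO-8859-1.
my $cmd    = "ls -l 06_Protection_de_la_t\xEAte.xml";
my $suffix = ' 2>&1 |';

# Both operands are plain byte strings: the result stays a byte string.
my $plain = $cmd . $suffix;

# Assumption: simulate an operand that arrives flagged as UTF-8.
my $flagged_suffix = $suffix;
utf8::upgrade($flagged_suffix);

# One flagged operand is enough to upgrade the whole result,
# so \xEA is now stored internally as two bytes.
my $mixed = $cmd . $flagged_suffix;

printf "plain concat flagged: %s\n", utf8::is_utf8($plain) ? 'yes' : 'no';
printf "mixed concat flagged: %s\n", utf8::is_utf8($mixed) ? 'yes' : 'no';
```

      If the second case is what happens inside the module, a downgrade applied before the call would be undone by the join, which would be consistent with the fix only working from inside the module.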

        Do you suspect this to be Tk-related since the other commands work fine and there is an open bug related to this matter?

        That sounds plausible, but hopefully a Unicode expert like graff will weigh in. (Maybe private msg graff and ask him to look at it?) I'm a provincial American, who seldom deals with non-ASCII filenames. :-)

        I would first try reading the directories and printing the list to a Tk text box, and see if there are any name changes.


        I'm not really a human, but I play one on earth.
        Old Perl Programmer Haiku ................... flash japh
        I don't have any way to test in an environment that matches yours, but based on what you've posted so far, it would appear that your locale settings and non-ascii file names are "consistent" -- both involve a single-byte-per-character encoding for "vanilla" European (8859-1, i.e. Latin-1).

        So your Encode::decode call should specify that the string being passed to it needs to be decoded from that encoding:

        my $file = decode( 'iso-8859-1', $file );
        Try that and see if it helps. The return value should be a valid utf8 string with the accented "e" rendered as intended, because the value being passed in $file is a valid 8859-1 string.

        When you passed 'utf8' as the first arg to decode(), perl was being told to expect utf8 data in $file, but the single non-ASCII byte there was not parsable as utf8, so what you got in place of it was the Unicode "REPLACEMENT CHARACTER" (U+FFFD). Rendered as utf8 data, that character is the three-byte sequence "0xef 0xbf 0xbd", and that sequence, when played through a Latin-1 display window, yields the three goofy characters that you got.
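
        The difference between the two decode() calls can be checked without touching the file system. This is a minimal sketch with a shortened, hypothetical file name given as raw Latin-1 bytes; it prints code points and hex bytes only, so the terminal's own encoding does not get in the way:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical, shortened file name as raw Latin-1 bytes;
# \xEA is e-circumflex in ISO-8859-1.
my $bytes = "t\xEAte.xml";

# Show each byte of a byte string as two hex digits.
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, shift }

# Wrong source encoding: the lone 0xEA byte is not valid UTF-8,
# so decode() substitutes a replacement character for it.
my $wrong = decode('utf8', $bytes);

# Right source encoding: every Latin-1 byte maps to the same code point.
my $right = decode('iso-8859-1', $bytes);

for my $pair ([wrong => $wrong], [right => $right]) {
    my ($label, $str) = @$pair;
    my $char = substr($str, 1, 1);   # the character after the leading "t"
    printf "%s: U+%04X -> UTF-8 bytes %s\n",
        $label, ord($char), hex_bytes(encode('UTF-8', $char));
}
```

        The "wrong" line should report U+FFFD with bytes ef bf bd, and the "right" line U+00EA with bytes c3 aa.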
