
directories and charsets

by soliplaya (Beadle)
on Mar 15, 2007 at 11:30 UTC ( #604964=perlquestion )
soliplaya has asked for the wisdom of the Perl Monks concerning the following question:

Dear perl Monks,

This is a question about the character sets of directory entries in various circumstances, about access to those entries from Perl, and probably also about Perl's internal utf8 flag, the Encode module and so on.

Suppose that in a network I have two interconnected hosts, each with perl 5.8.7 installed:
- hostU is a Unix/Linux host
- hostW is a Windows host

On each of these hosts, I have a "local" and a "remote" directory :
- "local" is a real, physically local directory. e.g. :
- on hostW, it is a directory "C:\mydir"
- on hostU, it is a directory "/var/localdir"
- "remote" is a directory mounted from the other system, e.g. :
- on hostW, I mount a "drive" Z:, which is really a directory "/var/share/driveZ" of hostU, shared via Samba.
- on hostU, I use SMBFS or CIFS to mount on "/mnt/hostW/dirW", a "share" defined on hostW. Suppose on hostW, this is really the directory "C:\share\dirW".

Now in each of these 4 directories, some user (connecting from a remote location in some way) creates a new file named "München" (just to have that nasty u umlaut character in the name). We thus have 4 directories, each containing a file which, to the user who created it, appears to be named "München".

Now suppose a perl script, the same on each of the machines hostU and hostW, which reads each of the 2 directories to which it has access (its own "local" and its own "remote"). It reads the directory using the perl opendir() and readdir() functions.

And the question is: how should the script handle whatever it receives back from readdir(), in terms of content? And what will this content be in each of the 4 cases above? In other words, will perl return strings which have or do not have the internal "utf8" flag set, and will the string contain a 1-byte or a 2-byte representation of the u umlaut?

Believe it or not, I have read the entire utf-8 related perl documentation, and the Encode module, and I have even tried the above in real perl scripts, without being able to make sense of the results. I get what looks to me like strange results, such as being able to read a directory entry with $filename = readdir(), but then getting "false" from a (-f $filename) test.

Thank you in advance for any light and wisdom on the matter.

Replies are listed 'Best First'.
Re: directories and charsets
by Moron (Curate) on Mar 15, 2007 at 13:29 UTC
    A couple of possible traps:

    1) readdir returns only the filename whereas -f needs (in general) the path appended to the front. Example:

        sub Traverse {
            my $dir = shift;
            opendir my $dh, $dir or return 0;
            for my $file ( grep !/^\.\.?$/, readdir $dh ) {
                my $path = "$dir/$file";
                if ( -d $path ) {
                    Traverse( $path, @_ );
                }
                elsif ( -f $path ) {
                    ProcessFile( $path, @_ );
                }
                else {
                    warn "$path not -d or -f\n";
                }
            }
            closedir $dh;
        }

    2) line endings (carriage control) will differ between the two systems, but there is a dos program called unix2dos as well as a unix program called dos2unix, and these translate accordingly. For example:

        sub ProcessFile {
            my ( $path, $windows ) = @_;
            my $pid = 0;
            if ( $windows ) {
                open \*FH, "<$path" or die "$!: $path";
            }
            else {
                $pid = open \*FH, "dos2unix $path |" or die "$!: dos2unix $path |";
            }
            my $fh = \*FH;
            # file content (<$fh>) is now unix/dos transparent
            close $fh;
            $pid and waitpid $pid, 0;
        }


    Free your mind

      Thank you for the Traverse sub.

      Frustratingly, this sub works fine when traversing my file tree, while my own humongous program does not: it chokes on any filename that contains "non-ascii" characters, and I cannot yet find what I am doing differently.

      Except that I am doing quite a bit of concatenation of strings and storage in intermediate variables, and I suppose that somewhere perl's smart handling of automatic internal string conversion to utf8-when-needed is biting me.

      My real program scans several file trees, remembers in each the oldest file (storing the path in a table), then sorts the table on some formula, and passes the "best" path to another portion of the program for processing. It is in that other part of the program, when trying to use the path to open the file, that the problem occurs.
      What I imagine is that somewhere along the line, what was initially read as a "bytes" entry, becomes an internal "utf8" string, and then open() or stat() do not recognise that filename anymore. Does that make sense ?

        The Traverse sub does plenty of assigning of the filenames too - readdir has to work on these first, then it's loaded into $file, then that forms part of $path, so it can't be that!

        I strongly suggest using perl -d on your program - you can then use the x, W, and b and r commands to track down what is really happening to this data.
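
        Alongside the debugger, a small helper can show what a string really contains at each stage. This is only a sketch: describe() is a hypothetical name, not something from the original program. It reports whether Perl's internal utf8 flag is set and lists the code points in hex, so two names that print identically can still be told apart.

```perl
use strict;
use warnings;

# Hypothetical helper: report the state of Perl's internal utf8 flag
# and the code point of every character in the string, in hex.
sub describe {
    my ($str) = @_;
    my $flag = utf8::is_utf8($str) ? 'utf8' : 'bytes';
    my $hex  = join ' ', map { sprintf '%02x', ord } split //, $str;
    return "($flag) $hex";
}

# "München" as UTF-8 encoded bytes, utf8 flag off
my $name = "M\xc3\xbcnchen";
print describe($name), "\n";    # prints: (bytes) 4d c3 bc 6e 63 68 65 6e
```

        Calling it right after readdir, after each concatenation, and just before open() or -f should show where the name changes.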

        Apart from that, I would say that Traverse works because it's too simple not to :)


        Free your mind

Re: directories and charsets
by jbert (Priest) on Mar 15, 2007 at 15:10 UTC
    I don't know about the problems you're having with network filesystems, other than that it is the job of the filesystem to convert as necessary. Windows NTFS is going to be using 2-byte UCS-2 (a close relative of UTF-16) to store the filenames on disk, but Linux generally uses utf8 filenames.

    That, however, is the job of smbfs and samba to sort out.

    You should just be able to read and write utf8 filenames, as in the code below, however I get failed tests for #13 and 15. This is presumably because the filenames returned from 'glob' and 'readdir' *don't* have the utf8 flag on.

    Do any monks have some more info on this? If I read a filename from a utf8 filesystem, should the filename have the utf8 flag on? (ASCII-exception permitting, of course).

    perl 5.8.8

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Test::More(tests => 16);    # 8 tests per call, 2 calls (was 14)
        use Encode;

        binmode STDOUT, ':utf8';    # If you have a UTF-8 terminal

        my $workdir = "./tt";
        mkdir $workdir;             # Let it fail if it already exists

        # This is a byte sequence, not tagged as utf8 to perl
        # so theoretically perl should consider it to be in the local
        # encoding, normally latin1
        my $place = "M\xc3\xbcnchen";
        test_placename($workdir, $place);

        # Turn on the flag for this scalar. Since we pre-arranged for
        # the byte sequence of this scalar to contain valid utf8, this
        # scalar is now a valid perl unicode string.
        Encode::_utf8_on($place);
        test_placename($workdir, $place);
        exit 0;

        sub test_placename {
            my $workdir = shift;
            my $place   = shift;
            my $fname   = "$workdir/$place";
            my $fh;

            ok(!-f $fname, "$fname doesn't already exist");
            open($fh, ">", $fname) or die "Can't create $fname : $!";
            close $fh;
            ok(1, "can create $fname with 'open'/close");
            ok(-f $fname, "can find $fname with -f");

            my @files = glob("$workdir/$place");
            is(scalar @files, 1, "One file in dir via glob");
            is($files[0], $fname, "and it's what we expect");

            my $dh;
            opendir $dh, $workdir or die "Can't open $workdir : $!";
            @files = grep { !/^\./ } readdir $dh;
            closedir $dh;
            is(scalar @files, 1, "One file in dir via readdir");
            is($files[0], $place, "and it's what we expect");

            my $num_files_unlinked = unlink($fname);
            is($num_files_unlinked, 1, "can remove $fname");
        }

      Yesssss, thank you !

      That is exactly the kind of problem I was talking about in my first convoluted message.

      From the documentation (perl Unicode etc..) and from my personal tests, it would seem that readdir() always returns strings that are "bytes" (not internally marked as "utf8" by Perl). This is per the Encode::is_utf8($dir_entry) function.
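
      That observation can be checked with a short script. This is a minimal sketch, assuming a filesystem that accepts the raw bytes below as a file name; the temporary directory and the test name are made up for the illustration. It prints each entry as hex together with the state of the flag.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory, removed automatically when the script exits
my $dir = tempdir(CLEANUP => 1);

# "München" as UTF-8 bytes; assumes the filesystem accepts these bytes
my $name = "M\xc3\xbcnchen";
open my $fh, '>', "$dir/$name" or die "Cannot create file: $!";
close $fh;

opendir my $dh, $dir or die "Cannot open $dir: $!";
my @entries = grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

# Show the raw bytes of each entry and whether the utf8 flag is set
for my $entry (@entries) {
    printf "%s flag=%s\n", unpack('H*', $entry),
        utf8::is_utf8($entry) ? 'on' : 'off';
}
```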

      However, at some point, after concatenating that directory entry with, for instance, the directory path it came from, an "if (-f $fullpath)" test returns false.

      I was now testing on a Windows machine, and I thought that Windows NTFS was storing filenames as UTF-8. But you seem to say that this is not true, and that it is UCS-2 instead. That might explain why, when trying various permutations and encodings or decodings of my filenames, I am getting errors.

      Back to testing thus, with this exciting new possibility..

        If you concatenate a utf8-tagged string with a non-utf8 tagged string, perl will silently "upgrade" the non-utf8 string to utf8, converting it under the assumption that it is in the local encoding (normally latin1, but might be settable with locale, PERL_ENCODING env or similar).

        There is a module to warn you when this happens (can't remember what it's called though).

        If such an untagged string already contains utf8 byte sequences, this will give you an incorrect double-encoding of the string.
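
        The double-encoding can be seen without touching the disk. A minimal sketch, with made-up names: $entry stands in for what readdir returns, and $top for a directory name that arrived as a utf8-tagged string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $entry = "M\xc3\xbcnchen";                  # UTF-8 bytes, utf8 flag off
my $top   = decode('UTF-8', "caf\xc3\xa9");    # character string, utf8 flag on

# Plain concatenation: $entry is upgraded as if it were latin1,
# so \xc3 and \xbc each become a separate character.
my $wrong = "$top/$entry";

# Decoding first keeps the u umlaut as a single character.
my $right = $top . '/' . decode('UTF-8', $entry);

# Compare the UTF-8 bytes each path would produce.
printf "wrong: %s\n", unpack 'H*', encode('UTF-8', $wrong);
printf "right: %s\n", unpack 'H*', encode('UTF-8', $right);
```

        In the "wrong" line the u umlaut comes out as c3 83 c2 bc instead of c3 bc, which is a different file name as far as the filesystem is concerned.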

        It seems to me that one way to get the right behaviour is to do:

        my @files = map { Encode::_utf8_on($_); $_ } readdir DIRHANDLE;
        when reading names from a utf8-named-filesystem.

        I could be wrong on the NTFS thing, it's just that UCS-2 (UTF-16-a-like) is *very* entrenched on Windows, and I'd be very surprised if NTFS wasn't using that as its native format. (Of course, you may well see it as utf8 when you mount the share with smbfs. I'd expect smbfs to do that translation for you, but maybe it's a mount option or something.)

      I believe here is the shortest expression of what the problem might be :
          #!/usr/bin/perl
          use strict;
          use warnings;
          use Encode;

          my $topdir;
          if (scalar(@ARGV)) {
              $topdir = shift @ARGV;
          } else {
              print "Enter top dir : ";
              $topdir = <>;
              chomp $topdir;
          }
          warn("top directory [$topdir] : ", (Encode::is_utf8($topdir) ? '(utf8)' : '(bytes)'));
          unless (opendir(DIR, $topdir)) {
              die("Could not open it : $!");
          }
          closedir DIR;
          warn "everything ok";
          exit 0;

      If you try this in a Windows command-line, after creating a directory with a non-ascii character in the name (suppose "München" for a change), and try it consecutively as :
      perl dirname
      you should see the kind of problem I'm having.

      This might be the deep cause of my problems, because in the real program, I am getting the name of the top directory of my tree by parsing a parameter file, and they come to perl as utf8 strings. But the subdirectory names that I read from the disk, come in as bytes. Now when I concatenate both to get a full filename, I believe I have a problem.

        Sorry, don't have a windows perl to hand.

        I agree that the problem is your subdirectory names coming in bytes. You need to know their charset, then call Encode::decode to map them from the appropriate charset (probably utf8 or UCS-2) into perl characters.

        If you hex dump the bytes and take a look, you should be able to work out what encoding you're getting back from readdir on the different platforms. Then do:

            my $encoding = "xxx";    # Probably 'UTF-8' or 'UTF-16LE' for windows
            my @files = map { Encode::decode($encoding, $_) } readdir DIR;

        Your scalars in @files will then be kosher perl unicode strings, and when they are concatenated with the unicode strings you are getting from your parameter file all should be well.
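
        One caveat, offered as a sketch and not as something established in this thread: Perl does not encode file names for you, so it may be safer to encode the finished path back to bytes before handing it to open() or -f. The example assumes the filesystem encoding is UTF-8; $fs_encoding and the temporary directory are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Assumption: the encoding in which readdir hands back names
my $fs_encoding = 'UTF-8';

my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/M\xc3\xbcnchen" or die "Cannot create file: $!";
close $fh;

# Decode every entry into a Perl character string
opendir my $dh, $dir or die "Cannot open $dir: $!";
my @names = map { decode($fs_encoding, $_) }
            grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

my @found;
for my $name (@names) {
    # Work with characters while building the path ...
    my $path = decode($fs_encoding, $dir) . "/$name";
    # ... and turn it back into bytes for the system call
    my $disk = encode($fs_encoding, $path);
    push @found, $name if -f $disk;
}
print scalar(@found), " file(s) found\n";
```

        Keeping one place where names are decoded and one where they are encoded means every string in between is a character string, so concatenation cannot silently mix the two kinds.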

        Good luck.

Re: directories and charsets
by izut (Chaplain) on Mar 15, 2007 at 14:16 UTC

    I think it depends on how you are using the CIFS share. As far as I remember, Samba has some configuration settings for that. Traversing the directories should be transparent for you, AFAIK.

    About the utf8 flags and Encode, you should avoid using them. Put "use utf8;" in your program only if its source is written in utf8; that is all it is for. Perl should do the translation for you automatically in most cases.

    Igor 'izut' Sutton
    your code, your rules.
