http://www.perlmonks.org?node_id=604964

soliplaya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

This is a question about the character sets used for directory entries in various circumstances, and about accessing those entries from Perl. It probably also involves Perl's internal utf8 flag, the Encode module and so on.

Suppose I have two networked hosts, each with Perl 5.8.7 installed:
- hostU is a Unix/Linux host
- hostW is a Windows host

On each of these hosts, I have a "local" and a "remote" directory:
- "local" is a real, physically local directory, e.g.:
  - on hostW, it is the directory "C:\mydir"
  - on hostU, it is the directory "/var/localdir"
- "remote" is a directory mounted from the other system, e.g.:
  - on hostW, I mount a "drive" Z:, which is really the directory "/var/share/driveZ" on hostU, shared via Samba.
  - on hostU, I use SMBFS or CIFS to mount, on "/mnt/hostW/dirW", a "share" defined on hostW. Suppose that on hostW this is really the directory "C:\share\dirW".

Now, in each of these 4 directories, some user (connecting from a remote location in some way) creates a new file named "München" (just to have that nasty u umlaut character in the name). We thus have 4 directories, each containing a file whose name looks like "München" to the user who created it.

Now suppose a Perl script, the same on both hostU and hostW, reads each of the 2 directories to which it has access (its own "local" and its own "remote"). It reads each directory using Perl's opendir() and readdir() functions.

The question is: how should the script handle whatever it receives back from readdir(), in terms of content? And what will this content be in each of the 4 cases above? In other words, will Perl return strings that have the internal "utf8" flag set or not, and will each string contain the 1-byte or the 2-byte representation of the u umlaut?
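To make the question concrete, here is a minimal sketch of how one might inspect what readdir() hands back. It only reports; it does not decode or convert anything. The helper name describe_entry and the directory '.' are placeholders chosen for illustration, not part of any real script.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether Perl's internal utf8 flag is set on
# a string, and list the value of each character as hex.
sub describe_entry {
    my ($name) = @_;
    my $flag = utf8::is_utf8($name) ? 'on' : 'off';
    my $hex  = join ' ', map { sprintf '%02X', ord($_) } split //, $name;
    return "utf8 flag $flag: $hex";
}

# Placeholder directory; substitute one of the four directories above.
my $dir = '.';
opendir(my $dh, $dir) or die "Cannot open $dir: $!";
for my $entry (readdir $dh) {
    next if $entry eq '.' or $entry eq '..';
    print describe_entry($entry), "\n";
}
closedir $dh;
```

For reference, the u umlaut is the single byte FC in ISO-8859-1 and the two bytes C3 BC in UTF-8, so the hex column shows which representation a given directory delivered.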

Believe it or not, I have read all of the UTF-8 related Perl documentation, including that of the Encode module, and I have even tried the above in real Perl scripts, without being able to make sense of the results. I get what look to me like strange results, such as being able to read a directory entry with $filename = readdir(), but then getting "false" from a (-f $filename) test.
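One possible cause of a false -f result that has nothing to do with encodings, and which may or may not be what is happening here: readdir() returns bare entry names without the directory, so (-f $filename) is resolved against the current working directory. A minimal sketch that rules this out by prepending the directory before testing (plain_files is a name made up for this example):

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the names of the plain files in $dir,
# testing each entry with its directory prepended, not as a bare name.
sub plain_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @files = grep { -f File::Spec->catfile($dir, $_) } readdir $dh;
    closedir $dh;
    return @files;
}

# Placeholder directory; substitute one of the four directories above.
print "$_\n" for plain_files('.');
```

If -f still fails with the full path, the name itself (its bytes as passed back to the file system) becomes the more likely suspect.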

Thank you in advance for any light and wisdom on the matter.