http://www.perlmonks.org?node_id=935347

nikosv has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, this is my first post in a long time.

What is the state of the directory operators (opendir,readdir) regarding Unicode support?

I am under the impression that there is still no support for reading directories and files on a Unicode enabled file system.

For example, on Win32 the non-Unicode directory operators receive file names only after they have been converted from UTF-16 to ANSI using the system default code page, although the filesystem itself stores names as UTF-16.

Thus, to read Unicode directories and files I am using Win32::OLE (COM), which can return UTF-8. There is also the module Win32-Unicode-0.26, which I have not tried.

So, after this long introduction, my questions are: why are those operators, unlike the file-handling operations, not Unicode-enabled (not necessarily by default; a pragma could enable it)? Does the same hold for other OSes, e.g. Linux? And what are the suggested workarounds?

thanks

Re: Directory operations and Unicode
by moritz (Cardinal) on Nov 02, 2011 at 10:40 UTC

    The problem is that Win32 seems to be the only platform that offers a Unicode-aware file system API.

    On Linux, file names are NUL-terminated byte strings; their interpretation is left to userland. One can guess the encoding based on the locale, or just assume one encoding globally, but neither approach is robust.
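To make the "interpretation is left to userland" point concrete, here is a minimal sketch of the guessing approach. The helper names `decode_name` and `read_names` are made up for illustration, and the policy (try strict UTF-8 first, fall back to Latin-1) is only an assumption about what a given system's names might contain; it is exactly the kind of guess that is not robust.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: interpret raw file name bytes as characters.
# Assumption: names are UTF-8 when they are valid UTF-8, Latin-1 otherwise.
sub decode_name {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $name = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $name ? $name : decode('ISO-8859-1', $bytes);
}

# Hypothetical helper: list a directory, returning decoded character strings.
sub read_names {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = map  { decode_name($_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}
```

Note that a Latin-1 name whose bytes happen to form valid UTF-8 would be misread by this sketch, which is why guessing cannot be fully reliable.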

    So Perl continues to offer an experience that is equally bad on Unix and Win32. I'm not aware of any changes to this (planned or released), but then I'm not up to date with p5p either.

      The problem is that Win32 seems to be the only platform that offers a Unicode-aware file system API.

      And as far as I know, all perls compiled for Windows still use the "ANSI" API to access the file system (i.e. CreateFileA instead of CreateFileW). Completely switching Perl to use the "wide" API may have "interesting" side effects. A very obvious one would be readdir returning Unicode strings instead of byte strings. Will existing code be able to handle that?
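A small sketch of the kind of breakage that question hints at. It assumes, purely for illustration, that the ANSI code page is cp1250 and that existing code has stored names in their byte form; the file name and the `already_seen` helper are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The name "ke<r with caron>.txt" as cp1250 bytes, as an ANSI call might
# return it (the code page is an assumption for this example).
my $ansi_name = "ke\xF8.txt";

# The same name as a character string, as a wide-API readdir might return it.
my $wide_name = decode('cp1250', $ansi_name);

# Existing code often keys hashes on the byte form of a name.
my %seen = ($ansi_name => 1);

# Hypothetical lookup used by such existing code.
sub already_seen {
    my ($name) = @_;
    return exists $seen{$name} ? 1 : 0;
}

printf "byte form seen: %d\n", already_seen($ansi_name);
printf "char form seen: %d\n", already_seen($wide_name);
printf "third character: U+%04X vs U+%04X\n",
    ord(substr($ansi_name, 2, 1)), ord(substr($wide_name, 2, 1));
```

The two strings name the same file but compare as different, so lookups, comparisons, and cached lists built from the byte form would silently stop matching.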

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        And even if you make the wide character API only available with a pragma, you still need separate calling code for Windows and Unix, which was mostly my point.
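A rough sketch of what "separate calling code" could look like, dispatching on `$^O`. The function names are hypothetical. The Unix branch simply assumes UTF-8 names, and the Windows branch goes through Scripting.FileSystemObject via Win32::OLE as mentioned elsewhere in this thread; whether the names come back as decoded characters or as UTF-8 bytes is not verified here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Unix branch: readdir returns raw bytes; assume (not verify) they are UTF-8.
sub list_dir_unix {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = map  { decode('UTF-8', $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

# Windows branch: ask Scripting.FileSystemObject for files and subfolders.
sub list_dir_win32 {
    my ($dir) = @_;
    require Win32::OLE;    # loaded lazily so the script still compiles on Unix
    Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());
    my $fso = Win32::OLE->new('Scripting.FileSystemObject')
        or die "Cannot create Scripting.FileSystemObject";
    my $folder = $fso->GetFolder($dir);
    return map { $_->{Name} }
           (Win32::OLE::in($folder->{Files}),
            Win32::OLE::in($folder->{SubFolders}));
}

# Single entry point that hides the platform difference from callers.
sub list_dir {
    my ($dir) = @_;
    return $^O eq 'MSWin32' ? list_dir_win32($dir) : list_dir_unix($dir);
}
```

Callers would use `list_dir('.')` on either platform, but the two branches still have to be written and maintained separately.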

Re: Directory operations and Unicode
by Anonymous Monk on Nov 02, 2011 at 11:56 UTC
    What is the state of the directory operators (opendir,readdir) regarding Unicode support?
    That's been in perltodo for how many years now? 8? 10? I'm too depressed to find out the exact number.

      Yep. I ran into this problem recently while trying to write a script that would reparent (change the root of the folders containing the files), check, and let me fix my loads of /\.m3u8?/ files. Horrible mess. Plus, you would not believe what copying the folders from a Unix filesystem does to names containing accented letters. And the names were in Spanish (latin1); I do not want to imagine what would happen if they were in Czech (latin2) :-(

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

Re: Directory operations and Unicode
by patcat88 (Deacon) on Nov 06, 2011 at 02:39 UTC
      When using Win32::OLE I subsequently use the OS COM facilities, hence I bypass any Wide APIs, call the Scripting.FileSystemObject and access the filesystem in UTF-8:
      Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);
      $obj = Win32::OLE->new('Scripting.FileSystemObject');
      and use its methods, for example:
      $folder = $obj->GetFolder(".");
      $collection = $folder->{Files};
      If you want to keep your sanity, do not start looking into the wide APIs! :)
        When using Win32::OLE I subsequently use the OS COM facilities, hence I bypass any Wide APIs, call the Scripting.FileSystemObject and access the filesystem in UTF-8

        No. You don't bypass the Wide APIs, you wrap them in a ridiculously large stack of other APIs. After uselessly burning a lot of CPU cycles, the Scripting.FileSystemObject finally ends up calling the Wide APIs.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)