http://www.perlmonks.org?node_id=638972

dk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I was researching how to get the results of readdir in utf8, and didn't find anything useful. The problem is that I want to read file names on win32 that contain non-latin characters, which are mapped to '?' in my codepage.

I found that the problem has been discussed before, but couldn't find any suitable solutions: the jperl hacks have been discontinued, and Win32API::File has no FindFirst/FindNext entries.

If it is indeed not possible, I was thinking of introducing some switches in the perl core that would toggle the behavior of readdir between bytes and utf8. The next step would probably be for open to recognize utf8 file names as well, but that's for later.

Another aspect is that the problem is wider than win32: it is perfectly legal to create utf8 file names on unix file systems (of course one can always treat them as non-unicode names, which is not possible on win32); gnome utilities use this feature when run under UTF8 locales. The point is that if someone a) explicitly knows that his files have utf8 names and b) wants them to be accessed with perl utf8 semantics and little hassle (and irrespective of the locale!), there's no way to do that except to mess with Encode.
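For reference, a minimal sketch of what "messing with Encode" looks like on unix, assuming the caller already knows the names are UTF-8. The helper name readdir_utf8 is made up for illustration; it is not a core function. It does not help on win32, where the characters are already lost by the time readdir returns.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: list a directory and decode every entry from
# UTF-8 bytes into a Perl character string. Dies on malformed UTF-8.
sub readdir_utf8 {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names;
    while (defined(my $bytes = readdir($dh))) {
        # decode() with a CHECK argument may modify its input,
        # so hand it a copy of the raw name.
        my $copy = $bytes;
        push @names, decode('UTF-8', $copy, FB_CROAK);
    }
    closedir($dh);
    return @names;
}
```

The names returned this way are character strings, so they have to be encoded back to bytes before being passed to open, unlink and friends.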

So my questions are:
- Can (as of now) readdir return utf8 scalars?
- If not, is it a good idea to introduce such changes in core?
- If yes, what would be the most desirable format of the trigger? A new system var, e.g. $UTF8_FILENAMES, or a new pragma like "use utf8 'filenames'" or "use utf8_filenames" or ...?

Thank you!

Re: unicode version of readdir
by demerphq (Chancellor) on Sep 14, 2007 at 10:04 UTC

    Ideally Perl should use the widechar interfaces internally by default, and return a utf8 string when needed, and otherwise return a latin-1 string. The problem as I recall is that the interface for these routines is char * with no flag for unicode filesystem semantics.

    The problem here is that the default behaviour is based on the kludge of utf8 as employed by the *nix world. That being that you know your filenames are utf8 based on your locale, which is really a completely retarded idea, but backwards compatible with code that is not unicode aware.

    Oh, and before any *nix zealot decides to lecture me on how much smarter the *nix solution is please go and read the history of the creation of utf8, it was specifically designed as a workaround for legacy computer systems (UNIX specifically) to handle unicode and was always intended to be replaced by better mechanisms at a later date. But workarounds have a nasty habit of lasting much much longer than most people realize.

    ---
    $world=~s/war/peace/g

      I don't think that using all-wide internal strings would be an ideal implementation, because on win32 the conversion from utf8 into bytes using the system locale might differ from windows' internal conversions. Because, well, windows locales can be strange. Sure, there are ANSI<->OEM<->WideChar conversion functions, but it would be a bad idea to require the perl programmer to use these when the actual conversion is needed. It would be easier (and less problematic) to assume that system-dependent filename conversion rules are not guaranteed to be identical to perl's utf8<->byte conversion rules.

      I'd say that a context-sensitive pp_readdir and friends would be just fine. In this case, when it expects utf8, on unix a simple SvUTF8_on() would be enough, and on win32 a W-syscall would be issued instead of an A-syscall. The question is how to determine the context.
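To illustrate the unix half of that idea from Perl space: the closest Perl-level relative of SvUTF8_on() is utf8::decode(), which (unlike the C macro) also checks that the bytes are valid UTF-8 before converting the scalar in place. A small sketch, with a made-up file name:

```perl
use strict;
use warnings;

# Raw bytes as a unix readdir() would return them: "caf" followed by
# the two-byte UTF-8 encoding of e-acute.
my $name = "caf\xc3\xa9";
my $bytes_before = length($name);    # length counts bytes here

# utf8::decode() validates the bytes and, on success, turns the scalar
# into a character string in place. It returns false on invalid input.
my $ok = utf8::decode($name);
my $chars_after = length($name);     # length now counts characters

printf "ok=%d bytes=%d chars=%d\n", $ok ? 1 : 0, $bytes_before, $chars_after;
```

The data itself stays the same; only Perl's interpretation of it changes, which is why no conversion table is needed on unix.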

Re: unicode version of readdir
by sgt (Deacon) on Sep 14, 2007 at 12:24 UTC

    Well, I can certainly qualify as a un*x-oriented monk (but I am no zealot;) IMVHO here we are talking filesystem user interface (and only that -- meaning using ONE env var for TWO (or more) purposes does not feel right):

  • latin-1-like-aware filesystems have "forbidden chars" (NTFS quite a lot I think). What does happen with unicode-aware filesystems?
  • Perl should ideally allow a default that makes sense for the platform, but IFF the default is sensible. It's better if the default is the same for all OSes. If there is no consensus, then some kind of pragma will be needed, like 'use feature filesystem qw(unicode-aware ...)'
  • The feature can be promoted to default once unicode-aware filesystems are really stable and/or their semantics/implications are clearly understood
  • There is always the option of supporting an OS-dependent version of the needed calls (for unicode-aware filesystem). Not ideal but often necessary in the case of conflicting abstractions.
  • cheers --stephan
      Oh. I'm absolutely against making unicode-aware behavior the default, even if it is really, really stable. Filesystem business is external to perl, as are locales and IO. Actually, perl does really well with the latter two: one says explicitly "use locale" and one uses IO layers, and that's it. What I propose is some orthogonality in this regard: say "use feature filesystem => 'utf8'" and enjoy utf8 input from readdir, and possibly from open.

      update: 'use feature' and the like are global, possibly a three-arg 'opendir' or two-arg 'readdir' would be better? Or, like 'binmode FILEHANDLE', something like 'utf8mode DIRHANDLE' would be more interesting?

        Personally I disagree with this. I'd like to see this behaviour be automatic. Both VMS and Win32 support filesystems that properly support unicode file names, unlike the hacky approach of the UNIX world. Since we *can* know at a system level how the data is encoded (UTF16LE), there is no problem mapping that data to and from UTF8. The issue you mentioned elsewhere of codepages doesn't play a role as far as I understand it (which could just mean I misunderstand it :-). Unicode data is unicode data. Codepages only play a role when you want to translate a unicode string into your current "locale", but if you stay in unicode the entire time it shouldn't matter.

        The problem here is that the internal interfaces are designed around an exceptionally simple interface due to legacy reasons. However were someone to put in the work to change all of these interfaces to deal with SV's and not char *'s then we would be prepared for the existing OS/filesystems that can handle unicode filenames properly as well as for when some *nix file system does the same thing.

        If you check the archives you will see this subject has come up before and that Jan Dubois (of Activestate) has opined on various pathways to resolve it. However, they are large projects which are not within the scope of 5.10, although they could easily be in scope for 5.12.

        ---
        $world=~s/war/peace/g

Re: unicode version of readdir
by zentara (Archbishop) on Sep 14, 2007 at 14:36 UTC
      No, on win32 unfortunately it doesn't. Win32::readdir substitutes wide characters with '?' if these cannot be mapped to the current locale.
        Good to know -- thanks. (I'm not a Win32 user, so I wouldn't have known.)

        But, do Win32 systems really use locale settings? Would that imply, for instance, that someone using one of the Win32 file systems (NTFS, FAT32 or whatever) could have file names with, say, CP1256 (Arabic) encoding, and someone else use CP1252 (Latin-1), and yet another person use UTF-16LE?

        That would be hell...

        (update: But... in that other thread referenced by zentara, he said that my code snippet worked for him... What's up with that?)

Re: unicode version of readdir
by zentara (Archbishop) on Sep 15, 2007 at 12:30 UTC
      I have read this node before, thank you nevertheless. What I was trying to do is to avoid the sheer horror of using Win32::API by nicely mapping these functions to readdir().
        What I was trying to do is to avoid the sheer horror of using Win32::API by nicely mapping these functions to readdir().

        I think the only option you currently have which comes close to that is to enable the deprecated compile-time option USING_WIDE in win32.h in the Perl sources (which still works in 5.8.8, but won't any longer in 5.10). Also see Japanese filenames and USING_WIDE in win32.h, if you haven't come across this thread yet.

Re: unicode version of readdir
by dk (Chaplain) on Sep 15, 2007 at 22:11 UTC
    Big thanks to everyone who answered, I think I gathered the information needed. I didn't solve the problem, but at least I have an overview of what is possible and what is not.

    Also, during the last two days I've tried to hack blead and it became apparent that it is not really easy to build unicode filename support in. I think the best syntax I can come up with for enabling unicode readdir would be

    binmode(DIRHANDLE, ':utf8')

    but it is far from trivial to implement that without introducing evil hacks.
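Until something like that exists in core, the per-handle idea can be approximated in pure Perl with a small wrapper object. This is only a sketch under stated assumptions: the package name DirHandleUTF8 and its methods are invented for illustration, and it only decodes whatever bytes readdir hands back, so it helps on unix but not on win32, where the '?' substitution has already happened.

```perl
{
    package DirHandleUTF8;    # hypothetical name, not a CPAN module
    use strict;
    use warnings;
    use Encode ();

    # Open a directory and remember which encoding its names use.
    sub new {
        my ($class, $path, $encoding) = @_;
        $encoding = 'UTF-8' unless defined $encoding;
        my $enc = Encode::find_encoding($encoding)
            or die "Unknown encoding: $encoding";
        opendir(my $dh, $path) or die "Cannot open $path: $!";
        return bless { dh => $dh, enc => $enc }, $class;
    }

    # Return the next entry as a character string, or nothing at the end.
    sub next_name {
        my ($self) = @_;
        my $bytes = readdir($self->{dh});
        return unless defined $bytes;
        return $self->{enc}->decode($bytes);
    }

    sub finish {
        my ($self) = @_;
        return closedir($self->{dh});
    }
}

# Usage:
#   my $d = DirHandleUTF8->new('.');
#   while (defined(my $name = $d->next_name)) { ... }
#   $d->finish;
```

Keeping the encoding on the handle rather than in a global switch means two directories with differently encoded names can be read side by side.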

    My plan now is to try to come up with a minimal patch that would enable unicode support not only for readdir() but for all other filename-based functions. I guess there would be opposition from non-windows folks, but let's see how it turns out. For now, if anyone thinks it is worth it (or that it isn't), please share your opinions.

      Unicode readdir is built into the next release of Win32API::File, but I'm unsure when I'll get the release finished. Then the syntax you'd use to get the Unicode version would be to import the Unicode version either as readdir or under some other name.

      - tye        

      ... if anyone thinks if it worth it (or if it isn't) please share your opinions.

      Yes, I'd say it's definitely worth it. You could make a little girl a little happier... if that does mean something to you :) — Ok, seriously, if you have the time to look into this, please go ahead!

      IMHO, some way to conveniently handle different filename encodings is the only missing component in Perl's otherwise excellent support for unicode and other encodings.

      Personally, I would favor a generic, platform-independent solution, where you can simply use a pragma to specify what encoding the filenames are in (in a particular environment). Of course, one would somehow have to account for the idiosyncrasies of the individual platforms. For example, on Windows it would probably make sense to take advantage of the magic that the wide-char conversion functions do offer when fed with the right parameters. Apart from that, I think less magic is more, i.e. no auto-detection of encodings and stuff like that...

      Actually, I've been meaning to come up with a patch myself, but I have to admit I haven't got around to it yet (and the temporary workaround based on USING_WIDE in fact works quite well for what I was primarily in need of...)

      but it is far from trivial to implement that without introducing evil hacks ... I guess there would be opposition from non-windows folks,

      You are correct. And in common with many other caveats of using Perl on Win32, if you choose to follow through on this, you are in for a very tough time.

      It is unfortunately the case that the innate, knee-jerk, anti-MS reaction to anything that might improve the lot of the win32-based Perl user will be negative. Unless you can demonstrate that there is absolutely no negative impact of your code upon *any other OS user*, anywhere, anytime...your patch is likely to be rejected.

      Of course, there will likely be a few attempts by *nix users to refute this allegation. They will say that patches from win32 users are treated exactly the same as those from *nix users, and only rejected if they are not complete and thorough. They will, if pushed, explain the ridiculously high rejection rate as a symptom that win32 users and programmers are simply too stupid to produce high quality patches.

      You may even get a rejection of this thesis from the 2 tame win32 developers that have been accepted into the fold. You know. Like the token black man in US films from the 1960s through the 1980s. Do not be fooled or appeased.

      Or else, they will simply stay silent and hope that nobody notices.

      Good luck.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        The problem with this particular issue (as I've said elsewhere) is that the interface for the routines (which ARE completely pluggable) does not include a way to pass the fact that the strings are unicode back to the calling code. They are all based around crude UNIX style char * interfaces.

        So in this case I wouldn't expect the kind of thing you are referring to to come up; it will just be such a big job with such huge ramifications that it won't happen until 5.12 at least. :-(

        I guess I'm one of the tame win32 users. Although anybody that knows me well knows that 'tame' is not the best description. ;-)

        ---
        $world=~s/war/peace/g

        geez man. what is your problem?
        do you have an inferiority complex, or suffer from paranoid delusions