
Re: unicode version of readdir

by demerphq (Chancellor)
on Sep 14, 2007 at 10:04 UTC (#638983)

in reply to unicode version of readdir

Ideally Perl should use the wide-char interfaces internally by default, returning a utf8 string when needed and a latin-1 string otherwise. The problem, as I recall, is that the interface for these routines is char * with no flag for unicode filesystem semantics.

The problem here is that the default behaviour is based on the kludge of utf8 as employed by the *nix world: you only know your filenames are utf8 because of your locale. That is really a terrible idea, but it is backwards compatible with code that is not unicode aware.
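To make the locale kludge concrete: on *nix, readdir hands back raw bytes, and the caller has to guess the encoding. Here is a minimal sketch, assuming the filesystem stores names as UTF-8. `decode_name` and `readdir_decoded` are hypothetical helpers made up for illustration, not anything Perl provides.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one raw filename, ASSUMING the bytes are UTF-8.
# If they are not well-formed UTF-8, return the raw bytes unchanged
# (Perl will then treat each byte as a Latin-1 character).
sub decode_name {
    my ($bytes) = @_;
    my $copy  = $bytes;    # decode() with a CHECK value may modify its input
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return defined $chars ? $chars : $bytes;
}

# Hypothetical wrapper around the built-in readdir; skips '.' and '..'.
sub readdir_decoded {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open '$dir': $!";
    my @names = map  { decode_name($_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}
```

The fallback branch is the weak spot: a name that happens to be valid UTF-8 but was written under another locale gets decoded wrongly, and nothing in the char * interface can tell you so.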

Oh, and before any *nix zealot decides to lecture me on how much smarter the *nix solution is, please go and read the history of the creation of utf8. It was specifically designed as a workaround for legacy computer systems (UNIX specifically) to handle unicode, and was always intended to be replaced by better mechanisms at a later date. But workarounds have a nasty habit of lasting much, much longer than most people realize.


Replies are listed 'Best First'.
Re^2: unicode version of readdir
by dk (Chaplain) on Sep 14, 2007 at 12:32 UTC
    I don't think that using all-wide internal strings would be an ideal implementation, because on win32 the conversion from utf8 into bytes using the system locale might differ from Windows' internal conversions. Windows locales can be strange, after all. Sure, there are ANSI<->OEM<->WideChar conversion functions, but it would be a bad idea to require the Perl programmer to use these whenever an actual conversion is needed. It would be easier (and less problematic) to assume that system-dependent filename conversion rules are not guaranteed to be identical to Perl's utf8<->byte conversion rules.

    I'd say that a context-sensitive pp_readdir and friends would be just fine. In that case, when the caller expects utf8, a simple SvUTF8_on() would be enough on unix, and on win32 a W-syscall would be issued instead of an A-syscall. The question is how to determine the context.
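One way to picture the context-sensitive variant from Perl space is to pass the context explicitly as an option. This is only a sketch: `readdir_ctx` and its `utf8` option are made up for illustration, and `utf8::decode` stands in as a validating cousin of the `SvUTF8_on()` that a C-level implementation would call on unix. The win32 W-syscall branch is not shown.

```perl
use strict;
use warnings;

# Hypothetical: readdir_ctx($dir, utf8 => 1) returns character strings;
# readdir_ctx($dir) returns the bytes exactly as readdir supplied them.
sub readdir_ctx {
    my ($dir, %opt) = @_;
    opendir(my $dh, $dir) or die "Cannot open '$dir': $!";
    my @names = readdir $dh;
    closedir $dh;
    if ($opt{utf8}) {
        # utf8::decode converts in place and returns false when the bytes
        # are not well-formed UTF-8; such names are kept as raw bytes here.
        utf8::decode($_) for @names;
    }
    return @names;
}
```

An explicit option sidesteps the open question of how to detect the context automatically. A pragma or a per-handle layer would be other candidates, but both are speculation on my part.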
