Re: unicode version of readdir
by demerphq (Chancellor) on Sep 14, 2007 at 10:04 UTC
Ideally Perl should use the widechar interfaces internally by default, and return a utf8 string when needed, and otherwise return a latin-1 string. The problem as I recall is that the interface for these routines is char * with no flag for unicode filesystem semantics.
The problem here is that the default behaviour is based on the kludge of utf8 as employed by the *nix world: you are assumed to know that your filenames are utf8 based on your locale, which is a really misguided idea, but one that is backwards compatible with code that is not unicode aware.
Oh, and before any *nix zealot decides to lecture me on how much smarter the *nix solution is, please go and read the history of the creation of utf8. It was specifically designed as a workaround to let legacy computer systems (UNIX specifically) handle unicode, and was always intended to be replaced by better mechanisms at a later date. But workarounds have a nasty habit of lasting much longer than most people realize.
---
$world=~s/war/peace/g
I don't think that using all-wide internal strings would be an ideal implementation, because on win32 the conversion from utf8 to bytes via the system locale might differ from windows' internal conversions. Because, well, windows locales can be strange. Sure, there are ANSI<->OEM<->WideChar conversion functions, but it would be a bad idea to require the perl programmer to use these when an actual conversion is needed. It would be easier (and less problematic) to assume that the system-dependent filename conversion rules are not guaranteed to be identical to perl's utf8<->byte conversion rules.
I'd say that a context-sensitive pp_readdir and friends would be just fine. In this case, when it expects utf8, on unix a simple SvUTF8_on() would be enough, and on win32 W-syscall instead of A-syscall would be issued. The question is how to determine the context.
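The "on unix a simple SvUTF8_on() would be enough" idea can be sketched at the Perl level. This is an illustrative sketch, not an actual perl API: `utf8::decode` plays the role of a validating SvUTF8_on, flagging a byte string as utf8 in place when the bytes are well-formed.

```perl
use strict;
use warnings;

# Perl-level sketch of the SvUTF8_on() idea: take the raw bytes that
# readdir() returns on unix and flag them as utf8 in place.
# utf8::decode() validates the bytes first, which a raw SvUTF8_on()
# at the C level would not.
my $name = "caf\xc3\xa9.txt";   # bytes as a unix readdir() would return them
my $ok   = utf8::decode($name); # true if the bytes were well-formed utf8
# $name is now the 8-character string "caf\x{e9}.txt"
```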
Re: unicode version of readdir
by sgt (Deacon) on Sep 14, 2007 at 12:24 UTC
Well, I can certainly qualify as an un*x-oriented monk (but I am no zealot ;). IMVHO, here we are talking about the filesystem user interface (and only that -- meaning that using ONE env var for TWO (or more) purposes does not feel right):
- latin-1-like filesystems have "forbidden chars" (NTFS quite a lot, I think). What happens with unicode-aware filesystems?
- Perl should ideally allow a default that makes sense for the platform, but only if the default is sensible. It's better if the default is the same for all OSes. If there is no consensus, then some kind of pragma will be needed, like 'use feature filesystem qw(unicode-aware ...)'
- The feature can be promoted to a default once unicode-aware filesystems are really stable and/or their semantics/implications are clearly understood
- There is always the option of supporting an OS-dependent version of the needed calls (for unicode-aware filesystems). Not ideal, but often necessary in the case of conflicting abstractions.
cheers
--stephan
Oh. I'm absolutely against making unicode-aware behavior the default, even once it is really, really stable. Filesystem business is external to perl, as are locales and IO. Actually, perl does really well with the latter two - one says explicitly "use locale" and one uses IO layers, and that's it. What I propose is some orthogonality in this regard, say "use feature filesystem => 'utf8'", and enjoy utf8 input from readdir, and possibly from open.
update: 'use feature' and the like are global, possibly a three-arg 'opendir' or two-arg 'readdir' would be better? Or, like 'binmode FILEHANDLE', something like 'utf8mode DIRHANDLE' would be more interesting?
Personally I disagree with this. I'd like to see this behaviour be automatic. Both VMS and Win32 support filesystems that properly support unicode file names, unlike the hacky approach of the UNIX world. Since we *can* know at a system level how the data is encoded (UTF-16LE), there is no problem mapping that data to and from UTF-8. The issue you mentioned elsewhere of codepages doesn't play a role as far as I understand it (which could just mean I misunderstand it :-). Unicode data is unicode data. Codepages only play a role when you want to translate a unicode string into your current "locale", but if you stay in unicode the entire time it shouldn't matter.
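The claim that UTF-16LE filename data maps losslessly to and from UTF-8 with no locale involved can be demonstrated with Encode alone (this only exercises the encoding round trip, not any actual filesystem call):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# What a W-syscall would hand back: the filename as UTF-16LE bytes.
my $utf16le = encode('UTF-16LE', "caf\x{e9}.txt");

# Lossless mapping into utf8 and back; no locale or codepage is
# consulted at any point.
my $chars = decode('UTF-16LE', $utf16le);
my $utf8  = encode('UTF-8', $chars);
my $back  = encode('UTF-16LE', decode('UTF-8', $utf8));
# $back is byte-for-byte identical to $utf16le
```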
The problem here is that the internal interfaces are designed around an exceptionally simple interface for legacy reasons. However, were someone to put in the work to change all of these interfaces to deal with SVs and not char *s, then we would be prepared both for the existing OSes/filesystems that can handle unicode filenames properly and for when some *nix filesystem does the same thing.
If you check the archives you will see this subject has come up before, and that Jan Dubois (of ActiveState) has opined on various pathways to resolve it. However, they are large-scope projects which are not within the scope of 5.10, though they could easily be in scope for 5.12.
---
$world=~s/war/peace/g
Re: unicode version of readdir
by zentara (Archbishop) on Sep 14, 2007 at 14:36 UTC
use Encode;
opendir( D, $path ) or die "opendir $path: $!";
@datafiles = grep { -f "$path/$_" } readdir( D );
closedir( D );
$_ = decode( 'utf8', $_ ) for ( @datafiles );
No, on win32 unfortunately it doesn't. The win32 readdir implementation substitutes wide characters with '?' if they cannot be mapped to the current locale.
Good to know -- thanks. (I'm not a Win32 user, so I wouldn't have known.)
But do Win32 systems really use locale settings? Would that imply, for instance, that someone using one of the Win32 file systems (NTFS, FAT32 or whatever) could have file names in, say, CP1256 (Arabic) encoding, someone else in CP1252 (Latin-1), and yet another person in UTF-16LE?
That would be hell...
(update: But... in that other thread referenced by zentara, he said that my code snippet worked for him... What's up with that?)
Re: unicode version of readdir
by zentara (Archbishop) on Sep 15, 2007 at 12:30 UTC
I have read this node before, thank you nevertheless. What I was trying to do is to avoid the sheer horror of using Win32::API by nicely mapping these functions to readdir().
What I was trying to do is to avoid the sheer horror of using Win32::API by nicely mapping these functions to readdir().
I think the only option you currently have which comes close to that is to enable the deprecated compile-time option USING_WIDE in win32.h in the Perl sources (which still works in 5.8.8, but won't any longer in 5.10). Also see Japanese filenames and USING_WIDE in win32.h, if you haven't come across this thread yet.
Just as a zen hacking educated guess, have you tried graff's method with something more win32-ish, like
@datafiles = grep { -f } readdir( D );
$_ = decode( 'UTF-16LE', $_ ) for ( @datafiles );
See also Can't Find File When Non-ASCII Letters Appear in Path
Re: unicode version of readdir
by dk (Chaplain) on Sep 15, 2007 at 22:11 UTC
Big thanks to everyone who answered, I think I gathered the information needed. I didn't solve the problem, but at least I have an overview of what is possible and what is not.
Also, during the last two days I've tried to hack blead and it became apparent that it is not really easy to build unicode filename support in. I think the best syntax I can come up with for enabling unicode readdir would be
binmode(DIRHANDLE, ':utf8')
but it is far from trivial to implement that without introducing evil hacks.
My plan now is to try to come up with a minimal patch that would enable unicode support not only for readdir() but for all other filename-based functions. I guess there will be opposition from non-windows folks, but let's see how it turns out. As for now, if anyone thinks it's worth it (or it isn't), please share your opinions.
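The reason the patch needs to cover all filename-based functions, not just readdir(), is the round trip: names coming out of readdir() must be re-encoded before being passed back to open(), stat(), and friends. A minimal sketch of what has to be done by hand today (assuming utf8 bytes on disk, the unix convention) illustrates what a unified layer would automate:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $name = "na\x{ef}ve.txt";    # a character string containing U+00EF

# Until something like binmode(DIRHANDLE, ':utf8') exists, the round
# trip has to be done by hand: encode character strings to utf8 bytes
# for every filename-taking builtin, and decode the bytes coming back.
open my $fh, '>', encode('UTF-8', "$dir/$name") or die $!;
close $fh;

opendir my $dh, $dir or die $!;
my @names = map  { decode('UTF-8', $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# file tests need the encoded byte form as well
my $found = -f encode('UTF-8', "$dir/$names[0]");
```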
... if anyone thinks it's worth it (or it isn't), please share your opinions.
Yes, I'd say it's definitely worth it. You could make a little girl a little happier... if that does mean something to you :)
Ok, seriously, if you have the time to look into this, please go ahead! IMHO, some way to conveniently handle different filename encodings is the only missing component in Perl's otherwise excellent support for unicode and other encodings.
Personally, I would favor a generic, platform-independent solution, where you can simply use a pragma to specify what encoding the filenames are in (in a particular environment). Of course, one would somehow have to account for the idiosyncrasies of the individual platforms. For example, on Windows it would probably make sense to take advantage of the magic that the wide-char conversion functions do offer when fed with the right parameters. Apart from that, I think less magic is more, i.e. no auto-detection of encodings and stuff like that...
Actually, I've been meaning to come up with a patch myself, but I have to admit I haven't got around to it yet (and the temporary workaround based on USING_WIDE in fact works quite well for what I was primarily in need of...)
but it is far from trivial to implement that without introducing evil hacks ... I guess there would be opposition from non-windows folks,
You are correct. And in common with many other caveats of using Perl on Win32, if you choose to follow through on this, you are in for a very tough time.
It is unfortunately the case that the innate, knee-jerk, anti-MS reaction to anything that might improve the lot of the win32-based Perl user will be negative. Unless you can demonstrate that there is absolutely no negative impact of your code upon *any other OS user*, anywhere, anytime...your patch is likely to be rejected.
Of course, there will likely be a few attempts by *nix users to refute this allegation. They will say that patches from win32 users are treated exactly the same as those from *nix users, and only rejected if they are not complete and thorough. They will, if pushed, explain the ridiculously high rejection rate as a symptom that win32 users and programmers are simply too stupid to produce high quality patches.
You may even get a rejection of this thesis from the 2 tame win32 developers that have been accepted into the fold. You know. Like the token black man in US films from the 1960s through the 1980s. Do not be fooled or appeased.
Or else, they will simply stay silent and hope that nobody notices.
Good luck.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The problem with this particular issue (as I've said elsewhere) is that the interface for the routines (which ARE completely pluggable) does not include a way to pass back to the calling code the fact that the strings are unicode. They are all based around crude UNIX-style char * interfaces.
So in this case I wouldn't expect the kind of thing you are referring to to come up; it will just be such a big job with such huge ramifications that it won't happen until 5.12 at least. :-(
I guess I'm one of the tame win32 users. Although anybody that knows me well knows that 'tame' is not the best description. ;-)
---
$world=~s/war/peace/g
Geez, man. What is your problem? Do you have an inferiority complex, or suffer from paranoid delusions?