Re^4: how to unicode filenames? by perl-diddler (Hermit)
on Jun 28, 2012 at 01:00 UTC
"where your idea of assuming that "all filenames are UTF-8" breaks,"
Would you stop making false claims about what people say? I made no assertion that ALL filenames are UTF-8. Do you do that deliberately, quoting people slightly out of context to start an argument? Or is it really accidental? I said:
... and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this? You'd think -CSD would have given it a hint that all Data is to be considered UTF-8 encoded..
"Most linux systems use UTF-8 as native encoding" is not anywhere even close to "all filenames are UTF-8". Now, I could say all HTML5 defaults to UTF-8 encoding, and that would be correct. But filenames on linux are just stringZ's (NUL-terminated byte strings), which means you can put just about anything in them. Since most distros are using UTF-8 these days, AFAIK there are no current, *mainstream* Linux filesystems that don't support UTF-8. NTFS/Win32 don't count as mainstream linux filesystems, though NTFS supports any character in a file name (including NULs), as the NTFS file calls take a length-based filename (it's the Win32 calls that put character limits on filenames -- and registry keys).
Sigh... I think a wrapper around 'find' (-depth 1) might be the easiest. But I take input from readdir and try to determine whether each name is a file or a dir, so I pass the names directly to -f/-d, and it doesn't work. You'd think that, with the chars actually having attributes, perl wouldn't run a conversion on them to LATIN1 before -f/-d; i.e. if they were read in as byte strings, they should be passed as byte strings to -f/-d, and that should work fine. But Perl changes the encoding, and does so incorrectly. So yeah, I'd call that the perl unicode bug -- EXTREME!!
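For what it's worth, here is a minimal sketch of the "bytes in, bytes out" handling described above: the names from readdir go to -f/-d untouched, and only a copy is decoded for display. It assumes the names on disk are UTF-8 and that the directory path is itself a byte string. `classify_entries` is a hypothetical helper name, not anything from core Perl, and whether this sidesteps the problem in a given program depends on where the names were decoded in the first place.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list a directory, keeping the raw byte names
# for file tests and decoding a copy for display only.
# Assumes $dir is a byte string and on-disk names are UTF-8.
sub classify_entries {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @entries;
    for my $raw (readdir $dh) {
        next if $raw eq '.' or $raw eq '..';
        my $path = "$dir/$raw";    # bytes joined with bytes stay bytes
        my $type = -d $path ? 'dir' : -f $path ? 'file' : 'other';
        my $copy = $raw;           # decode a copy, never the original
        push @entries, {
            raw  => $raw,                    # use this for open/-f/-d/unlink
            name => decode('UTF-8', $copy),  # use this for printing only
            type => $type,
        };
    }
    closedir $dh;
    return @entries;
}
```

The design choice is simply to never hand a decoded (character) string back to the filesystem: anything that touches the disk uses `raw`, anything that goes to the terminal uses `name`.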
The problem comes down to the 128-255 range, where some perl developers are under the mistaken impression that such characters are UTF-8 compatible as is -- they are not. In UTF-8, every character above 127 takes 2 or more bytes to represent.
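To make the 128-255 point concrete, a small sketch using the core Encode module: the same character is one byte in Latin-1 and two bytes in UTF-8. `hex_bytes` is a made-up helper for printing, and U+00E9 is just an arbitrary example from that range.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show the bytes of an encoded string as hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $char = chr(0xE9);    # U+00E9, e with acute accent

my $latin1 = hex_bytes(encode('ISO-8859-1', $char));
my $utf8   = hex_bytes(encode('UTF-8', $char));

print "Latin-1: $latin1\n";    # Latin-1: e9
print "UTF-8:   $utf8\n";      # UTF-8:   c3 a9
```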
There is even crap in the perl documentation saying that UTF-8 documents need to have a BOM -- something that goes against the Unicode standard (only MS has such requirements).
The fix is simple -- if someone is in a Unicode/UTF-8 locale, then any char with the high bit set is part of a multi-byte character (2-4 bytes). In fact, every byte of a multi-byte UTF-8 sequence has the high bit set. U+0080 is encoded as 0xc2,0x80, and U+00FF is encoded as 0xc3,0xbf, all read left to right (low address to high). There is no endian issue with UTF-8, thus no need for a BOM. The program 'file' on linux does a pretty good job (though not perfect) of categorizing something as UTF-8 or ASCII...
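The two encodings quoted above can be checked with the core Encode module, and the same module gives a crude "is this well-formed UTF-8?" test. This is only a sketch: `looks_like_utf8` is a hypothetical helper, it is not how 'file' makes its decision, and note that plain ASCII passes too, since ASCII is a subset of UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
# A lone byte with the high bit set (e.g. 0x80 on its own) is rejected.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# The two encodings from the text, produced by Encode itself.
my $enc_80 = encode('UTF-8', chr(0x80));
my $enc_ff = encode('UTF-8', chr(0xFF));

print "U+0080 -> ", unpack('H*', $enc_80), "\n";    # U+0080 -> c280
print "U+00FF -> ", unpack('H*', $enc_ff), "\n";    # U+00FF -> c3bf

print looks_like_utf8("\xC2\x80"), "\n";    # 1: valid two-byte sequence
print looks_like_utf8("\x80"),     "\n";    # 0: continuation byte with no lead
```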
Start with ... as of perl 5.X.0 (x>=18), perl defaults to treating bytes with the high bit set as already UTF-8 encoded, and doesn't "upgrade" them from a provincial locale. To get the old behavior, use "xxxx" (default to locale)...