
Re^4: how to unicode filenames?

by perl-diddler (Hermit)
on Jun 28, 2012 at 01:00 UTC ( #978807=note )

in reply to Re^3: how to unicode filenames?
in thread how to unicode filenames?

"where your idea of assuming that "all filenames are UTF-8" breaks, "
Would you stop making false claims about what people say? I made no assertion that "ALL filenames are UTF-8". Do you do that deliberately: quote people slightly out of context to start an argument? Or is it really accidental? I said:
... and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this? You'd think -CSD would have given it a hint that all Data is to be considered UTF-8 encoded..

"Most linux systems use UTF-8 as native encoding" is not anywhere even close to "all filenames are UTF-8".... Now I could say all HTML5 defaults to UTF-8 encoding, and that would be correct. But filenames on linux are just stringZ's. That means you can put just about anything in them, even though most distros are using UTF-8 these days. AFAIK, there are no current, *mainstream* Linux filesystems that don't support UTF-8. NTFS/Win32 don't count as mainstream linux filesystems, though NTFS supports any character in a file name (including NULs), as the NTFS file calls take a length-based filename (it's the Win32 calls that put character limits on filenames -- and registry keys...).

Sigh.... I think a wrapper around 'find' (-depth 1) might be the easiest -- but I take input from readdir and try to determine if each entry is a file or a dir, so I pass the names directly to -f/-d, and it doesn't work. You'd think that, with the chars actually having attributes, it wouldn't run a conversion on them to LATIN1 before -f/-d -- i.e. if they were read in as byte strings, they should be passed as byte strings to -f/-d... that should work fine. But Perl changes the encoding and does so incorrectly. So yeah, I'd call that the perl unicode bug -- EXTREME!!...
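For reference, here is a minimal sketch of the readdir plus -f/-d pattern described above. It assumes the names are kept exactly as the byte strings readdir returns and are only decoded for display. The helper name classify_entries is made up for illustration. It also prefixes the directory before testing, since readdir returns bare names; that is one thing worth ruling out when -f/-d fail, not necessarily the cause here.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Spec;

# Hypothetical helper: classify the entries of $dir without altering their bytes.
sub classify_entries {
    my ($dir) = @_;
    opendir my $dh, $dir or die "opendir $dir: $!";
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    my %type;
    for my $name (@names) {
        # readdir returns bare names, so prefix the directory before testing.
        my $path = File::Spec->catfile($dir, $name);
        $type{$name} = (-d $path) ? 'dir' : (-f $path) ? 'file' : 'other';
    }
    return \%type;
}

my $types = classify_entries('.');
binmode STDOUT, ':encoding(UTF-8)';
for my $name (sort keys %$types) {
    # Decode for display only; this assumes UTF-8 names, which is not guaranteed.
    my $shown = decode('UTF-8', $name);
    print "$types->{$name}\t$shown\n";
}
```

The byte string from readdir is what goes to the file tests; the decoded copy is never passed back to the filesystem.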

The problem comes down to the 128-255 range, where some perl developers are under the mistaken impression that such characters are UTF-8 compatible as is -- they are not. All characters over 127 require 2 or more bytes to represent them in UTF-8.

There is even crap in the perl documents that UTF-8 documents need to have a BOM -- something that goes against the Unicode standard (only MS has such requirements).

The fix is simple -- if someone is in a Unicode/UTF-8 locale, then any byte with the high bit set is part of a multi-byte character (2-4 bytes). In fact, ALL bytes of a multi-byte UTF-8 sequence have the high bit set. 0x80 is encoded as 0xc2,0x80, and 0xff is encoded as 0xc3,0xbf, all read left-to-right (low-to-high). There is no endian issue with UTF-8, thus no need for a BOM. The program 'file' on linux does a pretty good job (though not perfect) of categorizing something as UTF-8 or ASCII...
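The two encodings quoted above can be checked with the core Encode module. The sketch below also shows a rough "is this well-formed UTF-8?" test in the spirit of what 'file' does. The helper names utf8_hex and looks_like_utf8 are invented for this example, and a successful strict decode only means the bytes are valid UTF-8, not that they were intended as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

# Hex dump of the UTF-8 encoding of a single code point.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Hypothetical helper: true if $bytes decodes as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps $bytes intact.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

print utf8_hex(0x80), "\n";    # c2 80
print utf8_hex(0xff), "\n";    # c3 bf
print looks_like_utf8("caf\xC3\xA9.txt") ? "utf8\n" : "not utf8\n";   # valid two-byte sequence
print looks_like_utf8("caf\xE9.txt")     ? "utf8\n" : "not utf8\n";   # Latin-1 byte, malformed as UTF-8
```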

Start with ... as of perl 5.X.0 (X>=18), perl defaults to treating high-bit-set bytes as already UTF-8 encoded and doesn't "upgrade" them from a provincial locale. To get the old behavior, use "xxxx" (default to locale)...

Replies are listed 'Best First'.
Re^5: how to unicode filenames?
by Corion (Pope) on Jun 28, 2012 at 07:05 UTC

    I'm sorry - I misunderstood your sentence

    ...and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this?

    as relevant to the problem in the sense that you wanted Perl to automatically decode all filenames from UTF-8. I assumed you wanted an automagic solution because certainly you are aware of Encode and the common way of simply decoding all filenames from the filesystem by calling decode('UTF-8', $filename).

    If you can post some code where Perl actually munges the filename encoding, that would be interesting, because I am unaware of a situation where

    opendir my $dir, '.' or die "opendir: $!";
    for my $ent ( readdir $dir ) {
        print "'$ent' is read but does not exist\n" if !-e $ent;
    }

    produces any output. Of course, once you mix data from other sources than the file system and data from the filesystem functions, you need to be aware of the respective encodings and convert between them, but in absence of a short, self-contained example of Perl code (plus the type of file system and the filename, as hex), it's hard to advise you better.
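    As a rough illustration of converting at the boundaries, the sketch below decodes names coming from the filesystem and encodes them again before any filesystem call. The helper names name_from_fs and name_to_fs are made up for this example, and it assumes the names on disk really are UTF-8. With the default decode, a name that is not valid UTF-8 would not survive the round trip.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers marking the boundary between bytes and characters.
sub name_from_fs { my ($bytes) = @_; return decode('UTF-8', $bytes) }   # bytes -> characters
sub name_to_fs   { my ($chars) = @_; return encode('UTF-8', $chars) }   # characters -> bytes

opendir my $dh, '.' or die "opendir: $!";
my @raw = readdir $dh;
closedir $dh;

binmode STDOUT, ':encoding(UTF-8)';
for my $raw (@raw) {
    my $name = name_from_fs($raw);    # character string: fine for length, regexes, printing
    my $path = name_to_fs($name);     # encode again before any filesystem call
    next unless -e $path;
    printf "%s (%d chars, %d bytes)\n", $name, length($name), length($raw);
}
```

    Keeping the original bytes around (here $raw) avoids the lossy case entirely, since they can be passed to the file tests unchanged.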

    You continue your post with a paragraph that starts with

    The fix is simple...

    I'm not sure how your fix would address the problems you encounter, and how your fix would maintain backwards compatibility. Feel free to posit your ideas to the perl5-porters mailing list, or even better, supply working code, as that has the greatest chance of moving Perl in the direction you seem to want.
