
how to unicode filenames?

by perl-diddler (Hermit)
on Jun 27, 2012 at 09:17 UTC ( #978596=perlquestion )
perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

This prog demos the problem:
> 'ls' -1|tail -6 |perl -CSD -e'use 5.14.0;
  while (<>) { print }
  print "opening dir\n";
  opendir(my $dh, ".");
  my @files = grep { /^[^.]/ } readdir $dh;
  my @sfiles=sort @files;
  my $start= @sfiles-6;
  for (my $i=$start; $i<@sfiles;++$i) {
    printf "%s\n", ${sfiles[$i]};
  }
'
zwadobef.ttf
-chan.ttf
&#12415;&#12363;&#12385;&#12419;&#12435;-p.ttf
&#12415;&#12363;&#12385;&#12419;&#12435;-pb.ttf
&#12415;&#12363;&#12385;&#12419;&#12435;-ps.ttf
&#12415;&#12363;&#12385;&#12419;&#12435;.ttf
opening dir
zwadobef.ttf
&#156;-chan.ttf
み&#139;ち&#130;&#131;&#130;&#147;-p.ttf
み&#139;ち&#130;&#131;&#130;&#147;-pb.ttf
み&#139;ち&#130;&#131;&#130;&#147;-ps.ttf
み&#139;ち&#130;&#131;&#130;&#147;.ttf
The output read in from STDIN is correct -- how do I get the output from readdir to be correct? NOTE: the filenames returned by readdir aren't usable (i.e. testing them with "-f" or such returns "no such file").

Unfortunately you'll have to imagine how this would look, since <code> doesn't protect unicode chars... it encodes them. The first bit of output doesn't look that way on a terminal... it outputs Japanese hiragana characters...

Where does someone file a bug against perlmonks?...sigh...

Replies are listed 'Best First'.
Re: how to unicode filenames?
by Corion (Pope) on Jun 27, 2012 at 09:23 UTC

    File systems and functions are not encoding clean. On Linux and many other unixish systems, the filename string gets passed through "raw" from the filesystem driver, and the receiving userspace application has to decide on the encoding of the filename. See the "Bugs" section of utf8.

    Also see unicode version of readdir, directories and charsets
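To make the "raw bytes" point concrete, here is a minimal sketch, assuming a unixish system whose filenames happen to be UTF-8. The helper name list_dir_pairs and the scratch file it lists are made up for the illustration; only readdir, Encode and File::Temp are real. It shows that readdir hands back octets, that the octets are what the file tests want, and that a decoded copy is what you want for display.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Return [raw_bytes, decoded_text] pairs for the entries of $dir.
# ASSUMPTION: names on this filesystem are UTF-8 encoded.
sub list_dir_pairs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    my @raw = grep { !/^\.\.?\z/ } readdir $dh;   # skip "." and ".."
    closedir $dh;
    return map { [ $_, decode('UTF-8', $_) ] } sort @raw;
}

# Demo: create one file with a hiragana name in a scratch directory.
my $dir  = tempdir(CLEANUP => 1);
my $text = "\x{307F}\x{304B}\x{3061}\x{3083}\x{3093}.ttf";   # five hiragana + ".ttf"
my $raw  = encode('UTF-8', $text);                           # the bytes stored on disk
open(my $fh, '>', "$dir/$raw") or die "create: $!";
close $fh;

binmode(STDOUT, ':encoding(UTF-8)');
for my $pair (list_dir_pairs($dir)) {
    my ($bytes, $chars) = @$pair;
    # File tests get the raw bytes; the decoded copy is only printed.
    printf "%d bytes, %d characters, exists: %s, name: %s\n",
        length $bytes, length $chars,
        (-f "$dir/$bytes" ? 'yes' : 'no'), $chars;
}
```

Keeping both forms side by side is a design choice, not the only option: you could also decode everything and encode again just before each file operation.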

      So the problem has been around for 5 years -- and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this?

      You'd think -CSD would have given it a hint that all Data is to be considered UTF-8 encoded......

      This is highly gross.
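For what it's worth, -C only installs UTF-8 layers on the standard streams and on handles opened afterwards (plus @ARGV with the A flag); it does not touch what readdir returns. The sketch below tries to show why the readdir names in the question come out garbled: undecoded bytes printed through a :utf8 layer get encoded a second time. The helper bytes_written_via_utf8_layer is a made-up name, and the in-memory handle stands in for STDOUT under -CS.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print $string through a handle with a :utf8 output layer (what -CS/-CD
# set up for STDOUT) and return the bytes that actually get written.
sub bytes_written_via_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>:utf8', \$buffer) or die "in-memory open: $!";
    print {$out} $string;
    close $out;
    return $buffer;
}

# The three UTF-8 bytes of HIRAGANA LETTER MI, as readdir would return them.
my $raw_name = "\xE3\x81\xBF";

# Undecoded: Perl sees three Latin-1 characters and encodes each one again.
my $garbled = bytes_written_via_utf8_layer($raw_name);
# Decoded first: Perl sees one character and writes its three bytes once.
my $correct = bytes_written_via_utf8_layer(decode('UTF-8', $raw_name));

printf "undecoded: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $garbled;
printf "decoded:   %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $correct;
```

The undecoded case writes six bytes (C3 A3 C2 81 C2 BF) where the decoded case writes the original three (E3 81 BF), which is the doubled-up look of the readdir output above.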

        Luckily, Perl is not restricted to Linux.

        Feel free to implement the appropriate layer for Linux - perltodo lists the relevant functions and some thoughts that need to be considered. In particular, there are filesystems where your idea of assuming that "all filenames are UTF-8" breaks: mount, for example, lists the various encodings that a filesystem can provide, and these are usually passed straight through the various layers to your application.

Re: how to unicode filenames?
by zentara (Archbishop) on Jun 27, 2012 at 10:30 UTC
    The output read in from STDIN is correct -- how do I get the output from readdir to be correct -

    Try this:

    # this decode utf8 routine is used so filenames with extended
    # ascii characters (unicode) in filenames, will work properly
    use Encode;
    opendir my $dh, $path or warn "Error: $!";
    my @files = grep !/^\.\.?$/, readdir $dh;
    closedir $dh;
    # @files = map{ "$path/".$_ } sort @files;
    # $_ = decode( 'utf8', $_ ) for ( @files );
    # or in one step
    @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;

    There also is the utf8::all module.

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      I'll certainly give it a try... if it performs well, I'd prefer it over my hack-around of calling 'find' (as perl doesn't have a problem with the filenames if they come in on STDIN)...

      Do I want to 'decode' utf8? The utf8 page mentions something about using Encode -- basically I don't need it to be decoded so much as just "relabeled" in place as already being "UTF-8-ified"...
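On the "relabel in place" idea: the core function utf8::decode() is probably the closest supported thing. It converts the string in place and returns false when the bytes are not well-formed UTF-8. (Encode::_utf8_on() only flips the internal flag without checking anything, and its own documentation marks it as internal.) A small sketch follows; the wrapper decode_names_in_place and the sample names are invented for the example.

```perl
use strict;
use warnings;

# Decode a list of raw filenames in place.
# utf8::decode() is always available (no "use utf8" needed). It returns
# false, and leaves the string as it was, when the bytes are not valid UTF-8.
sub decode_names_in_place {
    my ($names) = @_;                 # array reference of byte strings
    my @undecodable;
    for my $name (@$names) {          # $name aliases the array element
        my $copy = $name;
        utf8::decode($name) or push @undecodable, $copy;
    }
    return @undecodable;              # names that were left as raw bytes
}

# Hypothetical input: one UTF-8 name, one leftover Latin-1 name.
my @names = ("\xE3\x81\xBF.ttf", "caf\xE9.ttf");
my @bad   = decode_names_in_place(\@names);

printf "%d characters in first name, %d name(s) left as raw bytes\n",
    length $names[0], scalar @bad;
```

Checking the return value matters on a disk with mixed encodings, since a Latin-1 name is simply not valid UTF-8 and has to be handled some other way.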

      Thanks again! (Wish I could give more than one positive vote to people who are really helpful -- considering the vast difference when compared to those who are just contrarian)...

        Hi, I would like to share my Unicode battles with you, since we both are fighting the same battle it seems. After a few unicode related posts, yours being one of them, I decided to try and make a little utility I wrote, named vgrep, unicode aware. It was quite a hit or miss transformation. See Gtk2 Visual Grep

        I had to add the -CS perlrun switch, use the utf8::all module, and even after all that, I still needed to use Encode::decode() in many places to get the desired output.

        Even though my linux filesystem locale is en_US.UTF-8 in my .bashrc, I still needed to run input strings and filenames thru decode. I'm using Perl 5.14.1.

        It works, but it definitely seems to my sensibilities that it should be simpler. I guess the problem comes from having many files and filenames coming in thru the net, and left over from previous Latin-1 linux installations, which are not UTF-8.

        The general rule I seem to be seeing is "treat all input as binary" then decode. My vgrep program still emits some errors when searching thru pdf files, which are detected as being -T text, but contain binary images; and I don't understand why File::Find doesn't automatically see unicode filenames, without having to decode $File::Find::name.
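A rough sketch of that "treat as binary, then decode" rule applied to File::Find, assuming UTF-8 names on a unixish filesystem. File::Find passes names along as the bytes the filesystem returned, so the file test below runs on the raw $_ and only the collected copy is decoded. find_decoded and the scratch file are made up for the demo.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Find qw(find);
use File::Temp qw(tempdir);

# Walk $top and return the decoded (character) paths of all plain files.
# ASSUMPTION: names are UTF-8; with decode()'s default checking, malformed
# bytes become U+FFFD instead of raising an error.
sub find_decoded {
    my ($top) = @_;
    my @found;
    find(sub {
        return unless -f $_;          # $_ is still raw bytes here
        push @found, decode('UTF-8', $File::Find::name);
    }, $top);
    return sort @found;
}

# Demo: one file with a hiragana name in a scratch directory.
my $dir  = tempdir(CLEANUP => 1);
my $text = "\x{307F}\x{304B}\x{3061}\x{3083}\x{3093}.ttf";
open(my $fh, '>', "$dir/" . encode('UTF-8', $text)) or die "create: $!";
close $fh;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for find_decoded($dir);
```

If a decoded path has to be opened later, encode it back to bytes first, or keep the raw $File::Find::name next to the decoded copy.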

Re: how to unicode filenames?
by Anonymous Monk on Jun 27, 2012 at 09:29 UTC

Node Type: perlquestion [id://978596]
Approved by moritz