PerlMonks  

Re: how to unicode filenames?

by zentara (Archbishop)
on Jun 27, 2012 at 10:30 UTC


in reply to how to unicode filenames?

The output read in from STDIN is correct -- how do I get the output from readdir to be correct -

Try this:

    # this decode utf8 routine is used so filenames with extended
    # ascii characters (unicode) in filenames, will work properly
    use Encode;

    opendir my $dh, $path or warn "Error: $!";
    my @files = grep !/^\.\.?$/, readdir $dh;
    closedir $dh;

    # @files = map { "$path/".$_ } sort @files;
    # $_ = decode( 'utf8', $_ ) for ( @files );

    # or in one step
    @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;

There is also the utf8::all module.


I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh

Re^2: how to unicode filenames?
by perl-diddler (Chaplain) on Jun 28, 2012 at 01:04 UTC
    I'll certainly give it a try... if it performs well, I'd prefer it over my hack-around of calling 'find' (as perl doesn't have a problem with the filenames if they come in on STDIN)...

    Do I want to 'decode' utf8? The utf8 page mentions something about using Encode -- basically I don't need it to be decoded so much as just "relabeled" in place as already being "UTF-8-ified"...
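
On the "decode versus relabel in place" question, a minimal sketch may help. It assumes a hard-coded byte string standing in for what readdir returns (the name "café.txt" stored as UTF-8); the variable names are only for illustration. Encode::decode returns a new character string, while the built-in utf8::decode converts the scalar in place and reports whether the bytes were valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Assumed sample: the raw bytes readdir would return for "café.txt" (UTF-8 on disk).
my $bytes = "caf\xC3\xA9.txt";

# Option 1: Encode::decode returns a new character string.
# FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes untouched.
my $chars = decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);

# Option 2: utf8::decode converts the scalar in place and returns
# false if the bytes are not valid UTF-8.
my $in_place = $bytes;
my $ok = utf8::decode($in_place);

printf "bytes: %d, decode: %d, in place: %d\n",
    length $bytes, length $chars, length $in_place;
```

The byte string has length 9 and both decoded strings have length 8, because the two-byte sequence for "é" becomes one character. Either way the bytes are validated rather than just flagged, which is usually what you want for names coming off a filesystem.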

    Thanks again! (Wish I could give more than one positive vote to people who are really helpful -- considering the vast difference when compared to those who are just contrarian)...

      Hi, I would like to share my Unicode battles with you, since it seems we are both fighting the same battle. After a few Unicode-related posts, yours being one of them, I decided to try to make a little utility I wrote, named vgrep, Unicode-aware. It was quite a hit-or-miss transformation. See Gtk2 Visual Grep

      I had to add the -CS perlrun switch, use the utf8::all module, and even after all that, I still needed to use Encode::decode() in many places to get the desired output.

      Even though my Linux locale is set to en_US.UTF-8 in my .bashrc, I still needed to run input strings and filenames through decode. I'm using Perl 5.14.1.

      It works, but it definitely seems to my sensibilities that it should be simpler. I guess the problem comes from having many files and filenames coming in through the net, or left over from previous Latin-1 Linux installations, which are not UTF-8.

      The general rule I seem to be seeing is "treat all input as binary", then decode. My vgrep program still emits some errors when searching through pdf files, which the -T test detects as text even though they contain binary images; and I don't understand why File::Find doesn't automatically see Unicode filenames, without having to decode $File::Find::name.
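
The "treat all input as binary, then decode" rule can be sketched roughly like this. The helper name decode_name is hypothetical, and the Latin-1 fallback is an assumption based on the leftover Latin-1 installations mentioned above; other legacy encodings would need a different fallback.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode a filename's raw bytes as UTF-8, falling
# back to Latin-1 (an assumed legacy encoding) when they are not valid UTF-8.
sub decode_name {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed UTF-8; eval traps that.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars if defined $chars;
    # Every byte sequence is valid Latin-1, so this branch cannot fail.
    return decode('ISO-8859-1', $bytes);
}

my $utf8_name   = "caf\xC3\xA9.txt";   # "café.txt" stored as UTF-8
my $latin1_name = "caf\xE9.txt";       # "café.txt" stored as Latin-1

# Encode on the way out so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print decode_name($_), "\n" for $utf8_name, $latin1_name;
```

Both sample names print as café.txt. The fallback is a guess, not detection: a non-UTF-8 name in some other encoding would decode without error but show the wrong characters.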


      I'm not really a human, but I play one on earth.
      Old Perl Programmer Haiku ................... flash japh

        File::Find does not automatically "see" (or return) Unicode filenames, because Perl has no way to know that what the file system APIs return is UTF-8-encoded text. If you are certain that this is always the case, I guess you can put your own decode() wrapper around it, but I see it breaking in many situations where different filesystems with different filename encodings come together.
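
A decode() wrapper of that kind might look like the sketch below. The name find_decoded is hypothetical, and the temporary directory with one UTF-8-named file exists only to make the example self-contained. It keeps the raw bytes next to the decoded text, so names that are not valid UTF-8 can still be opened by their original bytes.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical wrapper: walk $top and return [raw_bytes, decoded_text] pairs.
# decoded_text is undef when the name is not valid UTF-8.
sub find_decoded {
    my ($top) = @_;
    my @found;
    find({
        no_chdir => 1,
        wanted   => sub {
            my $raw  = $File::Find::name;
            my $text = eval { decode('UTF-8', $raw, FB_CROAK | LEAVE_SRC) };
            push @found, [ $raw, $text ];
        },
    }, $top);
    return @found;
}

# Demo setup (assumption): a scratch directory holding "café.txt" as UTF-8 bytes.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/caf\xC3\xA9.txt" or die "open: $!";
close $fh;

binmode STDOUT, ':encoding(UTF-8)';
for my $pair (find_decoded($dir)) {
    my ($raw, $text) = @$pair;
    print defined $text ? "$text\n" : "(name is not valid UTF-8; raw bytes kept)\n";
}
```

Use the raw element for open, stat and unlink, and the decoded element for display and matching. That way a mixed-encoding tree degrades to "undecodable name" instead of breaking the walk.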

        A quick suggestion: You have -CS, but for a 'find', you might want to evaluate if -CSA would be a better choice for such a program.
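
For comparison, here is a rough in-script approximation of what -CSA arranges. It is a sketch, not an exact equivalent: the switch applies the :utf8 layer, whereas this uses the stricter :encoding(UTF-8), and decode_args is a hypothetical helper name. Use either the switch or code like this, not both, or the arguments get decoded twice.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Roughly the S part: UTF-8 layers on the three standard handles.
binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR;

# Roughly the A part: treat command-line arguments as UTF-8 text.
# Hypothetical helper; malformed bytes become U+FFFD with the default CHECK.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

my @args = decode_args(@ARGV);
print "argument: $_\n" for @args;
```

Note that neither -CSA nor this sketch touches what readdir or File::Find return; those names still arrive as raw bytes.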
