UTF-8 and readdir, etc.

jrw005 has asked for the wisdom of the Perl Monks concerning the following question:


Re: UTF-8 and readdir, etc.
by thanos1983 (Parson) on Jan 31, 2018 at 12:26 UTC

Welcome to the Monastery. I think you are looking for utf8 in directory and filenames. Read also dir/filenames and utf8 and also directories and charsets.

Hope this helps, BR.

Seeking for Perl wisdom...on the process of learning...not there...yet!


Re: UTF-8 and readdir, etc.
by kcott (Archbishop) on Feb 01, 2018 at 02:28 UTC

G'day John,

Welcome to the Monastery.

"Maybe I've overlooked the obvious (if so I apologise)."

Showing us your "simple perl program" and describing your actual problem with example input, output, error messages, and so on, would probably result in a better answer. As it is, we need to fall back to guesswork. I appreciate this is your first post, and I'm not trying to beat you over the head with the rule book, but please read "How do I post a question effectively?" and "Short, Self-Contained, Correct Example" to find out what sort of information to post in order to get the best answers.

One minor deviation from the information given in that first link: use '<pre>' blocks, instead of '<code>' blocks, for presenting Unicode data outside the 7-bit ASCII range. With '<code>' blocks, your Unicode characters will typically end up being shown as entity references, i.e. something like '&#NNNNNN;'. This won't happen with '<pre>' blocks; however, the drawbacks are that there's no "[download]" link, and you have to manually change special characters in your code and data (e.g. '<' and '&') to their entities (e.g. '&lt;' and '&amp;'). There's a list of these after the textarea where you write your post. For inline Unicode characters, e.g. inside a '<p>' or '<li>' block, I typically use '<tt>' instead of '<pre>': this is to avoid '<pre>' being forced into a block format by, for instance, a style sheet.

Other than the actual tree walking, which could be part of your problem, the following script ("pm_1208191_read_utf8_filenames.pl") performs the reading and writing tasks you specify.

#!/usr/bin/env perl -l

use strict;
use warnings;
use autodie;

my $dir = 'pm_1208191_utf8_filenames';
my $out = 'pm_1208191_utf8_filenames_listing.txt';

open my $fh, '>', $out;
opendir(my $dh, $dir);

print $fh $_ while readdir $dh;

Given this test directory I set up:

$ ls -al pm_1208191_utf8_filenames
total 0
drwxr-xr-x  7 ken staff 238 Feb  1 11:45 .
drwxr-xr-x 18 ken staff 612 Feb  1 11:34 ..
-rw-r--r--  1 ken staff   0 Feb  1 11:34 abc
-rw-r--r--  1 ken staff   0 Feb  1 11:36 åßç
-rw-r--r--  1 ken staff   0 Feb  1 11:38 αβγ
-rw-r--r--  1 ken staff   0 Feb  1 11:41 абг
-rw-r--r--  1 ken staff   0 Feb  1 11:45 ☿♃♄

Here's a sample run:

$ cat pm_1208191_utf8_filenames_listing.txt
cat: pm_1208191_utf8_filenames_listing.txt: No such file or directory
$ pm_1208191_read_utf8_filenames.pl
$ cat pm_1208191_utf8_filenames_listing.txt
.
..
abc
åßç
αβγ
абг
☿♃♄

As you can see, I didn't need any special encoding-type directives. I'm using Perl 5.26.0; MacOS 10.12.5; and I have 'LANG=en_AU.UTF-8' (normal setting).

In case you can't actually see some of those characters, here's a table of the filenames, the three codepoints used for each, and a link to the Unicode PDF code chart so you can see what they look like.

Filename	Codepoints	Code Chart (PDF link)
`abc`	`U+0061`, `U+0062`, `U+0063`	C0 Controls and Basic Latin
`åßç`	`U+00E5`, `U+00DF`, `U+00E7`	C1 Controls and Latin-1 Supplement
`αβγ`	`U+03B1`, `U+03B2`, `U+03B3`	Greek and Coptic
`абг`	`U+0430`, `U+0431`, `U+0433`	Cyrillic
`☿♃♄`	`U+263F`, `U+2643`, `U+2644`	Miscellaneous Symbols

Take a look at "Re: printing Unicode works for some characters but not all", which I wrote some months ago. This may shed some light on whatever problems you're encountering — clearly, this is one of those guesswork answers I mentioned earlier.

The open pragma statement you're looking for might be something like:

use open IO => qw{:encoding(UTF-8) :std};

Again, that's more guesswork as you haven't shown your script or adequately described your problem.
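If an encoding layer like the one above is in effect, the names returned by readdir (which are raw bytes) would need decoding first; otherwise each byte above 0x7F is treated as a separate character and gets encoded a second time on output. Here is a minimal sketch of that, assuming the filesystem stores names as UTF-8; list_names() and write_listing() are hypothetical helpers and not part of the script above.

```perl
use strict;
use warnings;
use autodie;
use Encode qw(decode);

# Hypothetical helper: read a directory and return its entries as
# character strings. Assumes the filesystem encodes names as UTF-8.
sub list_names {
    my ($dir) = @_;
    opendir(my $dh, $dir);
    # readdir hands back raw bytes; decode turns them into characters.
    my @names = map { decode('UTF-8', $_) } grep { !/^\.\.?$/ } readdir $dh;
    closedir $dh;
    my @sorted = sort @names;
    return @sorted;
}

# Hypothetical helper: write one name per line as UTF-8.
sub write_listing {
    my ($out, @names) = @_;
    # The :encoding layer converts characters back into UTF-8 bytes.
    open my $fh, '>:encoding(UTF-8)', $out;
    print {$fh} "$_\n" for @names;
    close $fh;
    return;
}
```

Typical use would be something like write_listing('listing.txt', list_names('some_dir')). If you never inspect or modify the names, skipping both the decode and the layer (as in the script above) gives the same bytes with less work.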

— Ken


Re^2: UTF-8 and readdir, etc.

by Anonymous Monk on Feb 01, 2018 at 09:14 UTC

Some notes: Will this work if $dir has at least one character >= 128? I think it needs 'use utf8;' and encoding it to the filesystem encoding before opendir call. Will this work when all three encodings (filesystem, result file, code source file) aren't the same (Windows for example)? I doubt it


Re^3: UTF-8 and readdir, etc.

by kcott (Archbishop) on Feb 02, 2018 at 01:22 UTC

"Will this work if $dir has at least one character >= 128?"

If "128" refers to the return value of "ord($character)" (see ord), my test data uses such characters. If you meant something else, please explain.

"I think it needs 'use utf8;'"

The source code I provided is written entirely using 7-bit ASCII characters. The utf8 pragma is definitely not required here. I suggest you read that documentation, paying particular attention to this part (which it shows in bold text):

"Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."

"Will this work when all three encodings (filesystem, result file, code source file) aren't the same (Windows for example)?"

The OP stated that "The host OS is Linux, and is configured to use UTF-8 for filenames; the contents of the output file are also encoded as UTF-8.".
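To illustrate the point about the utf8 pragma: it only changes how Perl reads string literals in the source file. The following is a small sketch, assuming the file itself is saved as UTF-8; the variable names are just for illustration.

```perl
use strict;
use warnings;

# Without the pragma, the literal is read as the raw bytes of the source file.
my $without = do { no utf8; 'αβγ' };

# With the pragma (lexically scoped to this block), the literal is read
# as three characters.
my $with = do { use utf8; 'αβγ' };

printf "without utf8: length %d\n", length $without;   # 6 if the file is UTF-8
printf "with utf8:    length %d\n", length $with;      # 3
```

Neither case says anything about filenames coming back from readdir; those are unaffected by the pragma.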

— Ken


Re^4: UTF-8 and readdir, etc.

by Anonymous Monk on Feb 02, 2018 at 18:51 UTC

Re: UTF-8 and readdir, etc.
by IB2017 (Pilgrim) on Jan 31, 2018 at 18:15 UTC

Hello jrw005. This is the subroutine I use to read the file names in a directory with Unicode names (it uses Win32::Unicode::Dir):

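use Win32::Unicode::Dir;

sub ReadDir {
    my $Directory = shift;
    my @Documents;
    print "Reading directory's content $Directory\n";
    my $wdir = Win32::Unicode::Dir->new;

    # Report the module's own error message if the directory can't be opened.
    $wdir->open($Directory) || die $wdir->error;
    for ($wdir->fetch) {
        next if /^\.{1,2}$/;    # skip '.' and '..'
        push @Documents, $_;
    }
    $wdir->close || die $wdir->error;
    return \@Documents;         # reference to the list of names
}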


Re^2: UTF-8 and readdir, etc.

by Anonymous Monk on Jan 31, 2018 at 18:39 UTC

use Encode::Locale ();
use Encode ();

opendir my $dh, '/some/dir' or die $!;
while ( my $fn = readdir $dh ) {
  $fn = Encode::decode( locale_fs => $fn ); 
  # $fn is just a filename as char stream (unicode)
}
closedir $dh;


Re^3: UTF-8 and readdir, etc.

by Anonymous Monk on Jan 31, 2018 at 18:45 UTC

also instead of

opendir my $dh, '/some/dir' or die $!;

it's better to have

my $dir = '/some/dir'; # unicode
# get it as fs representation
my $fs_dir = Encode::encode( locale_fs => $dir );
opendir my $dh, $fs_dir or die $!;


Re: UTF-8 and readdir, etc.
by karlgoethebier (Abbot) on Jan 31, 2018 at 15:35 UTC

See Path::Tiny and Path::Iterator::Rule for some alternatives. And Encoding horridness revisited: What's going on here? [SOLVED] might be helpful as well.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help


Re^2: UTF-8 and readdir, etc.

by ikegami (Patriarch) on Jan 31, 2018 at 18:05 UTC

What makes you think that either of those modules will help?


Re^3: UTF-8 and readdir, etc.

by karlgoethebier (Abbot) on Feb 01, 2018 at 08:41 UTC

My intuition.

«The Crux of the Biscuit is the Apostrophe»

perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help


Re^4: UTF-8 and readdir, etc.

by ikegami (Patriarch) on Feb 01, 2018 at 17:41 UTC

Re^5: UTF-8 and readdir, etc.

by karlgoethebier (Abbot) on Feb 01, 2018 at 20:14 UTC

Re: UTF-8 and readdir, etc.
by andal (Hermit) on Feb 01, 2018 at 08:25 UTC

As kcott has already pointed out, you don't need any special handling to only read directory content and then write it to a file. I just hope to add a little explanation to this stuff.

Effectively, there are "bytes" and "characters". With "bytes", each element is just a number from 0 to 255. With "characters", each value is a code point (Unicode defines the range 0 to 0x10FFFF, although Perl itself accepts larger values) that corresponds to some image which can be drawn on screen or paper. "Encoding", "Unicode", "locale" and related notions describe how to convert "bytes" into "characters" and back. A "string" can be either a sequence of "bytes" or a sequence of "characters".
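As a small illustration of the bytes/characters distinction, here is a sketch using the core Encode module. The byte values are the UTF-8 encoding of the name "αβγ" from kcott's test directory; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six bytes, as they might arrive from readdir for the file named "αβγ".
my $bytes = "\xCE\xB1\xCE\xB2\xCE\xB3";

my $chars = decode('UTF-8', $bytes);   # bytes -> three characters
my $again = encode('UTF-8', $chars);   # characters -> the same six bytes

printf "bytes: %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
printf "chars: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;
```

The same data has length 6 as a byte string and length 3 as a character string, which is why it matters to know which of the two you are holding.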

Exchange between a Perl program and the OS happens only in "bytes". Inside Perl one can work either with "bytes" or with "characters", but the results will differ. For example, a regular expression that is meant to match Unicode characters will fail when applied to "bytes", but will work when applied to "characters". So, if you only get data from the OS and immediately hand it back to the OS, you don't have to bother converting "bytes" to "characters"; it would just be a waste of time. On the other hand, if your code has "use utf8", then every string literal containing non-ASCII characters is automatically stored as "characters" in Perl. If you decide to pass such a literal to the OS, you must first convert it from "characters" to "bytes". That is why some people described such a procedure for opendir here.
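A rough sketch of both directions, again assuming UTF-8 on the OS side. The \p{Greek} pattern is just one example of a regular expression that expects characters; the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xCE\xB1\xCE\xB2\xCE\xB3";   # "αβγ" as bytes from the OS
my $chars = decode('UTF-8', $bytes);      # the same name as characters

# The pattern asks for Greek letters. Applied to bytes it sees six
# Latin-1 code points instead, so only the decoded string matches.
my $bytes_match = $bytes =~ /^\p{Greek}+$/ ? 1 : 0;
my $chars_match = $chars =~ /^\p{Greek}+$/ ? 1 : 0;
print "bytes match: $bytes_match, chars match: $chars_match\n";

# The other direction: a character string (for example a literal under
# "use utf8") has to be encoded before it is handed to the OS.
my $dir_chars = "\x{3b1}\x{3b2}\x{3b3}";
my $dir_bytes = encode('UTF-8', $dir_chars);
# opendir(my $dh, $dir_bytes) would now receive the bytes the OS expects.
```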

There are different ways to do the conversion. One can use the Encode module directly, pass ":encoding" to the open function, or use some other way. But one has to clearly understand why the conversion is done, and whether it is needed at all.
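For comparison, here is a sketch of the two ways mentioned above. It writes to in-memory filehandles so that the results can be compared directly; in a real script the targets would be files.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{3b1}\x{3b2}\x{3b3}\n";   # "αβγ" plus a newline, as characters

# Way 1: encode explicitly and write the bytes to a raw handle.
open my $raw, '>:raw', \my $buf1 or die "open failed: $!";
print {$raw} encode('UTF-8', $chars);
close $raw;

# Way 2: let an :encoding layer on the handle do the conversion.
open my $layered, '>:encoding(UTF-8)', \my $buf2 or die "open failed: $!";
print {$layered} $chars;
close $layered;

print "same bytes either way\n" if $buf1 eq $buf2;
```

Doing both at once (encoding explicitly and then printing through an :encoding layer) is the usual cause of double-encoded output.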


Re: UTF-8 and readdir, etc.
by afoken (Chancellor) on Feb 01, 2018 at 20:09 UTC

The subthread starting at Re^5: any use of 'use locale'? (source encoding) has some more information about Unicode, filesystems, and APIs that might be helpful.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)


Re: UTF-8 and readdir, etc.
by Anonymous Monk on Sep 12, 2019 at 22:24 UTC

The comments here are appallingly ignorant, and sadly the perl implementation on Windows follows suit. NTFS filenames are encoded in UTF-16, and perl *could* handle that correctly, but it doesn't. So you have to use something like Win32::Unicode, or if you're using cygwin (as I am), you have to use decode_utf8 when reading directories. Note that File::Find doesn't know this, so that's not usable on Windows.


Re^2: UTF-8 and readdir, etc.

by Your Mother (Archbishop) on Sep 12, 2019 at 22:52 UTC

Ignorance, and terrible design, abounds…

NTFS stores file names in Unicode. -- The Horse's Mouth :(

…And…

NTFS allows any sequence of 16-bit values for name encoding (file names, stream names, index names, etc.) except 0x0000. This means (case insensitive) UTF-16 code units are supported, but the file system does not check whether a sequence is valid UTF-16 (it allows any sequence of short values, not restricted to those in the Unicode standard) -- Wackypardia
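To make "valid UTF-16" concrete, here is a minimal sketch of the check that, per the quote above, the file system does not perform. is_valid_utf16() is a hypothetical helper that takes a list of 16-bit code units; it only looks at surrogate pairing and nothing else.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the given 16-bit code units form
# well-formed UTF-16 (every surrogate is correctly paired), else 0.
sub is_valid_utf16 {
    my @units = @_;
    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF) {
            # A high surrogate must be followed by a low surrogate.
            my $next = shift @units;
            return 0
                unless defined $next && $next >= 0xDC00 && $next <= 0xDFFF;
        }
        elsif ($u >= 0xDC00 && $u <= 0xDFFF) {
            # A low surrogate without a preceding high surrogate.
            return 0;
        }
    }
    return 1;
}
```

A name containing, say, the single unit 0xD800 followed by an ordinary letter would fail this check, and there is no Unicode string that it corresponds to.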


