http://www.perlmonks.org?node_id=1208224


in reply to UTF-8 and readdir, etc.

G'day John,

Welcome to the Monastery.

"Maybe I've overlooked the obvious (if so I apologise)."

Showing us your "simple perl program" and describing your actual problem with example input, output, error messages, and so on, would probably result in a better answer. As it is, we need to fall back to guesswork. I appreciate this is your first post, and I'm not trying to beat you over the head with the rule book, but please read "How do I post a question effectively?" and "Short, Self-Contained, Correct Example" to find out what sort of information to post in order to get the best answers.

One minor deviation from the information given in that first link: use '<pre>' blocks, instead of '<code>' blocks, for presenting Unicode data outside the 7-bit ASCII range. With '<code>' blocks, your Unicode characters will typically end up being shown as entity references, i.e. something like '&#NNNNNN;'. This won't happen with '<pre>' blocks; however, the drawbacks are that there's no "[download]" link, and you have to manually change special characters in your code and data (e.g. '<' and '&') to their entities (e.g. '&lt;' and '&amp;'). There's a list of these after the textarea where you write your post. For inline Unicode characters, e.g. inside a '<p>' or '<li>' block, I typically use '<tt>' instead of '<pre>': this avoids '<pre>' being forced into a block format by, for instance, a style sheet.

Other than the actual tree walking, which could be part of your problem, the following script ("pm_1208191_read_utf8_filenames.pl") performs the reading and writing tasks you specify.

#!/usr/bin/env perl -l

use strict;
use warnings;
use autodie;

my $dir = 'pm_1208191_utf8_filenames';
my $out = 'pm_1208191_utf8_filenames_listing.txt';

open my $fh, '>', $out;
opendir(my $dh, $dir);

print $fh $_ while readdir $dh;

Given this test directory I set up:

$ ls -al pm_1208191_utf8_filenames
total 0
drwxr-xr-x  7 ken staff 238 Feb  1 11:45 .
drwxr-xr-x 18 ken staff 612 Feb  1 11:34 ..
-rw-r--r--  1 ken staff   0 Feb  1 11:34 abc
-rw-r--r--  1 ken staff   0 Feb  1 11:36 åßç
-rw-r--r--  1 ken staff   0 Feb  1 11:38 αβγ
-rw-r--r--  1 ken staff   0 Feb  1 11:41 абг
-rw-r--r--  1 ken staff   0 Feb  1 11:45 ☿♃♄

Here's a sample run:

$ cat pm_1208191_utf8_filenames_listing.txt
cat: pm_1208191_utf8_filenames_listing.txt: No such file or directory
$ pm_1208191_read_utf8_filenames.pl
$ cat pm_1208191_utf8_filenames_listing.txt
.
..
abc
åßç
αβγ
абг
☿♃♄

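As you can see, I didn't need any special encoding-type directives: readdir returns each filename as a string of bytes, and print writes those bytes out unchanged, so the UTF-8 passes straight through. I'm using Perl 5.26.0 on macOS 10.12.5, with 'LANG=en_AU.UTF-8' (my normal setting).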
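For the tree walking I left out of the script above, the core File::Find module is one option. The following is only a sketch, not necessarily how your program does it: the list_tree name is hypothetical, and paths are left as the raw bytes the filesystem returns, so again no encoding directives are involved.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect the path of every entry under $top,
# including $top itself. Paths stay as raw bytes; nothing is decoded.
sub list_tree {
    my ($top) = @_;
    my @paths;
    find(
        {
            wanted   => sub { push @paths, $File::Find::name },
            no_chdir => 1,    # don't chdir into each directory while walking
        },
        $top,
    );
    my @sorted = sort @paths;
    return @sorted;
}

# Walk the directory given on the command line, or '.' by default.
my $dir = @ARGV ? $ARGV[0] : '.';
print "$_\n" for list_tree($dir);
```

Redirect the output to a file, or print to a filehandle as in my script, to get the listing on disk.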

In case you can't actually see some of those characters, here's a table of the filenames, the three codepoints used for each, and a link to the Unicode PDF code chart so you can see what they look like.

Filename  Codepoints              Code Chart (PDF link)
abc       U+0061, U+0062, U+0063  C0 Controls and Basic Latin
åßç       U+00E5, U+00DF, U+00E7  C1 Controls and Latin-1 Supplement
αβγ       U+03B1, U+03B2, U+03B3  Greek and Coptic
абг       U+0430, U+0431, U+0433  Cyrillic
☿♃♄       U+263F, U+2643, U+2644  Miscellaneous Symbols
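If you want to check the codepoints of your own filenames against that table, you'll need to decode the bytes from readdir into characters first. Here's a minimal sketch, assuming the names are stored as UTF-8; the codepoints sub is a hypothetical helper, not something from your script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw{decode};

# Hypothetical helper: given a filename as raw UTF-8 bytes,
# return its codepoints formatted as 'U+XXXX'.
sub codepoints {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# List the directory given on the command line, or '.' by default.
my $dir = @ARGV ? $ARGV[0] : '.';
opendir(my $dh, $dir) or die "Can't open directory '$dir': $!";
while (defined(my $name = readdir $dh)) {
    next if $name eq '.' or $name eq '..';
    # The output is pure ASCII, so STDOUT needs no encoding layer.
    print join(', ', codepoints($name)), "\n";
}
closedir $dh;
```

Be aware that some filesystems normalise names, so the codepoints you get back may differ from the ones you typed when creating the file.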

Take a look at "Re: printing Unicode works for some characters but not all", which I wrote some months ago. This may shed some light on whatever problems you're encountering — clearly, this is one of those guesswork answers I mentioned earlier.

The open pragma statement you're looking for might be something like:

use open IO => qw{:encoding(UTF-8) :std};

Again, that's more guesswork as you haven't shown your script or adequately described your problem.
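To illustrate what that pragma would do, here's a small sketch; the round_trip sub and the file name are made up for the demonstration. Handles opened in the pragma's lexical scope get a UTF-8 layer by default, so character strings are encoded on the way out and decoded on the way back in.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use open IO => qw{:encoding(UTF-8) :std};

# Hypothetical helper: write $text to $file, read it back, delete the file.
# Returns the text as read and the size of the file in bytes.
sub round_trip {
    my ($file, $text) = @_;

    # No explicit layer here: the pragma supplies :encoding(UTF-8).
    open(my $out_fh, '>', $file) or die "Can't write '$file': $!";
    print {$out_fh} $text;
    close($out_fh) or die "Can't close '$file': $!";

    my $bytes = -s $file;

    # The same default layer decodes the bytes on input.
    open(my $in_fh, '<', $file) or die "Can't read '$file': $!";
    my $read_back = do { local $/; <$in_fh> };
    close $in_fh;

    unlink $file;
    return ($read_back, $bytes);
}

# Three Greek characters (U+03B1, U+03B2, U+03B3) plus a newline.
my ($text, $size) = round_trip('pm_demo_open_pragma.txt', "\x{3B1}\x{3B2}\x{3B3}\n");
printf "read back %d characters from a %d byte file\n", length($text), $size;
```

Note that this is the opposite situation to my script above: there the names stayed as bytes throughout, whereas here the strings are characters, which is where an encoding layer comes into play.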

— Ken