
Re: UTF-8 and readdir, etc.

by kcott (Chancellor)
on Feb 01, 2018 at 02:28 UTC (#1208224)

in reply to UTF-8 and readdir, etc.

G'day John,

Welcome to the Monastery.

"Maybe I've overlooked the obvious (if so I apologise)."

Showing us your "simple perl program" and describing your actual problem with example input, output, error messages, and so on, would probably result in a better answer. As it is, we need to fall back to guesswork. I appreciate this is your first post, and I'm not trying to beat you over the head with the rule book, but please read "How do I post a question effectively?" and "Short, Self-Contained, Correct Example" to find out what sort of information to post in order to get the best answers.

One minor deviation from the information given in that first link: use '<pre>' blocks, instead of '<code>' blocks, for presenting Unicode data outside the 7-bit ASCII range. With '<code>' blocks, your Unicode characters will typically end up being shown as entity references, i.e. something like '&#NNNNNN;'. This won't happen with '<pre>' blocks; however, the drawbacks are there's no "[download]" link, and you have to manually change special characters in your code and data (e.g. '<' and '&') to their entities (e.g. '&lt;' and '&amp;') — there's a list of these after the textarea where you write your post. For inline Unicode characters, e.g. inside a '<p>' or '<li>' block, I typically use '<tt>' instead of '<pre>': this is to avoid '<pre>' being forced into a block format by, for instance, a style sheet.

Other than the actual tree walking, which could be part of your problem, the following script performs the reading and writing tasks you specify.

#!/usr/bin/env perl -l
use strict;
use warnings;
use autodie;

my $dir = 'pm_1208191_utf8_filenames';
my $out = 'pm_1208191_utf8_filenames_listing.txt';

open my $fh, '>', $out;
opendir(my $dh, $dir);
print $fh $_ while readdir $dh;

Given this test directory I set up:

$ ls -al pm_1208191_utf8_filenames
total 0
drwxr-xr-x  7 ken staff 238 Feb  1 11:45 .
drwxr-xr-x 18 ken staff 612 Feb  1 11:34 ..
-rw-r--r--  1 ken staff   0 Feb  1 11:34 abc
-rw-r--r--  1 ken staff   0 Feb  1 11:36 åßç
-rw-r--r--  1 ken staff   0 Feb  1 11:38 αβγ
-rw-r--r--  1 ken staff   0 Feb  1 11:41 абг
-rw-r--r--  1 ken staff   0 Feb  1 11:45 ☿♃♄

Here's a sample run:

$ cat pm_1208191_utf8_filenames_listing.txt
cat: pm_1208191_utf8_filenames_listing.txt: No such file or directory
$ cat pm_1208191_utf8_filenames_listing.txt

As you can see, I didn't need any special encoding-type directives. I'm using Perl 5.26.0; MacOS 10.12.5; and I have 'LANG=en_AU.UTF-8' (normal setting).
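A minimal sketch of why no directives were needed, assuming the filesystem stores names as UTF-8 octets: readdir returns those octets unchanged, and a filehandle without an encoding layer writes them out unchanged, so the listing ends up as valid UTF-8 without Perl ever treating the names as characters. The helper name `describe_name` and the sample octets are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: compare the octet view of a filename (what readdir
# returns) with its character view (after decoding).
# Assumes the filesystem stores names as UTF-8 octets.
sub describe_name {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # octets -> characters
    return {
        octet_length => length $octets,
        char_length  => length $chars,
    };
}

# "\xCE\xB1\xCE\xB2\xCE\xB3" is the UTF-8 octet sequence for
# U+03B1 U+03B2 U+03B3, standing in for a name returned by readdir.
my $info = describe_name("\xCE\xB1\xCE\xB2\xCE\xB3");
printf "%d octets, %d characters\n",
    $info->{octet_length}, $info->{char_length};
# prints "6 octets, 3 characters"
```

Decoding only becomes necessary once the script needs character semantics, for example length, sorting by collation, or regex matching on the names.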

In case you can't actually see some of those characters, here's a table of the filenames, the three codepoints used for each, and a link to the Unicode PDF code chart so you can see what they look like.

Filename  Codepoints              Code Chart (PDF link)
abc       U+0061, U+0062, U+0063  C0 Controls and Basic Latin
åßç       U+00E5, U+00DF, U+00E7  C1 Controls and Latin-1 Supplement
αβγ       U+03B1, U+03B2, U+03B3  Greek and Coptic
абг       U+0430, U+0431, U+0433  Cyrillic
☿♃♄       U+263F, U+2643, U+2644  Miscellaneous Symbols
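The codepoints in the table can be checked with a short sketch like the following. It assumes the filename arrives as UTF-8 octets, as it would from readdir on a UTF-8 filesystem; the helper name `codepoints` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the codepoints of a filename given as UTF-8 octets.
sub codepoints {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # octets -> characters
    return join ', ', map { sprintf 'U+%04X', ord } split //, $chars;
}

print codepoints("abc"), "\n";
# prints "U+0061, U+0062, U+0063"

# UTF-8 octets for the three Miscellaneous Symbols characters in the table.
print codepoints("\xE2\x98\xBF\xE2\x99\x83\xE2\x99\x84"), "\n";
# prints "U+263F, U+2643, U+2644"
```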

Take a look at "Re: printing Unicode works for some characters but not all", which I wrote some months ago. This may shed some light on whatever problems you're encountering — clearly, this is one of those guesswork answers I mentioned earlier.

The open pragma statement you're looking for might be something like:

use open IO => qw{:encoding(UTF-8) :std};

Again, that's more guesswork as you haven't shown your script or adequately described your problem.
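One caveat, sketched here under assumptions: the open pragma sets default layers for filehandles, not directory handles, so names returned by readdir are still octets. If the output handle has an ':encoding(UTF-8)' layer, the names should be decoded first, otherwise the layer would encode the octets a second time. The helper name `write_listing`, the temporary paths, and the assumption of a UTF-8 filesystem are all illustrative. The ':std' part is left out only so the sketch does not touch the standard handles.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempdir);
use open IO => ':encoding(UTF-8)';   # default layer for handles opened in this file

# Hypothetical helper: write the names in $dir to $out, one per line.
# Assumes the filesystem stores names as UTF-8 octets.
sub write_listing {
    my ($dir, $out) = @_;
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    open(my $fh, '>', $out) or die "open $out: $!";   # layer comes from the pragma
    # Skip '.' and '..' to keep the demo output short.
    for my $octets (sort grep { !/^\.\.?$/ } readdir $dh) {
        # readdir is unaffected by the pragma: decode by hand so the
        # output layer encodes characters, not already-encoded octets.
        print {$fh} decode('UTF-8', $octets), "\n";
    }
    close($fh) or die "close $out: $!";
    closedir $dh;
    return;
}

# Demo in a throwaway directory containing one file with a Greek name.
my $tmp = tempdir(CLEANUP => 1);
mkdir "$tmp/names" or die "mkdir: $!";
open(my $touch, '>', "$tmp/names/\xCE\xB1\xCE\xB2\xCE\xB3") or die "create: $!";
close $touch;
write_listing("$tmp/names", "$tmp/listing.txt");
```

Without the pragma and without the decode step, the original script's byte pass-through gives the same file contents on a UTF-8 system; the two approaches should not be mixed.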

— Ken

Replies are listed 'Best First'.
Re^2: UTF-8 and readdir, etc.
by Anonymous Monk on Feb 01, 2018 at 09:14 UTC
    Some notes: Will this work if $dir has at least one character >= 128? I think it needs 'use utf8;' and encoding it to the filesystem encoding before the opendir call. Will this work when all three encodings (filesystem, result file, code source file) aren't the same (Windows for example)? I doubt it.
      "Will this work if $dir has at least one character >= 128?"

      If "128" refers to the return value of "ord($character)" (see ord), my test data uses such characters. If you meant something else, please explain.

      "I think it needs 'use utf8;'"

      The source code I provided is written entirely using 7-bit ASCII characters. The utf8 pragma is definitely not required here. I suggest you read that documentation, paying particular attention to this part (which it shows in bold text):

      "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."

      "Will this work when all three encodings (filesystem, result file, code source file) aren't the same (Windows for example)?"

      The OP stated that "The host OS is Linux, and is configured to use UTF-8 for filenames; the contents of the output file are also encoded as UTF-8.".

      — Ken

        What character in

        my $dir = 'pm_1208191_utf8_filenames';

        has ord() >= 128?

        I meant that 'use utf8;' is needed if there is an actual character with ord() >= 128 in the $dir string.

        The reason behind my post was that your suggestion isn't valid Unicode processing. It covers only one specific case, where the encodings of the filesystem, the result file, and the source code are the same. I just wanted to highlight that.
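The point about non-ASCII directory names can be sketched as follows: treat paths as character strings inside the program and encode them only at the filesystem boundary. Perl does not know the filesystem encoding, so `$FS_ENCODING` here is an assumption (UTF-8, as the OP described), and `list_dir` is a hypothetical helper. The "\x{...}" escapes produce the same character string that a literal αβγ would under 'use utf8;'.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: the filesystem expects names as UTF-8 octets.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helper: list a directory whose path is a character string.
sub list_dir {
    my ($dir_chars) = @_;
    my $dir_octets = encode($FS_ENCODING, $dir_chars);   # characters -> octets
    opendir(my $dh, $dir_octets) or die "opendir: $!";
    my @names = map  { decode($FS_ENCODING, $_) }        # octets -> characters
                grep { !/^\.\.?$/ } readdir $dh;
    closedir $dh;
    @names = sort @names;
    return @names;
}

# Demo: a temporary directory whose name contains U+03B1 U+03B2 U+03B3.
my $base      = decode($FS_ENCODING, tempdir(CLEANUP => 1));
my $dir_chars = "$base/\x{3b1}\x{3b2}\x{3b3}";
mkdir(encode($FS_ENCODING, $dir_chars)) or die "mkdir: $!";
open(my $touch, '>', encode($FS_ENCODING, "$dir_chars/abc")) or die "create: $!";
close $touch;

print "$_\n" for list_dir($dir_chars);
# prints "abc"
```

On a system where the filesystem, output file, and source encodings differ, each boundary would need its own encoding name in place of the single `$FS_ENCODING` assumed here.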
