UTF-8 and readdir, etc.

by jrw005 (Initiate)
on Jan 31, 2018 at 12:08 UTC ( [id://1208191]=perlquestion )

jrw005 has asked for the wisdom of the Perl Monks concerning the following question:

Maybe I've overlooked the obvious (if so I apologise).

I have a simple perl program that walks a directory tree and - amongst other things - records each file and directory name in a text file. The host OS is Linux, and is configured to use UTF-8 for filenames; the contents of the output file are also encoded as UTF-8. I realise there are many pitfalls when dealing with Unicode and UTF-8, and this is obviously not dealing with a general case - but I would think it's a fairly commonplace requirement.

So, firstly, is there something analogous to "use open ':encoding(utf8)';" that will have an equivalent effect on the result(s) of "readdir(...);" or will I need to use "decode_utf8(readdir(...))" every time instead?

And secondly, is there likewise a way for a perl program to be notified that system calls that access the host filesystem - for example open(), stat(), readlink() - should encode filenames as UTF-8?

John

Replies are listed 'Best First'.
Re: UTF-8 and readdir, etc.
by thanos1983 (Parson) on Jan 31, 2018 at 12:26 UTC
Re: UTF-8 and readdir, etc.
by kcott (Archbishop) on Feb 01, 2018 at 02:28 UTC

    G'day John,

    Welcome to the Monastery.

    "Maybe I've overlooked the obvious (if so I apologise)."

    Showing us your "simple perl program" and describing your actual problem with example input, output, error messages, and so on, would probably result in a better answer. As it is, we need to fall back to guesswork. I appreciate this is your first post, and I'm not trying to beat you over the head with the rule book, but please read "How do I post a question effectively?" and "Short, Self-Contained, Correct Example" to find out what sort of information to post in order to get the best answers.

    One minor deviation from the information given in that first link: use '<pre>' blocks, instead of '<code>' blocks, for presenting Unicode data outside the 7-bit ASCII range. With '<code>' blocks, your Unicode characters will typically end up being shown as entity references, i.e. something like '&#NNNNNN;'. This won't happen with '<pre>' blocks; however, the drawbacks are there's no "[download]" link, and you have to manually change special characters in your code and data (e.g. '<' and '&') to their entities (e.g. '&lt;' and '&amp;') — there's a list of these after the textarea where you write your post. For inline Unicode characters, e.g. inside a '<p>' or '<li>' block, I typically use '<tt>' instead of '<pre>': this is to avoid '<pre>' being forced into a block format by, for instance, a style sheet.

    Other than the actual tree walking, which could be part of your problem, the following script ("pm_1208191_read_utf8_filenames.pl") performs the reading and writing tasks you specify.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;
    use autodie;

    my $dir = 'pm_1208191_utf8_filenames';
    my $out = 'pm_1208191_utf8_filenames_listing.txt';

    open my $fh, '>', $out;
    opendir(my $dh, $dir);
    print $fh $_ while readdir $dh;

    Given this test directory I set up:

    $ ls -al pm_1208191_utf8_filenames
    total 0
    drwxr-xr-x  7 ken staff 238 Feb  1 11:45 .
    drwxr-xr-x 18 ken staff 612 Feb  1 11:34 ..
    -rw-r--r--  1 ken staff   0 Feb  1 11:34 abc
    -rw-r--r--  1 ken staff   0 Feb  1 11:36 åßç
    -rw-r--r--  1 ken staff   0 Feb  1 11:38 αβγ
    -rw-r--r--  1 ken staff   0 Feb  1 11:41 абг
    -rw-r--r--  1 ken staff   0 Feb  1 11:45 ☿♃♄
    

    Here's a sample run:

    $ cat pm_1208191_utf8_filenames_listing.txt
    cat: pm_1208191_utf8_filenames_listing.txt: No such file or directory
    $ pm_1208191_read_utf8_filenames.pl
    $ cat pm_1208191_utf8_filenames_listing.txt
    .
    ..
    abc
    åßç
    αβγ
    абг
    ☿♃♄
    

    As you can see, I didn't need any special encoding-type directives. I'm using Perl 5.26.0; MacOS 10.12.5; and I have 'LANG=en_AU.UTF-8' (normal setting).

    In case you can't actually see some of those characters, here's a table of the filenames, the three codepoints used for each, and a link to the Unicode PDF code chart so you can see what they look like.

    Filename  Codepoints              Code Chart (PDF link)
    abc       U+0061, U+0062, U+0063  C0 Controls and Basic Latin
    åßç       U+00E5, U+00DF, U+00E7  C1 Controls and Latin-1 Supplement
    αβγ       U+03B1, U+03B2, U+03B3  Greek and Coptic
    абг       U+0430, U+0431, U+0433  Cyrillic
    ☿♃♄       U+263F, U+2643, U+2644  Miscellaneous Symbols

    Take a look at "Re: printing Unicode works for some characters but not all", which I wrote some months ago. This may shed some light on whatever problems you're encountering — clearly, this is one of those guesswork answers I mentioned earlier.

    The open pragma statement you're looking for might be something like:

    use open IO => qw{:encoding(UTF-8) :std};

    Again, that's more guesswork as you haven't shown your script or adequately described your problem.

    — Ken

      Some notes: Will this work if $dir has at least one character >= 128? I think it needs 'use utf8;' and the name must be encoded to the filesystem encoding before the opendir call. Will this work when the three encodings (filesystem, result file, source file) aren't all the same (Windows, for example)? I doubt it.

        "Will this work if $dir has at least one character >= 128?"

        If "128" refers to the return value of "ord($character)" (see ord), my test data uses such characters. If you meant something else, please explain.

        "I think it needs 'use utf8;'"

        The source code I provided is written entirely using 7-bit ASCII characters. The utf8 pragma is definitely not required here. I suggest you read that documentation, paying particular attention to this part (which it shows in bold text):

        "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."

        "Will this work when all three encodings (filesystem, result file, code source file) aren't the same (Windows for example)?"

        The OP stated that "The host OS is Linux, and is configured to use UTF-8 for filenames; the contents of the output file are also encoded as UTF-8.".

        — Ken

Re: UTF-8 and readdir, etc.
by IB2017 (Pilgrim) on Jan 31, 2018 at 18:15 UTC

    Hello jrw005. This is the subroutine I use to read the file names in a directory with Unicode names (it needs "use Win32::Unicode::Dir"):

    sub ReadDir {
        my $Directory = shift;
        my @Documents;
        print "Reading directory's content $Directory\n";
        my $wdir = Win32::Unicode::Dir->new;
        $wdir->open($Directory) || die;
        for ($wdir->fetch) {
            next if /^\.{1,2}$/;
            push(@Documents, $_);
        }
        $wdir->close || dieW $wdir->error;
        return \@Documents;
    }

      use Encode::Locale ();
      use Encode ();

      opendir my $dh, '/some/dir' or die $!;
      while ( my $fn = readdir $dh ) {
          $fn = Encode::decode( locale_fs => $fn );
          # $fn is just a filename as char stream (unicode)
      }
      closedir $dh;

        also instead of

        opendir my $dh, '/some/dir' or die $!;

        it's better to have

        my $dir = '/some/dir'; # unicode
        # get it as fs representation
        my $fs_dir = Encode::encode( locale_fs => $dir );
        opendir my $dh, $fs_dir or die $!;

Re: UTF-8 and readdir, etc.
by karlgoethebier (Abbot) on Jan 31, 2018 at 15:35 UTC

      What makes you think that either of those modules will help?

        My intuition.

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: UTF-8 and readdir, etc.
by andal (Hermit) on Feb 01, 2018 at 08:25 UTC

    As kcott has already pointed out, you don't need any special handling to only read directory content and then write it to a file. I just hope to add a little explanation to this stuff.

    Effectively, there are "bytes" and "characters". In "bytes", each element is just a number from 0 to 255. In "characters", the values range from 0 to 0xFFFFFFFF, and each corresponds to some image that can be drawn on screen or paper. "Encoding", "Unicode", "locale" and related machinery describe how to convert "bytes" into "characters" and back. A "string" can be either a sequence of "bytes" or a sequence of "characters".
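
    A minimal sketch of that distinction, using only the core Encode module. It is an illustration rather than part of the original reply: the variable names are made up, and the byte string is simply the UTF-8 encoding of the test filename "αβγ" from earlier in the thread, written with escapes so the source stays pure ASCII.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes of "αβγ" (U+03B1, U+03B2, U+03B3), as readdir would
# return them on a filesystem that stores names in UTF-8.
my $bytes = "\xCE\xB1\xCE\xB2\xCE\xB3";

# bytes -> characters
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;    # counts octets
my $char_len = length $chars;    # counts code points

# A Unicode-aware pattern only matches the decoded string.
my $bytes_match = $bytes =~ /\A\p{Greek}{3}\z/ ? 1 : 0;
my $chars_match = $chars =~ /\A\p{Greek}{3}\z/ ? 1 : 0;

# characters -> bytes, ready to be handed back to the OS
my $round_trip = encode('UTF-8', $chars);
my $same_bytes = $round_trip eq $bytes ? 1 : 0;

print "bytes: $byte_len, chars: $char_len\n";                        # bytes: 6, chars: 3
print "regex on bytes: $bytes_match, on chars: $chars_match\n";      # 0 and 1
print "round trip gives the original bytes: $same_bytes\n";          # 1
```

    The same name is six "bytes" but three "characters", and the \p{Greek} test only succeeds once the bytes have been decoded.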

    Exchange between a perl program and the OS happens only in "bytes". Inside perl one can work with either "bytes" or "characters", but the results will differ. For example, a regular expression that is supposed to match Unicode characters will fail when applied to "bytes", but will work when applied to "characters". So, if you only get data from the OS and immediately hand it back to the OS, you don't have to bother converting "bytes" to "characters"; it would just be a waste of time. On the other hand, if your code has "use utf8", then all your string literals are automatically presented to perl as "characters". If you then pass such a literal to the OS, you must convert it from "characters" to "bytes". That is why some people described that procedure for opendir here.
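
    The conversion in both directions could be sketched like this. It is a hedged example, not code from the thread: it assumes the filesystem stores names as UTF-8 (as the OP stated), works in a throwaway File::Temp directory, and all variable names are hypothetical. The character strings are written with \x{...} escapes, which produce the same kind of "characters" that a literal under "use utf8" would.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);
use File::Spec;

my $root = tempdir(CLEANUP => 1);

# Names held as characters inside the program.
my $dir_chars  = "\x{3B1}\x{3B2}\x{3B3}";    # αβγ
my $file_chars = "\x{430}\x{431}\x{433}";    # абг

# characters -> bytes before anything is passed to the OS
# (assumption: the filesystem encoding is UTF-8)
my $dir_bytes  = File::Spec->catdir($root, encode('UTF-8', $dir_chars));
my $file_bytes = File::Spec->catfile($dir_bytes, encode('UTF-8', $file_chars));

mkdir($dir_bytes) or die "mkdir failed: $!";
open(my $fh, '>', $file_bytes) or die "open failed: $!";
close($fh) or die "close failed: $!";

# bytes -> characters for everything the OS hands back
opendir(my $dh, $dir_bytes) or die "opendir failed: $!";
my @names = map  { decode('UTF-8', $_) }
            grep { $_ ne '.' && $_ ne '..' }
            readdir $dh;
closedir $dh;

my $found = (@names == 1 && $names[0] eq $file_chars) ? 1 : 0;
print "entries: ", scalar(@names), ", matches original characters: $found\n";
```

    If the names are only copied from readdir to a raw output file, as in kcott's script, neither step is needed.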

    There are different ways to do the conversion. One can use the Encode module directly, pass ":encoding" to the open function, or use some other way. But one has to understand clearly why the conversion is done, and whether it is needed at all.
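
    As an illustration of combining two of those ways, here is a sketch in which the output handle gets an ":encoding(UTF-8)" layer from the open pragma, so the names from readdir are decoded with Encode first. Again this is an assumption-laden example and not code from the thread: it assumes a UTF-8 filesystem, uses a temporary directory, and the file name "listing.txt" is made up. Without the decode step, the layer would encode the already-encoded bytes a second time.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempdir);
use File::Spec;

# Scratch directory with one file whose name is the UTF-8 bytes of "αβγ".
my $dir  = tempdir(CLEANUP => 1);
my $name = "\xCE\xB1\xCE\xB2\xCE\xB3";
{
    open(my $touch, '>', File::Spec->catfile($dir, $name)) or die "open failed: $!";
    close($touch) or die "close failed: $!";
}

my $listing = File::Spec->catfile($dir, 'listing.txt');

{
    # Default layer for handles opened in this block only.
    use open IO => ':encoding(UTF-8)';

    open(my $out, '>', $listing) or die "open failed: $!";
    opendir(my $dh, $dir) or die "opendir failed: $!";
    while (defined(my $entry = readdir $dh)) {
        next if $entry eq '.' or $entry eq '..' or $entry eq 'listing.txt';
        # readdir returns bytes; decode them so the layer encodes exactly once.
        print {$out} decode('UTF-8', $entry), "\n";
    }
    closedir $dh;
    close($out) or die "close failed: $!";
}

# Read the listing back as raw bytes and compare with the on-disk name.
open(my $in, '<:raw', $listing) or die "open failed: $!";
my $written = do { local $/; <$in> };
close($in);

my $ok = $written eq "$name\n" ? 1 : 0;
print "listing holds the original UTF-8 bytes: $ok\n";
```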

Re: UTF-8 and readdir, etc.
by afoken (Chancellor) on Feb 01, 2018 at 20:09 UTC

    The subthread starting at Re^5: any use of 'use locale'? (source encoding) has some more information about Unicode, filesystems, and APIs that might be helpful.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: UTF-8 and readdir, etc.
by Anonymous Monk on Sep 12, 2019 at 22:24 UTC
    The comments here are appallingly ignorant, and sadly the perl implementation on Windows follows suit. NTFS filenames are encoded in UTF-16, and perl *could* handle that correctly, but it doesn't. So you have to use something like Win32::Unicode, or if you're using cygwin (as I am), you have to use decode_utf8 when reading directories. Note that File::Find doesn't know this, so that's not usable on Windows.

      Ignorance, and terrible design, abounds–

      NTFS stores file names in Unicode. –The Horse’s Mouth :(

      –and–

      NTFS allows any sequence of 16-bit values for name encoding (file names, stream names, index names, etc.) except 0x0000. This means (case insensitive) UTF-16 code units are supported, but the file system does not check whether a sequence is valid UTF-16 (it allows any sequence of short values, not restricted to those in the Unicode standard) –Wackypardia
