http://www.perlmonks.org?node_id=1005956


in reply to Perl / FileFind or ...

What makes you think that handling non-ASCII characters in path/file names should be simple? I suppose that if you have intimate knowledge about the OS you're using, and about the file system installed on the specific disk volume you're using, and about the capabilities of the particular terminal/browser/other application that is trying to display file name strings on your monitor, and about the environment/configuration settings that control the behavior of that application, and about the process(es) that created the file names on that specific disk volume in the first place, then you might know enough for the handling of non-ASCII file names to seem "simple."

But if you lack intimate knowledge on any of those topics, your first resort should be to get a hex-dump view of the byte sequences being used in any given file name string. That way, all you need is a general knowledge of the possible non-ASCII character encodings, and perhaps some presupposition about the (human) language being used by the person who assigned the file name (or at least, some sense of the alphabet being used - Cyrillic? Greek? Latin? Arabic? ... - including the range of diacritic marks, odd-ball punctuation and/or special symbols that are likely to show up). Not that this in itself is "simple", but at least there are fewer moving parts.
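
As a rough sketch of that first step, something like the following shows every byte of a name as two hex digits. The helper name hex_dump_name and the sample string are made up for illustration, and the code assumes the name is a raw byte string, as you would get from readdir or File::Find:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Find): return the bytes of a
# file name as space-separated two-digit hex values.
# Assumes $name is a raw byte string, not a decoded character string.
sub hex_dump_name {
    my ($name) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $name;
}

# Sample name: "cafe" with an accented e stored as the two bytes c3 a9.
print hex_dump_name("caf\xc3\xa9.txt"), "\n";
# prints: 63 61 66 c3 a9 2e 74 78 74
```

Seeing "c3 a9" where you expected one accented letter is a strong hint that the name is stored as UTF-8; a single byte "e9" in the same spot would point toward Latin-1 or cp1252 instead.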

Obviously, getting a hex-dump style output just gets in the way when file paths contain nothing outside the printable ASCII range, so a useful elaboration of your File::Find callback might go something like this:

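sub cbFileFind {
    my $printable_name = $File::Find::name;
    # Replace every character outside printable ASCII (space through tilde)
    # with a \x{..} escape showing its hex value.
    $printable_name =~ s/([^ -~])/sprintf("\\x{%02x}",ord($1))/eg;
    print $printable_name, "\n";
}
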
If you happen to already know (or if the approach just shown makes it clear) what the particular character encoding is for the non-ASCII portions of your file names, you can use Encode to convert (decode) the strings as read from the file system into perl-internal (utf8) encoding. After that, the "ord()" function will return Unicode code-point numbers, which you can look up in case the particular characters are unfamiliar to you (check out Re: Regular expressions and accents and tlu -- TransLiterate Unicode).
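
Here is a minimal sketch of that decode-then-ord() idea. The helper name non_ascii_codepoints is invented for this example, and the encoding name is something you have to supply yourself based on what the hex view suggests; nothing here detects it for you:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw file-name bytes using a caller-supplied
# encoding name, then list the code points of the non-ASCII characters.
sub non_ascii_codepoints {
    my ($raw_name, $encoding) = @_;
    # Work on a copy, because decode() may modify its input when a CHECK
    # value is given. FB_CROAK makes decode() die on malformed input
    # rather than quietly substituting replacement characters.
    my $copy  = $raw_name;
    my $chars = decode($encoding, $copy, Encode::FB_CROAK());
    return map  { sprintf 'U+%04X', ord($_) }
           grep { ord($_) > 0x7f }
           split //, $chars;
}

# The same bytes give different answers under different assumed encodings.
print join(' ', non_ascii_codepoints("caf\xc3\xa9.txt", 'UTF-8')), "\n";
# prints: U+00E9
print join(' ', non_ascii_codepoints("caf\xc3\xa9.txt", 'iso-8859-1')), "\n";
# prints: U+00C3 U+00A9
```

If the UTF-8 call dies, that in itself is useful information: the bytes are not valid UTF-8, so some single-byte encoding is the more likely candidate.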