in reply to Perl / FileFind or ...
But if you lack intimate knowledge of any of those topics, your first resort should be to get a hex-dump view of the byte sequences being used in any given file name string. That way, all you need is a general knowledge of the possible non-ASCII character encodings, and perhaps some idea of the (human) language being used by the person who assigned the file name (or at least, some sense of the alphabet being used - Cyrillic? Greek? Latin? Arabic? ... - including the range of diacritic marks, odd-ball punctuation and/or special symbols that are likely to show up). Not that this in itself is "simple", but at least there are fewer moving parts.
Obviously, hex-dump style output just gets in the way when file paths contain nothing outside the printable ASCII range, so a useful elaboration of your File::Find callback might go something like this:

    sub cbFileFind {
        my $printable_name = $File::Find::name;
        # Show anything outside printable ASCII (space through tilde) as a \x{..} escape
        $printable_name =~ s/([^ -~])/sprintf("\\x{%02x}",ord($1))/eg;
        print $printable_name, "\n";
    }

If you happen to already know (or if the approach just shown makes it clear) what the particular character encoding is for the non-ASCII portions of your file names, you can use Encode to convert (decode) the strings as read from the file system into perl-internal (utf8) encoding, and then the "ord()" function will return Unicode code-point numbers, which you can look up in case the particular characters are unfamiliar to you (check out Re: Regular expressions and accents and tlu -- TransLiterate Unicode).
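To make the decoding step concrete, here is a minimal sketch of what that might look like. The helper name non_ascii_codepoints and the sample file name are made up for illustration, and the two encodings tried are only guesses you would replace with whatever your hex dump suggests. It relies only on decode() and FB_CROAK from the standard Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of File::Find or Encode): decode the raw
# bytes of a file name under an ASSUMED encoding and return a "U+XXXX"
# label for each character outside the printable ASCII range.
sub non_ascii_codepoints {
    my ($raw_bytes, $encoding) = @_;
    # decode() may modify its source argument when a CHECK value is given,
    # so work on a copy.
    my $copy = $raw_bytes;
    # FB_CROAK makes decode() die on byte sequences that are not valid
    # in the assumed encoding, instead of silently substituting.
    my $chars = decode($encoding, $copy, Encode::FB_CROAK);
    return map { sprintf("U+%04X", ord($_)) }
           grep { /[^ -~]/ } split //, $chars;
}

# Made-up sample: the bytes 0xC3 0xA9 read as UTF-8 and as Latin-1.
my $name = "caf\xc3\xa9.txt";
print join(" ", non_ascii_codepoints($name, "UTF-8")), "\n";       # U+00E9
print join(" ", non_ascii_codepoints($name, "iso-8859-1")), "\n";  # U+00C3 U+00A9
```

The same two bytes come out as one code point under UTF-8 and as two under Latin-1, which is why a guess about the encoding has to come first. If the guess is wrong for a strict encoding like UTF-8, the FB_CROAK setting makes the call die rather than hand back misleading code points, so you may want to wrap it in eval when trying several candidates.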