
Re: Perl / FileFind or ...

by graff (Chancellor)
on Nov 28, 2012 at 04:53 UTC

in reply to Perl / FileFind or ...

What makes you think that handling non-ASCII characters in path/file names should be simple? I suppose that if you have intimate knowledge about the OS you're using, and about the file system installed on the specific disk volume you're using, and about the capabilities of the particular terminal/browser/other application that is trying to display file name strings on your monitor, and about the environment/configuration settings that control the behavior of that application, and about the process(es) that created the file names on that specific disk volume in the first place, then you might know enough for the handling of non-ASCII file names to seem "simple."

But if you lack intimate knowledge on any of those topics, your first resort should be to get a hex-dump view of the byte sequences being used in any given file name string. That way, all you need is a general knowledge of the possible non-ASCII character encodings, and perhaps some presupposition about the (human) language being used by the person who assigned the file name (or at least, some sense of the alphabet being used - Cyrillic? Greek? Latin? Arabic? ... - including the range of diacritic marks, odd-ball punctuation and/or special symbols that are likely to show up). Not that this in itself is "simple", but at least there are fewer moving parts.
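As a rough illustration of that first resort, here is a minimal sketch that turns a file name into a plain hex dump. The helper name hex_dump_name is made up for this example, and it assumes the name is still the raw byte string handed back by the file system (not yet decoded into characters).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Find): show each byte of a
# file name as two hex digits, separated by spaces.
# Assumes $name holds raw bytes as read from the file system.
sub hex_dump_name {
    my ($name) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $name;
}

# "caf" followed by the two bytes c3 a9
print hex_dump_name("caf\xc3\xa9"), "\n";   # prints: 63 61 66 c3 a9
```

A pair like "c3 a9" is a typical UTF-8 two-byte sequence, whereas a lone "e9" in the same spot would point toward a single-byte encoding such as Latin-1 or cp1252.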

Obviously, getting a hex-dump style output just gets in the way when file paths contain nothing outside the printable ASCII range, so a useful elaboration of your File::Find callback might go something like this:

    sub cbFileFind {
        my $printable_name = $File::Find::name;
        # Replace every byte outside printable ASCII (space through tilde)
        # with a \x{..} escape showing its hex value.
        $printable_name =~ s/([^ -~])/sprintf("\\x{%02x}",ord($1))/eg;
        print $printable_name, "\n";
    }

If you happen to already know (or if the approach just shown makes it clear) what the particular character encoding is for the non-ASCII portions of your file names, you can use Encode to convert (decode) the strings as read from the file system into Perl's internal (utf8) character representation, and then the "ord()" function will return Unicode code-point numbers, which you can look up in case the particular characters are unfamiliar to you (check out Re: Regular expressions and accents and tlu -- TransLiterate Unicode).
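The decoding step might look something like the sketch below. The helper name non_ascii_code_points and the sample file name are invented for illustration, and the encoding passed in is whatever you have concluded from the hex dump; nothing here detects it for you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw file-name bytes under an ASSUMED
# encoding and return a "U+XXXX" label for each character that falls
# outside printable ASCII.
sub non_ascii_code_points {
    my ($raw_bytes, $encoding) = @_;
    # FB_CROAK makes decode() die on malformed input, which is a useful
    # hint that the assumed encoding is wrong.
    my $chars = decode($encoding, $raw_bytes, Encode::FB_CROAK);
    return map { sprintf('U+%04X', ord($_)) }
           grep { /[^ -~]/ } split //, $chars;
}

# Same bytes, two different assumptions about the encoding:
print join(' ', non_ascii_code_points("caf\xc3\xa9.txt", 'UTF-8')), "\n";       # U+00E9
print join(' ', non_ascii_code_points("caf\xc3\xa9.txt", 'iso-8859-1')), "\n";  # U+00C3 U+00A9
```

Getting one sensible code point (U+00E9, "e" with acute accent) under UTF-8, versus two unrelated ones under ISO-8859-1, is the kind of evidence that tells you which assumption fits the file names on that volume.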
