PerlMonks  

Re^5: utf8 in directory and filenames

by Juerd (Abbot)
on Nov 13, 2006 at 20:09 UTC


in reply to Re^4: utf8 in directory and filenames
in thread utf8 in directory and filenames

I have read the documents in question, and understand what they say. Really. I have been handling these character set issues for a while, including the Unicode/ISO conversions back and forth in Perl (and iso-8859-x, and "modified utf-7 for IMAP", etc.). I just thought I'd insist on this so that you wouldn't think that I don't understand the basic issues at hand.

Do you understand the difference between a Perl unicode string, and a UTF-8 encoded string? That's a bit more complicated than converting between encodings back and forth, and it's the key issue at hand.
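To make that distinction concrete, here is a minimal sketch (not code from the thread; the sample string is an assumption chosen for illustration). It takes ISO-8859-1 bytes, decodes them into a Perl character string, and then encodes that string as UTF-8 bytes. The two results hold the same text but are different values with different lengths.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from an ISO-8859-1 source: "andre" with e-acute = 0xE9.
my $latin1_bytes = "andr\xE9";

# decode() turns bytes into a Perl character (Unicode) string.
my $chars = decode('iso-8859-1', $latin1_bytes);

# encode() turns the character string into UTF-8 encoded bytes.
my $utf8_bytes = encode('UTF-8', $chars);

# Same text, different representations: 5 characters versus 6 bytes.
printf "characters: %d, utf8 bytes: %d\n", length($chars), length($utf8_bytes);
# prints "characters: 5, utf8 bytes: 6"
```

The character string is what Perl's string functions should operate on; the encoded bytes are what goes to a file, a socket, or (on most Unix systems) the filesystem.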

What I did learn from you, was that I should apparently not blindly convert my filenames to utf8

Or anything else. A filename, once converted or encoded, is no longer the same filename.

failed the "-f" test and an open() test, and I was, and still am, trying to figure out why.

You really, really need to have the error message. If you don't want to output it to STDERR or STDOUT, you can open a log file and write it there. Without the error message, you can only guess what's wrong. Guessing absolutely sucks, because it takes too much time.
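As a rough sketch of the log-file approach (the log location, the `log_failure` helper, and the placeholder path are all hypothetical names introduced here, not taken from the original program), the value of `$!` can be captured at the point of failure and appended to a file:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical log location; adjust to somewhere the process can write.
my $logfile = File::Spec->catfile(File::Spec->tmpdir, "fileaccess-debug.log");

# Append one line describing a failed operation, including the OS error text.
sub log_failure {
    my ($what, $path, $err) = @_;              # copy $err before any further I/O
    open(my $log, '>>', $logfile) or return;   # give up quietly if logging itself fails
    print {$log} scalar(localtime), " $what [$path] failed: $err\n";
    close $log;
}

my $full = "/no/such/dir/file.txt";            # placeholder path for illustration
open(my $fh, '<', $full) or log_failure('open', $full, $!);
```

The important part is that `$!` is read immediately after the failing call, before any other system call can overwrite it.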

I now am close to believing that there are gremlins at play.

If you're on Linux, use strace(1) to find where the gremlins are.

Do I just pick up your last message and hit reply, or start a new question ?

You can continue with the old thread, but it's harder to notice the new message then. I hate to say this, but you're better off starting a new thread. Don't forget to refer to the old one.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^6: utf8 in directory and filenames
by soliplaya (Beadle) on Nov 14, 2006 at 01:53 UTC
    Unfortunately, I realise that I don't know how to refer to a previous thread.

    But I did find the problem, illustrated by the attached showcase. It is a bit contrived, but it is hard to force Perl to explicitly show what happens. And I believe the problem is rather sneaky. This is not to say that, once you understand the problem, you cannot find a good explanation in the Perl Unicode docs.
    In my original problematic program, the name of the directory which I am reading originates in some other program area, and is passed as an argument to the sub that does the directory scanning. That is how it ends up internally flagged as utf8, and why, when I concatenated it with the current directory entry, I got a utf8 string which (sometimes) failed to work in -f and open().
    I say it's sneaky because you get the following cases which give perplexing results :
    - suppose you have a var $path="/abcd", but which for some reason has been sneakily utf8-marked internally by Perl.
    - suppose you use this var as the name of a directory which you scan with readdir()
    - suppose in that directory you have 2 files "josef.txt" and "andré.txt"
    - suppose you read the entries one by one in $name, concatenate them with the directory name (as in $full="$path/$name"), and attempt to open the corresponding file $full
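    .. then open(F,"<$full") will work for "josef.txt", and fail for "andré.txt" with a "No such file" error. The concatenation turns the whole string into utf8, so for the non-ASCII name the bytes Perl hands to the OS no longer match the bytes of the name on disk.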
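The mechanism can be seen without touching the filesystem. The following is only a sketch under stated assumptions: `utf8::upgrade` stands in for whatever flagged the path in the real program, the name is written as ISO-8859-1 bytes, and the `utf8::downgrade` step at the end is one possible workaround, not necessarily the right fix for every program.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $path = "/abcd";
utf8::upgrade($path);             # simulate a path that was utf8-flagged elsewhere

my $name = "andr\xE9.txt";        # ISO-8859-1 bytes, as readdir() would return them

my $full = "$path/$name";         # concatenation upgrades the whole string

print "flagged: ", (utf8::is_utf8($full) ? "yes" : "no"), "\n";

# The UTF-8 form of the string: e-acute is now 0xC3 0xA9 instead of 0xE9.
my $internal = encode('UTF-8', $full);
print "as utf8: ", join(' ', map { sprintf "%02X", ord } split //, $internal), "\n";

# Possible workaround: force a copy back to its byte form before using it
# as a file name. utf8::downgrade dies if a character is above 0xFF.
my $bytes = $full;
utf8::downgrade($bytes);
print "as bytes: ", join(' ', map { sprintf "%02X", ord } split //, $bytes), "\n";
```

For a pure-ASCII name such as "josef.txt" both forms are byte-for-byte identical, which is why only the accented name fails.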

    The attached program can be run in any empty directory. It will start by creating 2 directory entries (files), with the same name but using 2 different encodings. Then it re-reads the directory entries, appends the path and tests the combination.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    # At the Beginning, there was an iso-8859-1 name string..
    my $testname = "Presentación.txt";
    print "starting string [$testname] "
        . (Encode::is_utf8($testname) ? "(utf8)" : "(bytes)") . "\n";

    my $fname_iso = $testname; # simply copying leaves it iso bytes
    print " creating file1 [$fname_iso] "
        . (Encode::is_utf8($fname_iso) ? "(utf8)" : "(bytes)") . "\n";
    open(F1,'>:raw',$fname_iso) or die "cannot open F1 : $!";
    print F1 "Hello 1\n";
    close F1;

    my $fname_utf8 = decode('iso-8859-1',$testname); # force internal utf8
    print " creating file2 [$fname_utf8] "
        . (Encode::is_utf8($fname_utf8) ? "(utf8)" : "(bytes)") . "\n";
    open(F2,'>:raw',$fname_utf8) or die "cannot open F2 : $!";
    print F2 "Hello 2\n";
    close F2;

    my $dir = "."; # that's iso bytes too by default
    opendir(DIR,$dir) or die "cannot open directory $dir : $!";
    my @entries = readdir DIR;
    closedir DIR;

    foreach (@entries) {
        next if $_ =~ /^\./;
        next if $_ =~ /\.pl$/; # skip myself too
        print "entry [$_] " . (Encode::is_utf8($_) ? "(utf8)" : "(bytes)") . "\n";

        print " first try :\n";
        if (-f "$dir/$_") { # like this, leaves it as bytes
            print " passes the -f test,";
            unless (open(F1,'<',"$dir/$_") ) {
                print " but cannot be opened : $!\n";
            } else {
                print " and can be opened !\n";
                close F1;
            }
        } else {
            print " fails the -f test\n";
        }

        print " 2d try :\n";
        my $fullpath = "${dir}/${_}"; # leaves it as bytes also
        print " trying [$fullpath] "
            . (Encode::is_utf8($fullpath) ? "(utf8)" : "(bytes)") . "\n";
        if (-f $fullpath) {
            print " passes the -f test,";
            unless (open(F1,'<',$fullpath) ) {
                print " but cannot be opened : $!\n";
            } else {
                print " and can be opened !\n";
                close F1;
            }
        } else {
            print " fails the -f test\n";
        }

        print " 3d try :\n";
        my $dir_utf = decode('iso-8859-1',$dir); # force internal utf8
        my $fullpath2 = "${dir_utf}/${_}"; # concatenate forces utf8 flag on the whole
        print " trying [$fullpath2] "
            . (Encode::is_utf8($fullpath2) ? "(utf8)" : "(bytes)") . "\n";
        if (-f $fullpath2) {
            print " passes the -f test,";
            unless (open(F1,'<',$fullpath2) ) {
                print " but cannot be opened : $!\n";
            } else {
                print " and can be opened !\n";
                close F1;
            }
        } else {
            print " fails the -f test,";
            unless (open(F1,'<',$fullpath2) ) {
                print " and fails the open() : $!\n";
            } else {
                print " but can be opened !\n";
                close F1;
            }
        }
    }
    exit 0;
    P.S. In the meantime, I still don't know how the original user managed to actually upload a file on my DAV server, from his Windows PC, and have the filename on the Linux server be utf8-encoded. All my attempts have resulted in iso-8859-1 names. I strongly suspect that his station was Windows XP Home (while mine is a Pro), and that the DAV client is not the same.
