
Unicode (ä, ö, ü in German) Problem with File::Find under Windows2000

by TeddyC (Novice)
on Sep 02, 2003 at 10:07 UTC (#288277, perlquestion)

TeddyC has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm new to Perl and can't pin down my problem exactly. I've searched but couldn't find the right solution. I hope this is the right place to post.

Env: Win2000 (en) + ActivePerl 5.8 + Komodo

The trouble begins with special characters in German. I want to copy a whole directory to another location. A prototype with the source and destination paths written directly in the program had no problem:

1) define $srcdir in the Perl program
   $dstdir in the Perl program
like this:

    use File::Find;
    ...
    # there is a directory d:\perl\source\test\öa and a file d:\perl\source\test\öa\.txt under \test\
    $srcdir="d:\\perl\\source\\test";
    $dstdir="d:\\test2";
    finddepth(\&check_and_copy, $srcdir);
    # sub check_and_copy checks the time of each file/dir and only copies the updated file/dir to $dstdir

Then I put srcdir and dstdir in config.xml and used XML::Simple to read them. I got them in UTF8 format, but nothing is copied!
2) define $srcdir in xml
   $dstdir in xml

Komodo shows that the first $_ in "finddepth" is "d:\perl\source\test\öa" (in hex).
Something is not correct: with a depth-first search it should be "d:\perl\source\test\öa\.txt" (just as in 1)).
Then I ran the program under DOS, and it showed a warning: "Can't cd to <d:\perl\source\test/> a ..." (the name can't be displayed correctly here)

-- Is there something wrong in File::Find with utf8, or with utf8 under Win2k?
-- Or does it depend on something else?

I made some tests but I can't understand the results well.

3) define $srcdir in perl-program
$dstdir in xml
# the files are copied, but the characters show up in ASCII format.

4) define $srcdir in perl-program and test

    use Encode;
    $srcdir="d:\\perl\\source\\test";
    if(Encode::is_utf8($srcdir) ){
        print " src is_utf8\n";
    }else{
        print " src is_NOT_utf8\n";
    }
    # I got "src is_NOT_utf8". I heard Perl uses UTF8 internally, but it seems that's not so simple.

5) define $srcdir with "ö" in the Perl program and use utf8

    use utf8;
    $srcdir="d:\\perl\\source\\test\\öa";
    # I got "Malformed UTF-8 character (unexpected non-continuation byte 0x61, immediately after start byte 0xf6)"

-- What is the encoding of a Perl program in Komodo?
-- How can I get the name of a string's encoding?
-- How can I get the right output for "ä, ö, ü" in DOS? (I get the correct characters in Komodo with "binmode (STDOUT,"utf8");")

Thanks for your attention!

Replies are listed 'Best First'.
Re: Unicode (ä, ö, ü in German) Problem with File::Find under Windows2000
by Thelonius (Priest) on Sep 02, 2003 at 15:59 UTC
    Perl 5.8 is capable of using UTF8 internally, but it can't always tell if a string of octets is UTF8, latin1, big5, ..., or just some random binary data.
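
    A minimal sketch of that ambiguity (not from the original post), using the two octets that turn up later in this thread, 0xF6 followed by "a". It assumes only the core Encode module. The variable names are made up for illustration. The same octets decode cleanly as Latin-1 but are rejected as strict UTF-8, so the programmer has to say which encoding is meant.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two octets of the directory name discussed in this thread: 0xF6, then "a".
my $octets = "\xf6a";

# Read as Latin-1, every octet is one character, so decoding always succeeds.
my $as_latin1 = decode('iso-8859-1', $octets);

# Read as strict UTF-8, 0xF6 cannot start a valid sequence, so decoding dies.
# FB_CROAK may modify its input, so a copy is decoded.
my $copy = $octets;
my $valid_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

# The same text, correctly encoded as UTF-8, takes three octets.
my $as_utf8_octets = encode('UTF-8', $as_latin1);

printf "Latin-1 view: %d characters, first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
printf "valid as UTF-8: %s\n", $valid_utf8 ? 'yes' : 'no';
printf "UTF-8 octets: %s\n", unpack('H*', $as_utf8_octets);
```

    Nothing in the octets themselves records which reading is correct, which is why there is no general "get the encoding of this string" function.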

    It looks like readdir is returning a latin1 encoding of the name. This leaves the interesting question of what it would do with some name that isn't representable in latin1.

    As BrowserUK indicated, the error indicates that the string in your source file is not encoded as a utf8 string. If your editor is capable of using utf8, you can still use it in your program. For example, in vim ":set encoding=utf8". That will work, but may not convert pre-existing non-ascii characters.

    I tried a test with File::Find, and finddepth seemed to work okay.

    If you want to display non-ASCII data in a DOS box, you need to convert it to the correct code page. Here's an example program:

        #!perl -w
        use Encode;
        use utf8;
        my $test = "This is a test. Gödel";
        my $cp = `chcp`; # get code page from DOS CHCP command
        if ($cp =~ /(\d+)/) {
            $cp = "cp$1";
        } else {
            $cp = "cp437";
        }
        binmode STDOUT, ":encoding($cp)" or die "Error on binmode: $!";
        print STDOUT "$test\n";
      Addendum: I found out how to get UTF8 results from readdir, if you need them (for German, you don't). You use the "perl -C" flag or set ${^WIDE_SYSTEM_CALLS}.

      Right now it's slightly broken because it returns utf8 strings, but doesn't set the utf8 flag on the strings. There is a workaround:

          #!perl -w
          #
          use File::Find;
          use strict;
          use Encode qw(decode_utf8 is_utf8);
          my $start = "/home/Hirschk/pmonks/utftest";
          {
              local ${^WIDE_SYSTEM_CALLS} = 1;
              finddepth( \&showme, $start );
          }
          sub fixutf8 {
              for (@_) {
                  if (${^WIDE_SYSTEM_CALLS} && !is_utf8($_)) {
                      $_ = decode_utf8($_);
                  }
              }
          }
          sub showme {
              fixutf8($File::Find::dir,$File::Find::name,$_);
              print "\$_ = $_\n";
          }
      The fixutf8 function should, well, fix it.

        Thank you BrowserUk! Thank you Thelonius!
        I think I got more than I hoped for.

        Because of some other trouble, I have only been able to try a little more since this morning.
        Thelonius, I've tried your code, and I think something goes wrong inside File::Find::finddepth before your fix is applied.
        I used XML to get a UTF8 string (similar to my old program).

            <?xml version="1.0" encoding="UTF-8" ?>
            <config>
                <srcdir>d:\temp\source\test2</srcdir>
                <dstdir>d:\temp\source\test5</dstdir>
            </config>

            #!d:\perl\bin\perl.exe -w
            use File::Find;
            use strict;
            use Encode qw(encode_utf8 decode_utf8 is_utf8);
            use XML::Simple;

            my $configfile=".\\config.xml";
            my $config=XMLin($configfile);
            my $srcdir="d:\\temp\\source\\test2";
            print "\$srcdir: $srcdir\n";
            if(is_utf8($srcdir)){
                print "is utf8\n";
            }else{
                print "is NOT utf8\n";
                $srcdir=decode_utf8($srcdir); # ???
            }
            # line "!!!" gets srcdir from xml
            # or you can comment it out to test
            # whether line "???" takes any effect or not
            $srcdir=$$config{'srcdir'}; # !!!
            if(is_utf8($srcdir)){
                print "is utf8\n";
            }else{
                print "is NOT utf8\n";
            }
            {
                local ${^WIDE_SYSTEM_CALLS} = 1;
                finddepth( \&showme, $srcdir );
            }
            sub fixutf8 {
                for (@_) {
                    if (${^WIDE_SYSTEM_CALLS} && !is_utf8($_)) {
                        $_ = decode_utf8($_);
                    }
                }
            }
            sub showme {
                print "\$_ = $_\n";
                fixutf8($File::Find::dir,$File::Find::name,$_);
                print "\$_ = $_\n";
            }
        I got these results in DOS, but it's NOT depth first!

            D:\temp\source>
            $srcdir: d:\temp\source\test2
            is NOT utf8
            is utf8
            Can't cd to (d:\temp\source\test2/) ├┬╢a: No such file or directory at D:\temp\source\ line 28
            $_ = ├╢a
            $_ = ├╢a
            $_ = .
            $_ = .

        and in Komodo:

            $srcdir: d:\temp\source\test2
            is NOT utf8
            is utf8
            $_ = öa
            $_ = öa
            $_ = .
            $_ = .

        If I comment out line "!!!", I get the following in Komodo.
        Line "???" has NO effect, but the search is depth first:

            $srcdir: d:\temp\source\test2
            is NOT utf8
            is NOT utf8
            $_ = .txt
            $_ = &#52212;xt
            $_ = öa
            $_ = &#30797;
            $_ = .
            $_ = .

        (I've set UTF8 as the editor encoding in Komodo's preferences. Some characters can't be posted here correctly; see the note from BrowserUK.)
        Then I tested fixutf8 itself:

            sub fixutf8 {
                for (@_) {
                    print "\$_=$_";
                    if (${^WIDE_SYSTEM_CALLS} && !is_utf8($_)) {
                        $_ = decode_utf8($_);
                    }
                    if(is_utf8($_)){
                        print "#\$_=$_ is utf8\n";
                    }else{
                        print "#\$_=$_ is NOT utf8\n";
                    }
                }
            }

        and got:

            $srcdir: d:\temp\source\test2
            is NOT utf8
            is NOT utf8
            $_ = .txt
            $_=d:\temp\source\test2/öa#$_=d:\temp\source\test2/&#30816;is utf8
            $_=d:\temp\source\test2/öa/.txt#$_=d:\temp\source\test2/&#30831;&#52212;xt is utf8
            $_=.txt#$_=&#52212;xt is utf8
            $_ = &#52212;xt
            $_ = öa
            $_=d:\temp\source\test2#$_=d:\temp\source\test2 is NOT utf8
            $_=d:\temp\source\test2/öa#$_=d:\temp\source\test2/&#30816;is utf8
            $_=öa#$_=&#30816;is utf8
            $_ = &#30797;
            $_ = .
            $_=d:\temp\source\test2#$_=d:\temp\source\test2 is NOT utf8
            $_=d:\temp\source\test2#$_=d:\temp\source\test2 is NOT utf8
            $_=.#$_=. is NOT utf8
            $_ = .

        So, I guess:
        If I give "finddepth" a UTF8 directory name, it gets the names of the child nodes in ASCII but can't handle them correctly, as in the first two results (DOS and Komodo).

        If I give "finddepth" a normal string in the program's own format, it handles them without problems, as in the last result.
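
        One plausible reading of these results, offered as a sketch rather than a confirmed diagnosis: the string from XML::Simple is a decoded character string, and joining it with the Latin-1 octets that readdir returns produces a path whose internal UTF-8 bytes (C3 B6 61 for "öa") reach the file system. That would match the "├╢a" shown in the DOS box. Under that assumption, encoding the configured path back to octets before calling finddepth should make case 2) behave like case 1). The helper name path_to_octets and the choice of cp1252 as the system code page are assumptions, not something from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: turn a decoded (character) path into the octets the
# file system calls are assumed to expect. The default 'cp1252' is an assumption.
sub path_to_octets {
    my ($path, $codepage) = @_;
    $codepage = 'cp1252' unless defined $codepage;
    return encode($codepage, $path);
}

# Stand-in for a value read by XML::Simple: the UTF-8 octets of
# d:\temp\source\test2\öa, decoded into a character string.
my $from_xml = decode('UTF-8', "d:\\temp\\source\\test2\\\xc3\xb6a");

my $srcdir = path_to_octets($from_xml);

# Show the last two octets in hex to confirm "öa" is now 0xF6 0x61.
printf "last two octets: %s\n", unpack('H*', substr($srcdir, -2));

# finddepth(\&check_and_copy, $srcdir);   # then call File::Find as in case 1)
```

        The same encode step applies even when the configured path is pure ASCII, because its purpose is to hand File::Find a plain octet string.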

        In the end I used a plain text file as the config file...
        somewhat disappointed.
        But I still can't understand:
        -- Why does the line "???" have no effect?
        -- According to Thelonius' post there is no function like getEncoding, but what is the encoding in the program?

        Btw, if you visit i also posted), you can see some other German-in-Win32 problems; for German in DOS there is a solution from Crian.

Re: Unicode (ä, ö, ü in German) Problem with File::Find under Windows2000
by BrowserUk (Pope) on Sep 02, 2003 at 13:46 UTC

    The problem is that the pathname "öa" is not utf-8. It is extended (8-bit) ascii, presumably created via the command line. To demonstrate this, try the following one-liner.

        P:\test>perl58 -le"$f=qq[test-\xf6\xf6\xf6-test]; print $f; open F,'>',$f; print F 'this is the file'; close F;"
        test-÷÷÷-test

        P:\test>dir test-*
         Volume in drive P is Winnt
         Volume Serial Number is D822-5AE5

         Directory of P:\test

        02/09/03  14:26                18 test-ööö-test
                       1 File(s)             18 bytes
                                  1,098,700,800 bytes free

        P:\test>type test-*

        test-ööö-test

        this is the file

        P:\test>

    The first character in "öa" is (extended) ascii 246 decimal (0xf6). On its own, this byte is not valid utf-8. When perl attempts to treat a string containing it as utf, it sees 0xf6 as the start byte of a multi-byte utf character and inspects the next byte to form the complete character. However, the next byte, 'a', ascii 97 (0x61), is not a valid continuation byte for utf, hence the error message.
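
    The byte-level check behind that error message can be sketched as follows. This is an illustration, not perl's actual implementation, and the function name utf8_byte_kind is made up. It classifies an octet purely by its leading bits, which is enough to see why 0xF6 looks like a start byte and why 0x61 cannot follow it.

```perl
use strict;
use warnings;

# Classify one octet by the bit pattern UTF-8 assigns to it.
# Note: strict UTF-8 additionally forbids lead bytes above 0xF4, because they
# would encode code points beyond U+10FFFF. Only the bit shape is checked here.
sub utf8_byte_kind {
    my ($byte) = @_;
    return 'ascii'        if ($byte & 0x80) == 0x00;   # 0xxxxxxx
    return 'continuation' if ($byte & 0xC0) == 0x80;   # 10xxxxxx
    return 'lead-2'       if ($byte & 0xE0) == 0xC0;   # 110xxxxx
    return 'lead-3'       if ($byte & 0xF0) == 0xE0;   # 1110xxxx
    return 'lead-4'       if ($byte & 0xF8) == 0xF0;   # 11110xxx
    return 'other';                                    # 0xF8 .. 0xFF
}

# 0xF6 0x61 is the Latin-1 "öa"; 0xC3 0xB6 is "ö" correctly encoded as UTF-8.
for my $byte (0xF6, 0x61, 0xC3, 0xB6) {
    printf "0x%02X => %s\n", $byte, utf8_byte_kind($byte);
}
```

    A lead byte must be followed by continuation bytes. Since 0x61 classifies as plain ascii, the sequence 0xF6 0x61 is malformed, whereas 0xC3 0xB6 is a well-formed two-byte pair.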

        P:\test>perl -le"use utf8; $x = qq[öa]; print $x;"
        Malformed UTF-8 character (unexpected non-continuation byte 0x61, immediately after start byte 0xf6) at -e line 1.
        a

    Here you can see that with use utf8 in force, the error occurs. Unsurprising, as the data is not correctly formed utf. The solution is to disable utf when dealing with extended ascii as can be seen here.

        P:\test>perl -le"no utf8; $x = qq[öa]; print $x;"
        ÷a

    In other words, you will probably bypass the problem by disabling utf8, that is, by placing no utf8; at the top of your program.

    The reason that the output (from perl) displays differently to the input--as '÷a' rather than 'öa'-- is something to do with ascii code to glyph mapping, ie. codepage settings (I think). If anyone has a better or fuller explanation of this, I'd like to hear it also.
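
    A short sketch of that code page effect, assuming only the core Encode module (the variable names are illustrative). It decodes the single octet 0xF6 under the Windows/Latin-1 code pages and under the OEM code pages commonly used by the DOS console, and prints the resulting code points rather than the characters themselves, so the output does not depend on the terminal.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# How the single octet 0xF6 is interpreted under different code pages.
my $octet = "\xf6";
my %codepoint_for;
for my $cp (qw(cp1252 iso-8859-1 cp437 cp850)) {
    $codepoint_for{$cp} = ord(decode($cp, $octet));
    printf "%-10s 0xF6 => U+%04X\n", $cp, $codepoint_for{$cp};
}

# The octet a cp850 console needs in order to display U+00F6 (o with diaeresis).
my $oem_octet = ord(encode('cp850', "\x{f6}"));
printf "cp850 octet for U+00F6: 0x%02X\n", $oem_octet;
```

    U+00F6 is "ö" and U+00F7 is "÷". The file name is stored with 0xF6 meaning "ö", while a console running under cp437 or cp850 draws the same octet as "÷". Converting output to the console's code page, as in Thelonius's chcp example, removes the mismatch.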

    NOTE: Whether the characters in this post will show up correctly when displayed in your browser will depend upon your browser and its handling of character encoding. It looks fine in Opera 6.1 with the encoding set to "automatic", but I have seen cases before where stuff looks fine for me but shows up as "weird characters" in other browsers or with different settings.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Node Type: perlquestion [id://288277]
Approved by broquaint