http://www.perlmonks.org?node_id=991114


in reply to Re^3: How can I get a Unicode @ARGV?
in thread How can I get a Unicode @ARGV?

In cp950, dir/b shows the file names with Unicode chars correctly. However, after dir/b > list.txt, those chars turn into "????" inside list.txt.

In cp65001, dir/b shows the file names with monster chars. However, dir/b > list.txt gives the correct list.

What confuses me most is that seeing the string displayed correctly doesn't mean the data is right, and vice versa. I actually have no idea why Unicode chars display correctly with dir (or when I drop a file path into the cmd console) in cp950, but then can't be manipulated through @ARGV later.
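
A minimal sketch of why a correctly displayed name can still carry wrong data. It assumes that the redirect to list.txt (and, as far as I know, perl's @ARGV on Windows) goes through a lossy conversion to the active code page, while the console display does not. The file name is hypothetical, and cp1252 stands in for cp950 only because it is easy to pick characters it certainly lacks.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical file name: two CJK characters followed by ".txt".
my $name = "\x{4E2D}\x{6587}.txt";

# Converting to a code page that lacks these characters is lossy:
# Encode replaces each unmappable character with a substitution character.
my $ansi = encode('cp1252', $name);

# UTF-8 can represent every character, so this round trip is lossless.
my $utf8 = encode('UTF-8', $name);
my $back = decode('UTF-8', $utf8);

printf "cp1252 bytes : %s\n", $ansi;
printf "UTF-8 hex    : %s\n", unpack('H*', $utf8);
printf "round trip ok: %s\n", $back eq $name ? "yes" : "no";
```

Once the characters have been replaced, nothing downstream can recover them, which matches the "????" in list.txt.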

Re^5: How can I get a Unicode @ARGV?
by remiah (Hermit) on Sep 01, 2012 at 13:24 UTC
    Hello.

    Can you try this?

    1. Paste the function below into your script.
    2. Call it like this. It will output encoding information about $ARGV[0] to logfile.txt:

    troubled_string($ARGV[0], "logfile.txt");
    3. Open logfile.txt in your browser and change the browser's "Encoding" setting until you find a line that shows the string normally.

    sub troubled_string {
        my ($str, $logfile_path) = @_;
        use Encode qw(decode encode from_to encodings);
        open(my $fh, ">", $logfile_path) or die $!;

        # Report whether perl treats $str as characters or as raw bytes.
        printf $fh "utf8 flag:%s\n", utf8::is_utf8($str) ? "utf8 flagged" : "not utf8 flagged";
        printf $fh "hexdump:[%s]\n", utf8::is_utf8($str) ? unpack('U0H*', $str) : unpack('H*', $str);
        printf $fh "length:[%s]\n", length($str);
        printf $fh "%s\n", "-" x 20;

        if ( utf8::is_utf8($str) ) {
            # Character string: show it encoded in every known encoding.
            printf $fh "[encode trial]\n";
            printf $fh "%-25s:%s\n", $_, encode($_, $str) for (Encode->encodings(":all"));
        }
        else {
            # Byte string: decode and re-encode it with every known encoding.
            printf $fh "[decode -> encode trial]\n";
            printf $fh "%-25s:%s\n", $_, encode($_, decode($_, $str)) for (Encode->encodings(":all"));
        }
        close $fh;    # close the filehandle, not the path string
    }

    What does it say?

Re^5: How can I get a Unicode @ARGV?
by nikosv (Deacon) on Sep 01, 2012 at 13:21 UTC

    The 'dir' command in the Windows cmd console works in Unicode regardless of the code page; it's one of those Windows quirks. Go ahead and change the code page to, e.g., Cyrillic and try it out; you'll get the same result.

    In cp950, dir/b shows the file names with Unicode chars correctly. However, after dir/b > list.txt, those chars turn into "????" inside list.txt.

    Is list.txt saved as ANSI (the default)? Save list.txt as Unicode and try again.

    In cp65001, dir/b shows the file names with monster chars. However, dir/b > list.txt gives the correct list.

    I guess cp65001 sets the file I/O to Unicode, which is why list.txt contains the correct list. What is a monster char? Maybe a font issue? What font are you using, Lucida Console?
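
If list.txt is saved as Unicode as suggested above, perl has to be told how to decode it. The following is a rough sketch, assuming the list is UTF-16LE with CRLF line endings; the name read_unicode_list and the temporary file standing in for list.txt are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Read a UTF-16LE file list and return the names as decoded
# Perl character strings (hypothetical helper).
sub read_unicode_list {
    my ($path) = @_;
    # :raw first, so no byte-level newline translation touches the UTF-16 data.
    open(my $in, '<:raw:encoding(UTF-16LE)', $path)
        or die "Cannot open $path: $!";
    my @names;
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;        # strip CRLF or LF line ending
        $line =~ s/^\x{FEFF}//;      # drop a byte order mark, if present
        push @names, $line if length $line;
    }
    close $in;
    return @names;
}

# Demo: a temporary file stands in for list.txt.
my ($tmp, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} encode('UTF-16LE', "\x{4E2D}\x{6587}.txt\r\nplain.txt\r\n");
close $tmp;

my @names = read_unicode_list($tmp_path);
printf "%d names, first has %d characters\n", scalar @names, length $names[0];
```

On the Windows side, cmd's /u switch (for example cmd /u /c "dir/b > list.txt") makes internal commands write Unicode to a file or pipe, which should produce a list this sketch can read without changing the code page.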