Re^2: How can I get a Unicode @ARGV?

by exilepanda (Friar)
on Aug 31, 2012 at 13:28 UTC ( [id://991005]=note )


in reply to Re: How can I get a Unicode @ARGV?
in thread How can I get a Unicode @ARGV?

I'm not sure about this, since I didn't see any suitable place to insert an intercept to investigate. However, if I open a cmd console, the default code page (chcp) is 950, and when I run my Perl code, Win32::Codepage still reports that I am working with cp950.

However, if I leave a clean cmd window open and drop a file onto it, the Unicode file/dir name is displayed correctly. Could that mean the code page has little to do with the problem?

So here is my guess: what if the strings have already been converted to ANSI before they are passed to my script?
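
One way to test that guess is to look at the raw bytes the script actually receives. The following is only a minimal sketch: `describe_arg` is a hypothetical helper name, and it assumes that characters lost in an ANSI conversion arrive as a literal `?` (byte 0x3F), which is what the question marks reported in this thread suggest.

```perl
use strict;
use warnings;

# Hypothetical helper: describe one argument exactly as Perl received it.
sub describe_arg {
    my ($arg) = @_;
    # Hex dump of every byte (or character) in the argument.
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $arg;
    # Count literal question marks (0x3F) and anything outside ASCII.
    my $questions  = () = $arg =~ /\?/g;
    my $high_bytes = () = $arg =~ /[^\x00-\x7F]/g;
    return { hex => $hex, questions => $questions, high_bytes => $high_bytes };
}

# Report on every command-line argument.
for my $arg (@ARGV) {
    my $info = describe_arg($arg);
    printf "hex: %s\n'?' count: %d, non-ASCII count: %d\n",
        $info->{hex}, $info->{questions}, $info->{high_bytes};
}
```

If a dropped file name shows up with 3F bytes where the non-ASCII characters should be, the conversion happened before Perl ever saw the argument, and nothing done inside the script can recover the original name.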

Replies are listed 'Best First'.
Re^3: How can I get a Unicode @ARGV?
by nikosv (Deacon) on Aug 31, 2012 at 17:48 UTC
    cp950 is not Unicode; cp65001 is. When you drag and drop a file onto the packaged executable, an API call occurs which probably works with ANSI. Even so, you start with a UTF-16 file name, and the drop converts it using the "Language for non-Unicode programs" setting. So if the file path is in Japanese and the system code page is cp950, the conversion is UTF-16 -> cp950, which is Big5 (Chinese), not Japanese. The characters cannot be mapped correctly, hence the question marks.
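
The lossy step described above can be approximated with the core Encode module. This is a sketch only: `through_codepage` is a hypothetical helper, the sample characters are chosen for illustration, and Encode's cp950 table is not guaranteed to match the Windows conversion in every detail.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: push a Unicode string through a legacy code page
# and back, the way a Unicode -> ANSI conversion would.
sub through_codepage {
    my ($text, $codepage) = @_;
    # Characters with no mapping are replaced by the substitution character.
    my $bytes = encode($codepage, $text);
    return decode($codepage, $bytes);
}

# U+4E2D is a Chinese character that Big5 covers; U+0E01 is a Thai letter
# that it does not (an assumed stand-in for any unmappable character).
my $result = through_codepage("\x{4E2D}\x{0E01}", 'cp950');
printf "U+%04X\n", ord($_) for split //, $result;
```

Once a character has been replaced this way, the original code point is gone from the byte string, which is why decoding @ARGV afterwards cannot bring it back.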
      In cp950: dir/b shows the files with Unicode characters correctly. However, after dir/b > list.txt the content of list.txt turns into "????".

      In cp65001: dir/b shows the files with monster characters. However, dir/b > list.txt gives the correct list.

      What confuses me most is that seeing the string displayed correctly doesn't mean the data is right, and vice versa. I really have no idea why Unicode characters are displayed correctly by dir (or when I drop a file path into the cmd console) in cp950, yet can't be handled later through @ARGV.

        Hello.

        Could you try this?

        1. Paste the function below into your script.
        2. Call it like this; it will output encoding information about $ARGV[0] to logfile.txt:

        troubled_string($ARGV[0], "logfile.txt");

        3. Examine logfile.txt with your browser and look for the normal string while changing the browser's "Encodings" setting.

        sub troubled_string {
            my ($str, $logfile_path) = @_;
            use Encode qw(decode encode from_to encodings);
            open(my $fh, ">", $logfile_path) or die $!;
            # Report whether Perl treats the string as characters or bytes.
            printf $fh "utf8 flag:%s\n",
                utf8::is_utf8($str) ? "utf8 flagged" : "not utf8 flagged";
            # Hex dump of the underlying bytes.
            printf $fh "hexdump:[%s]\n",
                utf8::is_utf8($str) ? unpack('U0H*', $str) : unpack('H*', $str);
            printf $fh "length:[%s]\n", length($str);
            printf $fh "%s\n", "-" x 20;
            if (utf8::is_utf8($str)) {
                # Character string: try encoding it with every known encoding.
                printf $fh "[encode trial]\n";
                printf $fh "%-25s:%s\n", $_, encode($_, $str)
                    for (Encode->encodings(":all"));
            } else {
                # Byte string: decode and re-encode with every known encoding.
                printf $fh "[decode -> encode trial]\n";
                printf $fh "%-25s:%s\n", $_, encode($_, decode($_, $str))
                    for (Encode->encodings(":all"));
            }
            close $fh;
        }

        What does it say?

        The 'dir' command in the Windows cmd works in Unicode regardless of the code page; it's one of those Windows quirks. Go ahead, change the code page to, e.g., Cyrillic and try it out; you'll get the same result.

        In cp950: dir/b It shows the files with Unicode char correctly. However, dir/b > list.txt the content inside list turns into "????"

        Is list.txt saved as ANSI (the default)? Save list.txt as Unicode and try again.
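
If the listing is saved as Unicode, the script then has to decode it explicitly. The sketch below assumes the file was produced with something like `cmd /u /c dir /b > list.txt`, which writes UTF-16LE, normally without a byte order mark; `parse_utf16le_listing` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes of a UTF-16LE listing into a
# list of file names (as Perl character strings).
sub parse_utf16le_listing {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    $text =~ s/^\x{FEFF}//;    # drop a byte order mark if one is present
    return grep { length } split /\r?\n/, $text;
}

# Usage sketch: read the file in raw mode so no layer touches the bytes.
# open(my $fh, '<:raw', 'list.txt') or die $!;
# my $bytes = do { local $/; <$fh> };
# close $fh;
# my @names = parse_utf16le_listing($bytes);
```

The names come back as character strings, so they need an explicit output encoding (for example via binmode) before being printed.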

        In cp65001: dir/b It shows the files with monster char. However, dir/b > list.txt gives the correct list.

        I guess cp65001 sets the file I/O to Unicode, which is why list.txt contains the correct list. What is a monster char? Maybe a font issue? What font are you using, Lucida Console?
