Re^2: How can I get a Unicode @ARGV?

by exilepanda (Pilgrim)
on Aug 31, 2012 at 13:28 UTC (#991005)

in reply to Re: How can I get a Unicode @ARGV?
in thread How can I get a Unicode @ARGV?

I'm not sure about this, since I couldn't find any suitable place to insert an intercept and investigate. However, if I open a cmd console, the default chcp is 950, and when I run my Perl code, Win32::Codepage still tells me I am working with cp950.

However, if I leave a clean cmd window open and drop a file onto it, the Unicode file/dir name shows correctly. Could that mean the code page has little to do with it?

So here is my guess: what if the strings have already been converted to ANSI before they can be passed to my script?

Replies are listed 'Best First'.
Re^3: How can I get a Unicode @ARGV?
by nikosv (Chaplain) on Aug 31, 2012 at 17:48 UTC
    cp950 is not Unicode; cp65001 (the Windows code page for UTF-8) is. When you drag and drop onto the packaged executable, an API call occurs that probably works with ANSI. Even so, you start with a UTF-16 file name, and the drop converts it using the "Language for non-Unicode programs" setting. So if the file path is in Japanese and the system code page is cp950, the conversion is UTF-16 -> cp950, which is Big5/Chinese, not Japanese. The characters have no mapping there, hence the question marks.
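
The lossy step described above can be reproduced in Perl with the core Encode module. This is only a sketch: `through_code_page` is a hypothetical helper, and Hangul is used as a stand-in for "a script that cp950 cannot represent", on the assumption that Encode's cp950 table has no Hangul and replaces unmappable characters with its substitution character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: push a character string through a legacy code page
# and back, roughly what happens at an ANSI API boundary.
sub through_code_page {
    my ($text, $code_page) = @_;
    my $bytes = encode($code_page, $text);   # unmappable characters are substituted
    return decode($code_page, $bytes);
}

my $chinese = "\x{4E2D}\x{6587}";   # representable in cp950 (Big5)
my $korean  = "\x{D55C}\x{AE00}";   # assumed NOT representable in cp950

binmode STDOUT, ':encoding(UTF-8)';
for my $name ($chinese, $korean) {
    my $back = through_code_page($name, 'cp950');
    printf "%s -> %s (%s)\n", $name, $back, $back eq $name ? 'intact' : 'lost';
}
```

Once a name has gone through that conversion, the original characters cannot be recovered from what the script receives.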
      In cp950: dir/b shows the files with Unicode characters correctly. However, with dir/b > list.txt the content of list.txt turns into "????".

      In cp65001: dir/b shows the files with monster characters. However, dir/b > list.txt gives the correct list.

      What confuses me most is that seeing the string displayed correctly doesn't mean the data is right, and vice versa. I actually have no idea why Unicode characters show correctly with dir (or when I drop a file path onto the cmd console) in cp950, yet I can't work with them later in @ARGV.
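
One way to separate "what is displayed" from "what the data is" is to look at the raw bytes rather than the glyphs. A minimal sketch, assuming the value is a byte string (as @ARGV elements normally are); `hex_bytes` is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack('(H2)*', $bytes);
}

# A name that was already replaced by question marks is just ASCII 0x3f.
print hex_bytes("????"), "\n";

# The same kind of name kept as UTF-8 has distinct multi-byte sequences.
print hex_bytes(encode('UTF-8', "\x{D55C}")), "\n";
```

If the dump of $ARGV[0] shows only 3f bytes where the non-ASCII characters should be, the information was lost before the script started, whatever the console shows.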


        Can you try this?

        1. Paste the function below into your script.
        2. Call it as shown; it writes encoding information about $ARGV[0] to logfile.txt:

            troubled_string($ARGV[0], "logfile.txt");

        3. Open logfile.txt in your browser and change the browser's "Encoding" setting until you find a line where the string looks normal.

            sub troubled_string {
                my ($str, $logfile_path) = @_;
                use Encode qw(decode encode from_to encodings);
                open(my $fh, ">", $logfile_path) or die $!;
                # Is the string flagged as characters, or is it raw bytes?
                printf $fh "utf8 flag:%s\n", utf8::is_utf8($str) ? "utf8 flagged" : "not utf8 flagged";
                # Hex dump of the underlying bytes
                printf $fh "hexdump:[%s]\n", utf8::is_utf8($str) ? unpack('U0H*', $str) : unpack('H*', $str);
                printf $fh "length:[%s]\n", length($str);
                printf $fh "%s\n", "-" x 20;
                if (utf8::is_utf8($str)) {
                    printf $fh "[encode trial]\n";
                    printf $fh "%-25s:%s\n", $_, encode($_, $str) for (Encode->encodings(":all"));
                } else {
                    printf $fh "[decode -> encode trial]\n";
                    for my $enc (Encode->encodings(":all")) {
                        # Some encodings (e.g. UTF-16 without a BOM) croak on decode
                        my $out = eval { encode($enc, decode($enc, $str)) };
                        printf $fh "%-25s:%s\n", $enc, defined $out ? $out : "(failed)";
                    }
                }
                close $fh;
            }

        What does it say?

        The 'dir' command in the Windows cmd console works in Unicode regardless of the code page; it's one of those Windows quirks. Go ahead, change the code page to, e.g., Cyrillic and try it out; you'll get the same result.

        In cp950: dir/b shows the files with Unicode characters correctly. However, with dir/b > list.txt the content of list.txt turns into "????".

        Is list.txt saved as ANSI (the default)? Save list.txt as Unicode and try again.

        In cp65001: dir/b shows the files with monster characters. However, dir/b > list.txt gives the correct list.

        I guess cp65001 sets the file I/O to Unicode, which is why you see list.txt with the correct list. What is a monster char? Maybe a font issue? What font are you using, Lucida Console?
