PerlMonks  

Re^2: Character encoding of microns

by graff (Chancellor)
on Feb 08, 2009 at 02:40 UTC ( [id://742199]=note )


in reply to Re: Character encoding of microns
in thread Character encoding of microns

It's a very good point to make that when trying to work out character encoding problems, you need to know what your display method is doing, as well as what your program is doing. That's why hex dumps of output are so useful (sad, but true).
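A hex dump does not have to come from an external tool. As a minimal sketch (the helper name `hex_bytes` is made up for illustration, and it assumes the string holds raw bytes rather than decoded characters), something like this shows exactly which bytes a program is about to emit:

```perl
use strict;
use warnings;

# Hypothetical helper: render every byte of a string as two hex digits,
# separated by colons.
sub hex_bytes {
    my ($bytes) = @_;
    return join ':', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# The micro sign as one ISO-8859-1 byte, and as its two-byte UTF-8 encoding.
print hex_bytes("\xB5"), "\n";        # B5
print hex_bytes("\xC2\xB5"), "\n";    # C2:B5
```

Comparing such a dump before and after each processing step takes the terminal or editor out of the picture.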

But it's also worthwhile to understand the "?" output a little better. When any Unicode-aware process (whether a Perl script, display terminal, browser rendering engine, database client, database server, or whatever) converts from Unicode to some other encoding, the standard default behavior is to substitute "?" for any Unicode character whose code point has no equivalent in the output encoding.

When you see "?" in your output where you expect to see other characters, the first thing to do is to identify the point in the processing or display chain where Unicode data has been converted to some other encoding.
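The substitution can be reproduced with Perl's Encode module. This is only a sketch with a made-up sample string: it encodes a micro sign into one encoding that has it and one that does not, then repeats the second conversion with `Encode::FB_CROAK`, which is one way to make the offending step fail loudly instead of quietly writing "?".

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00B5 MICRO SIGN exists in ISO-8859-1 but not in US-ASCII.
my $text = "10 \x{B5}m";

# Default behaviour: unmappable characters are silently replaced.
my $latin1 = encode('ISO-8859-1', $text);
my $ascii  = encode('US-ASCII',   $text);

print 'ISO-8859-1: ', unpack('H*', $latin1), "\n";   # 313020b56d
print "US-ASCII:   $ascii\n";                         # 10 ?m

# Strict behaviour: FB_CROAK turns the silent "?" into an exception.
# LEAVE_SRC keeps $text unmodified.
my $strict = eval {
    encode('US-ASCII', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
print "strict encode died: $@" unless defined $strict;
```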

When data is going the other direction (from some known or assumed "other" encoding into Unicode), and the conversion process (wherever it is) sees input bytes or byte pairs that are not defined in the mapping table for the given non-Unicode character set, it will put one or more "\x{fffd}" (the Unicode "replacement character") in place of the uninterpretable parts of its output Unicode string.
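The decoding direction can be sketched the same way. The sample bytes below are invented for illustration: the same byte string is decoded once with an encoding that defines 0xB5 and once with one that does not.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0xB5 is the micro sign in ISO-8859-1; US-ASCII defines nothing above 0x7F.
my $bytes = "10 \xB5m";

my $from_latin1 = decode('ISO-8859-1', $bytes);
my $from_ascii  = decode('US-ASCII',   $bytes);

# Look at the code point that ends up at offset 3 in each result.
printf "as ISO-8859-1: U+%04X\n", ord(substr($from_latin1, 3, 1));   # U+00B5
printf "as US-ASCII:   U+%04X\n", ord(substr($from_ascii,  3, 1));   # U+FFFD
```

Finding U+FFFD in decoded data therefore points at a decode step that assumed the wrong input encoding.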

Replies are listed 'Best First'.
Re^3: Character encoding of microns
by joec_ (Scribe) on Feb 09, 2009 at 10:54 UTC
    Hi,

    Am I correct in assuming that the Oracle encoding WE8ISO8859P1 is actually ISO-8859-1? In that case, am I also correct in assuming that Perl automatically writes data as ISO-8859-1?

    Even if I call decode('ISO-8859-1', $clob), I still get question marks written for microns.

    I just tried a little experiment: in Notepad++ I typed a single micro sign (Alt-0181). It displayed fine while the encoding was set to ANSI. When I changed the encoding to UTF-8, I got a box/splodge. When I open my actual file and change the encoding from ANSI to UTF-8, nothing happens. This is interesting, is it not?

    This problem is beginning to bug me now :).

    Any help appreciated.

    Joe

    UPDATE---

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with Âµ in it'
    conv: 'this is string with µ in it'
    unix perlio encoding(utf8) utf8
    clob: 'this is string with ÃÂµ in it'
    conv: 'this is string with Âµ in it'

    That is the output of oshalla's code. It would seem that the first decode as utf8 makes it work, as long as you don't binmode STDOUT. After binmode, the strange As start to appear.

    That is fine for this test string, but my database output still has question marks in place of the micro signs.

    Update 2: I wrote a little C# program to grab the output from Oracle and write it to a file. That worked without any problem. In Perl, binmode on STDOUT didn't change anything, and neither did use encoding 'utf8'.

    any help appreciated guys

    -- joe

    ---

    Eschew obfuscation, espouse elucidation!

      am I also correct in assuming that Perl automatically writes data as ISO-8859-1?

      Not really. Perl outputs using whatever encoding you specify (via use open, binmode or some other means).

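      If you don't specify one, Perl writes the string out without encoding it. A string whose characters all fit in a byte is written as those bytes, whatever encoding they happen to be in, regardless of the UTF8 flag. A string containing a character above 255 is written in its internal representation, a lax variant of UTF-8 called utf8, and you might also get a "Wide character" warning.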
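A way to check what actually reaches a handle is to print to an in-memory file and dump the result. This is a sketch only: `written_bytes` is a hypothetical helper, and the in-memory handle stands in for STDOUT or a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string through a handle opened with $layer
# and return the bytes that were written.
sub written_bytes {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close($fh) or die "cannot close in-memory file: $!";
    return $buffer;
}

my $micro    = "\xB5";      # UTF8 flag off
my $upgraded = $micro;
utf8::upgrade($upgraded);   # same character, UTF8 flag on

print unpack('H*', written_bytes(':raw', $micro)), "\n";                 # b5
print unpack('H*', written_bytes(':raw', $upgraded)), "\n";              # b5
print unpack('H*', written_bytes(':encoding(UTF-8)', $micro)), "\n";     # c2b5
```

With no encoding layer the micro sign goes out as the single byte B5 either way; only the explicit layer produces C2 B5.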

      If you happen to pass iso-latin-1 characters to Perl and you print these out, Perl will output iso-latin-1. But the same goes for any encoding.

      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
      # Second perl outputs iso-8859-1
      $ perl -e'use open ":std", ":encoding(iso-8859-1)"; print chr(0x00E9)' | perl -e"print <>" | od -t x1
      0000000 e9
      0000001

      # U+0449 CYRILLIC SMALL LETTER SHCHA
      # Second perl outputs iso-8859-5
      $ perl -e'use open ":std", ":encoding(iso-8859-5)"; print chr(0x0449)' | perl -e"print <>" | od -t x1
      0000000 e9
      0000001

      However, many aspects of Perl will presume the arbitrary bytes of unknown encoding are iso-latin-1. This includes uc, regexp character classes such as \w, explicit upgrades to utf8 (utf8::upgrade($_)), and implicit upgrades to utf8 (chop( $_ . chr(0x2660) )).
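The two kinds of upgrade can be observed directly. The sketch below uses only core `utf8::` functions; the variable names are illustrative. It starts from the single byte 0xB5 and shows that both upgrades treat it as U+00B5, which is the ISO-8859-1 interpretation.

```perl
use strict;
use warnings;

my $bytes = "\xB5";              # one byte, UTF8 flag off

# Explicit upgrade: the byte is taken to be U+00B5.
my $explicit = $bytes;
utf8::upgrade($explicit);

# Implicit upgrade: concatenating a wide character upgrades the whole
# string, and chop then removes the wide character again.
my $implicit = $bytes . chr(0x2660);
chop $implicit;

# utf8::encode on a copy exposes the UTF-8 bytes of the upgraded string.
my $storage = $explicit;
utf8::encode($storage);

printf "explicit: U+%04X, flag %d\n", ord($explicit), utf8::is_utf8($explicit) ? 1 : 0;
printf "implicit: U+%04X, flag %d\n", ord($implicit), utf8::is_utf8($implicit) ? 1 : 0;
printf "storage:  %s\n", unpack('H*', $storage);   # c2b5
print  "still equal to the original\n" if $explicit eq $bytes;
```

If the byte had really come from another encoding, such as ISO-8859-5, the upgrade would still produce U+00B5, which is why decoding input explicitly matters.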
