PerlMonks  

Character encoding of microns

by joec_ (Scribe)
on Feb 06, 2009 at 15:50 UTC ( #741927=perlquestion )
joec_ has asked for the wisdom of the Perl Monks concerning the following question:

Hi

This kind of question may have been asked before, but here goes:

I have a CLOB that is returned from an Oracle database with encoding WE8ISO8859P1. Contained in the CLOB are various scientific symbols such as the micro sign (µ).

This seems to display OK when I look at the table in TOAD, but when I fetch the data and write it to a file, the micro sign gets replaced with a question mark.

I have tried the following, with no success, using the Encode module.

    use Encode;
    open (OUTPUT, ">$filename");
    $clob = $sth->fetch;
    $convertedstr = decode("iso-8859-1",$clob);
    print OUTPUT $convertedstr;

But this still prints ? for the micro signs. Do I need a different encoding or some such?

Thanks in advance.

Joe

---
Eschew obfuscation, espouse eludication!

Re: Character encoding of microns
by oshalla (Deacon) on Feb 06, 2009 at 16:06 UTC

    Have a look at what $clob and $convertedstr contain:

        bytes($clob) ;
        bytes($convertedstr) ;

        sub bytes {
          my ($s) = @_ ;
          my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
          use bytes ;
          print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
        } ;

    which may tell you where your microns are getting lost.

      Hi, I tried your bytes code with this data:

          use Encode;

          $clob = "this is string with [micro sign here] in it";
          $convertedstr = decode("utf8",$clob);

          print $clob;
          print $convertedstr;

          bytes($clob) ;
          bytes($convertedstr) ;

          sub bytes {
            my ($s) = @_ ;
            my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
            use bytes ;
            print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
          } ;

      The output of which was:

      74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte

      74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8

      So, as you can see, it all matches up. Note that I tried your code on my Mac at home, so I will have to try it at work as well. I printed the text before and after conversion: it prints OK (with the micro symbol) before, but after using decode it displays ? on my Mac.

      What does this mean then? Like I said, I will try your code at work, but currently the text displays ? before and after conversion. I use 'more' on Linux at work and Notepad++ on Windows at work; both display ?

      Thanks

      Joe

      Eschew obfuscation, espouse eludication!

        As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron; this is significant, as we will see...

        What you are seeing when you print to STDOUT takes a little explaining...

        By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

        When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

        When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character -- in your case, apparently '?', on my screen, something I will describe as a splodge.
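
        To watch that downgrade happen without a terminal getting in the way, here is a minimal sketch. It assumes an in-memory file handle with no encoding layer behaves like the default STDOUT for this purpose; the helper name printed_bytes is made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $str to a handle with no encoding layer
# and return the bytes that actually arrive, as colon-separated hex.
sub printed_bytes {
    my ($str) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close($fh);
    return join(':', map { sprintf('%02X', $_) } unpack('C*', $buf));
}

my $clob         = "\xC2\xB5";              # byte string: UTF-8 form of the micro sign
my $convertedstr = decode('UTF-8', $clob);  # character string: one character, U+00B5

print 'byte string: ', printed_bytes($clob), "\n";          # C2:B5
print 'utf8 string: ', printed_bytes($convertedstr), "\n";  # B5
```

        The byte string goes out untouched, while the character string is written as the single LATIN1 byte 0xB5, which a UTF-8 terminal cannot render.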

        You can tell STDOUT that it's a UTF-8 file-handle using binmode, so:

            use strict ;
            use warnings ;

            use PerlIO ;
            use Encode ;

            my $clob = "this is string with \x{C2}\x{B5} in it" ;
            my $convertedstr = decode("utf8",$clob) ;

            print "clob: " ; bytes($clob) ;
            print "conv: " ; bytes($convertedstr) ;

            my @layers = PerlIO::get_layers(STDOUT) ;
            print "@layers\n" ;
            print "clob: '$clob'\n" ;
            print "conv: '$convertedstr'\n" ;

            binmode(STDOUT, ":encoding(UTF-8)") ;

            @layers = PerlIO::get_layers(STDOUT) ;
            print "@layers\n" ;
            print "clob: '$clob'\n" ;
            print "conv: '$convertedstr'\n" ;

            sub bytes {
              my ($s) = @_ ;
              my $w = utf8::is_utf8($s) ? "utf8" : "byte" ;
              use bytes ;
              print join(":", map(sprintf("%02X", $_), unpack('C*', $s))), " -- $w\n" ;
            } ;

        where the PerlIO::get_layers is returning information about how the file-handle is configured. This produces:
        clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
        conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
        unix perlio
        clob: 'this is string with µ in it'
        conv: 'this is string with ▒ in it'
        unix perlio encoding(utf-8-strict) utf8
        clob: 'this is string with Âµ in it'
        conv: 'this is string with µ in it'
        
        So now you're asking yourself, where the MUMBLE did the 'Âµ' come from. Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that it knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is 'Â'.
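
        The double encoding can be reproduced with Encode alone. This is only a sketch of what the :encoding(UTF-8) layer effectively does to a byte string, not the layer's actual implementation; hex_bytes is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: colon-separated hex dump of a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join(':', map { sprintf('%02X', $_) } unpack('C*', $bytes));
}

my $clob = "\xC2\xB5";   # already UTF-8, but Perl sees two LATIN1 characters

# Treat each byte as a LATIN1 character, then encode the result as UTF-8:
# this mimics printing the byte string to a UTF-8 handle.
my $double = encode('UTF-8', decode('ISO-8859-1', $clob));
print hex_bytes($double), "\n";   # C3:82:C2:B5

# Decoding as UTF-8 first yields one character, which encodes back to two bytes.
my $single = encode('UTF-8', decode('UTF-8', $clob));
print hex_bytes($single), "\n";   # C2:B5
```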

        The message is that you have to be consistent:

        • you can operate with byte strings that contain UTF-8 sequences, and provided you leave your file handles with no explicit encoding, those UTF-8 sequences will pass through untouched. Which is fine if the target device expects UTF-8 sequences.

          But, of course, those UTF-8 sequences will look like two (or more) LATIN1 characters if you process the strings.

        • you can operate with utf8 strings that contain "wide characters" (held internally as UTF-8 sequences, as it happens), and provided you set your file handles to :encoding(UTF-8) those wide characters will be encoded/decoded as they are output/input.

          You can also operate with byte strings that contain LATIN1 characters, and file handles set to :encoding(UTF-8) will encode the characters as they are output.

          Or you can leave your file handles with no explicit encoding, and encode/decode strings explicitly before output and after input.

        But if you try mixing the two, confusion will reign.
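
        One consistent arrangement (decode on input, encoding layer on output) might look like the sketch below. It assumes the incoming data really is ISO-8859-1 bytes, which is a guess based on the WE8ISO8859P1 setting mentioned in the question; the literal string and the temporary file stand in for the fetched data and the real output file.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Assumption: the incoming data is ISO-8859-1 bytes, so the micro sign
# arrives as the single byte 0xB5. This literal stands in for fetched data.
my $raw  = "10 \xB5m";
my $text = decode('ISO-8859-1', $raw);   # character string from here on

# Write through a handle that encodes characters as UTF-8 on output.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);
open(my $out, '>:encoding(UTF-8)', $filename) or die "cannot write $filename: $!";
print {$out} $text;
close($out) or die "cannot close $filename: $!";

# Read the file back as raw bytes to see what was actually stored.
open(my $in, '<:raw', $filename) or die "cannot read $filename: $!";
my $stored = do { local $/; <$in> };
close($in);

print join(':', map { sprintf('%02X', $_) } unpack('C*', $stored)), "\n";
# 31:30:20:C2:B5:6D
```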

        See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.

Re: Character encoding of microns
by Anonymous Monk on Feb 06, 2009 at 16:11 UTC
Re: Character encoding of microns
by misterwhipple (Monk) on Feb 06, 2009 at 19:06 UTC
    Many programs can't display some characters properly, and will show a ? or something else in place of a problem character. Are you certain that the ? is being recorded in the file? Or is the problem in the program with which you're viewing it?

    cat >~/.sig </dev/interesting

      It's a very good point to make that when trying to work out character encoding problems, you need to know what your display method is doing, as well as what your program is doing. That's why hex dumps of output are so useful (sad, but true).
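
      For anyone without a hex dump tool to hand, a rough Perl equivalent is sketched here. The helper name hex_dump_file is invented for the example, and the temporary files only exist to give it something to read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the contents of a file as colon-separated
# hex bytes, reading in :raw mode so no layer alters what is on disk.
sub hex_dump_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "cannot read $path: $!";
    my $bytes = do { local $/; <$fh> };
    close($fh);
    return join(':', map { sprintf('%02X', $_) } unpack('C*', $bytes));
}

# Demo: one file holding a literal question mark, one holding a LATIN1 micro sign.
for my $content ("?", "\xB5") {
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh);
    print {$fh} $content;
    close($fh);
    print hex_dump_file($path), "\n";   # 3F, then B5
}
```

      If the dump shows 3F where the micro sign should be, the question mark is really in the file; if it shows B5 or C2:B5, the data is fine and the viewer is the problem.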

      But it's also worthwhile to understand the "?" output a little better. When any unicode-aware process (whether a perl script, display terminal, browser rendering engine, database client, database server, or whatever) is trying to convert from unicode to some other encoding, the standard default behavior is to replace a unicode character with "?" in case the output encoding does not have a character that maps to the given unicode code point.
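
      In Perl this substitution is easy to reproduce with the Encode module's default behaviour (no CHECK argument). A small sketch, with the ohm sign chosen arbitrarily as a character that ISO-8859-1 lacks:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $micro = "\x{B5}";     # U+00B5 MICRO SIGN, present in ISO-8859-1
my $ohm   = "\x{2126}";   # U+2126 OHM SIGN, absent from ISO-8859-1

my $micro_latin1 = encode('ISO-8859-1', $micro);
my $ohm_latin1   = encode('ISO-8859-1', $ohm);
my $micro_ascii  = encode('ascii', $micro);

printf "%02X\n", ord($micro_latin1);   # B5
print "$ohm_latin1\n";                 # ?
print "$micro_ascii\n";                # ?
```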

      When you see "?" in your outputs where you expect to see other characters, the first thing to do is to identify the point in the processing or display where unicode data has been converted to some other encoding.

      When data is going the other direction (from some known or assumed "other" encoding), and the conversion process (wherever it is) sees input bytes or byte pairs that are not defined in the mapping table for the given non-unicode character set, it will put one or more "\x{fffd}" (the unicode "replacement character") in place of the uninterpretable parts in its output unicode string.
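
      The reverse direction can be sketched the same way, again relying on Encode's default behaviour. The input here is an assumed single LATIN1 byte for the micro sign:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_micro = "\xB5";   # micro sign as a single ISO-8859-1 byte

# Decoded with the right encoding, the byte maps to U+00B5.
my $good = decode('ISO-8859-1', $latin1_micro);
printf "U+%04X\n", ord($good);   # U+00B5

# Decoded as UTF-8, a lone 0xB5 is malformed, so the default behaviour
# substitutes the replacement character.
my $bad = decode('UTF-8', $latin1_micro);
printf "U+%04X\n", ord($bad);    # U+FFFD
```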

        Hi,

        Am I correct in assuming that the Oracle encoding WE8ISO8859P1 is actually ISO-8859-1? In that case, am I also correct in assuming that Perl automatically writes data as ISO-8859-1?

        Even if I use decode('ISO-8859-1',$clob); I still get question marks written for microns.

        I just tried a little experiment: in Notepad++ I wrote a single micro sign (Alt-0181). That displayed fine when the encoding is ANSI. When I changed it to UTF-8, I got a box/splodge. When I open my actual file and change the encoding from ANSI to UTF-8, nothing happens. This is interesting, is it not?

        This problem is beginning to bug me now :).

        Any help appreciated.

        Joe

        UPDATE---

        clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
        conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
        unix perlio
        clob: 'this is string with µ in it'
        conv: 'this is string with in it'
        unix perlio encoding(utf8) utf8
        clob: 'this is string with Âµ in it'
        conv: 'this is string with µ in it'

        That is the output of oshalla's code. It would seem that the first decode as utf8 makes it work, as long as you don't binmode STDOUT. After binmode the strange As start to appear.

        However, this is fine for this test string. But my database output still has question marks in place of the micro signs.

        Update 2: I wrote a little C# program to grab the output from Oracle and write it to a file. This had no problem and worked fine. In Perl, binmode on STDOUT didn't affect anything, and neither did use encoding 'utf8'.

        Any help appreciated, guys.

        -- joe

        ---

        Eschew obfuscation, espouse eludication!

Node Type: perlquestion [id://741927]
Front-paged by Arunbear