http://www.perlmonks.org?node_id=742394


in reply to Re^2: Character encoding of microns
in thread Character encoding of microns

Hi,

Am I correct in assuming that the Oracle encoding WE8ISO8859P1 is actually ISO-8859-1? In that case, am I also correct in assuming that Perl automatically writes data as ISO-8859-1?

Even if I call decode('ISO-8859-1', $clob), I still get question marks written in place of the micro signs.

I just tried a little experiment: in Notepad++ I wrote a single micro sign (Alt-0181). It displayed fine with the encoding set to ANSI. When I changed it to UTF-8, I got a box/splodge. When I open my actual file and change the encoding from ANSI to UTF-8, nothing happens. This is interesting, is it not?

This problem is beginning to bug me now :).

Any help appreciated.

Joe

UPDATE---

clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with Âµ in it'
conv: 'this is string with µ in it'
unix perlio encoding(utf8) utf8
clob: 'this is string with ÃÂµ in it'
conv: 'this is string with Âµ in it'

That is the output of oshalla's code. It would seem that the first decode as UTF-8 makes it work, as long as you don't binmode STDOUT. After binmode, the strange As start to appear.

So this is fine for the test string, but my database output still has question marks in place of the micro signs.

Update 2: I wrote a little C# program to grab the output from Oracle and write it to a file. This had no problem and worked fine. In Perl, binmode on STDOUT didn't affect anything, and neither did use encoding 'utf8'.

Any help appreciated, guys.

-- joe

---

Eschew obfuscation, espouse elucidation!

Re^4: Character encoding of microns
by ikegami (Patriarch) on Feb 10, 2009 at 15:37 UTC

    Am I also correct in assuming that Perl automatically writes data as ISO-8859-1?

    Not really. Perl outputs using whatever encoding you specify (via use open, binmode or some other means).
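As a minimal sketch of what "specifying the encoding" looks like in practice, the following writes the micro sign through two different PerlIO layers and dumps the bytes that result. The helper names bytes_written and hex_dump are made up for this illustration, and an in-memory filehandle stands in for a real output file; for STDOUT one would call binmode(STDOUT, ':encoding(...)') instead.

```perl
use strict;
use warnings;

# Encode a character string through a PerlIO layer and return the bytes
# that would be written. An in-memory filehandle stands in for a real file.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Render a byte string as colon-separated hex, like the dump in the question.
sub hex_dump {
    my ($bytes) = @_;
    return join ':', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $micro = chr(0x00B5);    # U+00B5 MICRO SIGN

print hex_dump(bytes_written(':encoding(iso-8859-1)', $micro)), "\n";
print hex_dump(bytes_written(':encoding(UTF-8)',      $micro)), "\n";
```

Run as written, this should print B5 for iso-8859-1 and C2:B5 for UTF-8: the same character, but different bytes depending on the layer.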

    If you don't specify, it outputs the string's characters as arbitrary bytes of unknown encoding whenever every character is below 256, regardless of the UTF8 flag. If the string contains a character above 255, it outputs the internal representation instead, a lax variant of UTF-8 called utf8, and you get a "Wide character" warning if warnings are enabled.
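One way to check what actually comes out when no encoding is specified is to print to a handle that has no layer and look at the bytes. This is only a sketch: raw_print and hex_dump are hypothetical helpers, and an in-memory handle stands in for STDOUT.

```perl
use strict;
use warnings;

# Print $text to a handle with no encoding layer. Returns the bytes written,
# followed by any warnings raised while printing.
sub raw_print {
    my ($text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return ($buffer, @warnings);
}

# Render a byte string as colon-separated hex.
sub hex_dump { join ':', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $micro = chr(0x00B5);
utf8::upgrade($micro);      # UTF8 flag on, still the single character U+00B5
my ($bytes, @w) = raw_print($micro);
printf "%s (%d warnings)\n", hex_dump($bytes), scalar @w;

($bytes, @w) = raw_print(chr(0x2660));    # U+2660 does not fit in one byte
printf "%s (%d warnings)\n", hex_dump($bytes), scalar @w;
```

On the Perl versions I am aware of, the first line is B5 with no warnings and the second is E2:99:A0 with one "Wide character in print" warning. So the UTF8 flag on its own does not change the bytes; only a character above 255 forces the utf8 form.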

    If you happen to pass iso-latin-1 characters to Perl and you print these out, Perl will output iso-latin-1. But the same goes for any encoding.

    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    # Second perl outputs iso-8859-1
    $ perl -e'use open ":std", ":encoding(iso-8859-1)"; print chr(0x00E9)' | perl -e"print <>" | od -t x1
    0000000 e9
    0000001

    # U+0449 CYRILLIC SMALL LETTER SHCHA
    # Second perl outputs iso-8859-5
    $ perl -e'use open ":std", ":encoding(iso-8859-5)"; print chr(0x0449)' | perl -e"print <>" | od -t x1
    0000000 e9
    0000001

    However, many aspects of Perl will presume the arbitrary bytes of unknown encoding are iso-latin-1. This includes uc, regexp character classes such as \w, explicit upgrades to utf8 (utf8::upgrade($_)), and implicit upgrades to utf8 (chop( $_ . chr(0x2660) )).
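The same iso-latin-1 assumption is what produces the extra characters in the question's output once an :encoding(utf8) layer is added. Here is a small sketch using the Encode module and the two bytes C2:B5 from the hex dump; hex_dump is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render a byte string as colon-separated hex.
sub hex_dump { join ':', map { sprintf '%02X', ord($_) } split //, $_[0] }

# The two bytes shown for the micro sign in the hex dump: UTF-8 for U+00B5.
my $clob = "\xC2\xB5";

# Encoding undecoded bytes: each byte is taken as an iso-latin-1 character,
# so 0xC2 is treated as U+00C2 and 0xB5 as U+00B5, and both get encoded.
my $double = encode('UTF-8', $clob);
print hex_dump($double), "\n";

# Decoding first yields the single character U+00B5, which then encodes
# cleanly into whichever output encoding is wanted.
my $conv = decode('UTF-8', $clob);
print length($conv), "\n";
print hex_dump(encode('UTF-8', $conv)), "\n";
print hex_dump(encode('iso-8859-1', $conv)), "\n";
```

This should print C3:82:C2:B5, then 1, then C2:B5, then B5. The first line is the double-encoded form behind the strange As; the last is the single byte an ISO-8859-1 file would need.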