
Re^4: Character encoding of microns

by joec_ (Scribe)
on Feb 10, 2009 at 11:16 UTC

in reply to Re^3: Character encoding of microns
in thread Character encoding of microns

Hi, I'm grateful for your detailed explanation, but I am still having problems.

If I run your code with the micron encoded as \x{C2}\x{B5}, then just using decode('utf8',$clob) seems to work, as you can see from the first set of clob/conv strings below, after the bytes output.

clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with µ in it'
conv: 'this is string with in it'
unix perlio encoding(utf8) utf8
clob: 'this is string with µ in it'
conv: 'this is string with µ in it'

However, if I actually type a micron into the string using Alt-0181, then I get the following output (note that I turned use diagnostics on):

clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:B5:20:69:6E:20:69:74 -- byte
conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:EF:BF:BD:20:69:6E:20:69:74 -- utf8
unix perlio
clob: 'this is string with in it'
Wide character in print at line 19 (#1)
(W utf8) Perl met a wide character (>255) when it wasn't expecting one. This warning is by default on for I/O (like print). The easiest way to quiet this warning is simply to add the :utf8 layer to the output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the warning is to add no warnings 'utf8'; but that is often closer to cheating. In general, you are supposed to explicitly mark the filehandle with an encoding, see open and perlfunc/binmode.
conv: 'this is string with � in it'
unix perlio encoding(utf8) utf8
clob: 'this is string with µ in it'
conv: 'this is string with � in it'

That last conv string is, I assume, your splodge? Perhaps, since no question marks are being output, this is not an encoding problem at all?

I honestly do appreciate all your time.


Eschew obfuscation, espouse eludication!

Re^5: Character encoding of microns
by almut (Canon) on Feb 10, 2009 at 14:56 UTC
    if i actually type a micron into the string using Alt-0181 then i get the following output...

    Apparently, your editor is operating in ISO-Latin1 mode and is entering the micron as a single byte (181 decimal = B5 hex).

    You're then telling Perl that this string is UTF-8 (i.e. the decode("utf8",$clob) statement from oshalla's code), which is incorrect. As a result, the conversion (silently) fails, and the invalid part (B5 cannot start a valid UTF-8 sequence) is replaced by the Unicode replacement character U+FFFD, which when encoded as UTF-8 produces the three-byte sequence EF BF BD.
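
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch, assuming the core Encode module; `hex_bytes` is a hypothetical helper written for this illustration and is not part of the code discussed in the thread. It decodes the single byte B5 first as UTF-8 (wrong for this input) and then as ISO-8859-1 (the encoding the editor apparently used).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): colon-separated hex dump of a byte string.
sub hex_bytes { join ':', map { sprintf '%02X', ord } split //, $_[0] }

# Wrongly treating the lone Latin-1 byte as UTF-8: the malformed byte
# is replaced by U+FFFD, the Unicode replacement character.
my $wrong = decode('utf8', "\xB5");
printf "decoded as utf8:   U+%04X\n", ord $wrong;                       # U+FFFD
print  "re-encoded:        ", hex_bytes(encode('utf8', $wrong)), "\n";  # EF:BF:BD

# Declaring the actual source encoding keeps the micron (U+00B5).
my $right = decode('iso-8859-1', "\xB5");
printf "decoded as latin1: U+%04X\n", ord $right;                       # U+00B5
print  "re-encoded:        ", hex_bytes(encode('utf8', $right)), "\n";  # C2:B5
```

The second half suggests one possible fix, under the assumption that the input really is Latin-1: decode with the encoding the data actually has, and only then encode to UTF-8 for output.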

    When you interpret/display those three bytes as ISO-Latin1 characters they appear as "ï¿½", i.e. ï = EF, ¿ = BF, ½ = BD. This is how I (and I suppose everyone else, too) see them in your post, because the PM site isn't Unicode-aware. If your terminal displays those same three characters, this just means it isn't Unicode-aware either...
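
To illustrate the display effect, here is a small sketch (again assuming only the core Encode module, with variable names chosen for this example) that takes the UTF-8 bytes of U+FFFD and reads them back once as ISO-8859-1 and once as UTF-8. It prints code points rather than the characters themselves, so the result does not depend on the terminal's own encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = encode('UTF-8', "\x{FFFD}");   # the three bytes EF BF BD

# Misread as ISO-8859-1: every byte becomes one character of its own.
my $misread = decode('iso-8859-1', $bytes);
printf "%d characters: %s\n", length($misread),
    join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
# prints: 3 characters: U+00EF U+00BF U+00BD

# Read as UTF-8: the single replacement character comes back.
my $proper = decode('UTF-8', $bytes);
printf "%d character:  U+%04X\n", length($proper), ord $proper;
# prints: 1 character:  U+FFFD
```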

    IOW, everything behaves as expected. :)


      So, how would I get round the problem of question marks being displayed both in my terminal for microns and in any output that is written to a file? When I open my output file in a hex editor, a 3F is displayed for the question mark, indicating that an actual ? is written and it isn't a foreign character. No strange chars like the ones above show up.

      I think I'm hitting a brick wall with this.



      Eschew obfuscation, espouse eludication!
        The ASCII question mark is typically what you get when something tries to convert some unicode character into some non-unicode character set that does not contain the character in question.

        For example, the following script will produce "foo??", because the string literal has unicode Cyrillic for the fourth and fifth characters, but perl is being told to convert it to iso-8859-1 (Latin-1), which does not contain any Cyrillic characters -- that is, the unicode code points for Cyrillic cannot be mapped into the single-byte character codes for Latin-1, so the conversion produces "?" instead.

        perl -MEncode -le '$_=encode("iso-8859-1","foo\x{041d}\x{0418}"); print'

        It's not just perl that does this. Anything/everything that supports conversion between unicode and other encodings will behave the same way when faced with the same inappropriate task.

        To figure out where the question marks are coming from, figure out the last point where the data were in unicode, and what sort of bad assumption is being made at that point to convert the encoding to something else.
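
One way to hunt for such a point, offered only as a sketch and not as the method used anywhere in this thread, is to make the suspect conversion fail loudly instead of substituting "?". Encode accepts a CHECK argument for this; the sample string and variable names below are invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text (an assumption for this sketch): a micron plus two Cyrillic letters.
my $text = "5 \x{B5}m and \x{041d}\x{0418}";

# Default behaviour: characters the target cannot represent silently become "?".
my $lossy = encode('iso-8859-1', $text);
print "lossy:  ", join(':', map { sprintf '%02X', ord } split //, $lossy), "\n";
# prints: lossy:  35:20:B5:6D:20:61:6E:64:20:3F:3F

# The micron exists in Latin-1 (byte B5) but not in ASCII, where it turns into 3F.
my $ascii = encode('ascii', "\x{B5}");
printf "ascii:  %02X\n", ord $ascii;
# prints: ascii:  3F

# Strict behaviour: die at the first unmappable character.
# LEAVE_SRC keeps Encode from modifying $text in place.
my $strict = eval { encode('iso-8859-1', $text, Encode::FB_CROAK | Encode::LEAVE_SRC) };
print "strict: ", defined($strict) ? "ok\n" : "failed: $@";
```

Running the real data through the strict form at each stage where an encoding is applied should show which conversion is the lossy one, since the error message names the offending character and the target encoding.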
