Unicode Puzzle

by Skeeve (Vicar)
on Aug 12, 2010 at 21:16 UTC
Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I'm a bit puzzled. I'm running a file through openssl to decode it. A portion of the output that I'm interested in is said to be "in Unicode, padded with 00".

So when I look at it in a hexdump, it reads like this:

00 4d 00 61 00 67 00 6e - 00 65 00 74 00 00 00 00 [.M.a.g.n.e.t....]

There are several strings of bytes which I read from openssl's output. I unpack them into an array. Demo code could be something like this:

$buf = "\x00M\x00a\x00g\x00n\x00e\x00t\x00\x00\x00\x00" x 4;
my (@unpacked) = unpack "a16" x 4, $buf;

So now I wonder how to convert the byte strings in @unpacked to proper Perl strings which I can print out without the \x00. Additionally, I'd like to remove the padding zeroes.

I tried to use decode_utf8 and decode("utf16", ...) on them, but the first one did not seem to have any impact and the latter one fails with (example) "UTF-16:Unrecognised BOM 4d"

Does anyone have a hint as to what I'm doing wrong?


Re: Unicode Puzzle
on Aug 12, 2010 at 21:27 UTC

    It appears to be UTF-16be or UCS-2be (no way to know from what you posted).

    UTF-16 has two possible byte orders, so telling decode just "UTF-16" is not enough unless there's a BOM to indicate byte order.

    The "padded with 00" bit probably refers to the two U+0000 at the end.
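
    A minimal sketch of the above, reusing the demo buffer from the question: name the byte order explicitly when decoding, then strip the trailing U+0000 padding. The variable @strings is just an illustrative name, and UTF-16BE is assumed here; UCS-2BE would be used the same way.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Four 16-byte fields, as in the demo code from the question.
my $buf      = "\x00M\x00a\x00g\x00n\x00e\x00t\x00\x00\x00\x00" x 4;
my @unpacked = unpack "a16" x 4, $buf;

my @strings = map {
    # Name the byte order explicitly; the data carries no BOM.
    my $s = decode('UTF-16BE', $_);
    # Drop the trailing U+0000 padding characters.
    $s =~ s/\0+\z//;
    $s;
} @unpacked;

print "$_\n" for @strings;    # prints "Magnet" four times
```

    The padding is removed after decoding so that the regex works on characters rather than on raw bytes.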

      Many thanks, ikegami! Both (utf16be and ucs2be) seem to work. The data I have so far does not help me decide which one is the "real" one.
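
      For text that stays within the Basic Multilingual Plane, UTF-16BE and UCS-2BE produce identical bytes, which is why both decode the sample. They differ only for characters above U+FFFF, which UTF-16 encodes as a surrogate pair and UCS-2 cannot represent. A rough sketch for probing that difference follows; the test bytes and the %decoded hash are made up for illustration, and I'd expect UCS-2BE to reject the pair when strict checking is requested, though that is worth confirming against your Encode version.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+1F600 encoded as a UTF-16BE surrogate pair (D83D DE00).
my $bytes = "\xD8\x3D\xDE\x00";

# Die on malformed input, and leave $bytes unmodified.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my %decoded;
for my $enc ('UTF-16BE', 'UCS-2BE') {
    my $str = eval { decode($enc, $bytes, $check) };
    $decoded{$enc} = $str;
    if (defined $str) {
        printf "%s: decoded %d character(s)\n", $enc, length $str;
    }
    else {
        print "$enc: rejected the input\n";
    }
}
```

      If the real data ever contains such pairs, only UTF-16BE will decode them into single characters; if it never does, the choice makes no practical difference.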

        If you can get the iconv library and its associated binaries for your platform, they're good tools for Unicode inspection and for test or real conversions.
        the hardest line to type correctly is: stty erase ^H
Re: Unicode Puzzle
on Aug 13, 2010 at 06:56 UTC
    ++ for including hexdump output in your question, and generally providing enough information to answer the question without huge guesswork.
