Unicode Puzzle

by Skeeve (Vicar)
on Aug 12, 2010 at 21:16 UTC (#854769=perlquestion)
Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I'm a bit puzzled. I'm running a file through openssl to decode it. The portion of the output I'm interested in is said to be "in Unicode, padded with 00".

So when I look at it in a hexdump, it reads like this:

00 4d 00 61 00 67 00 6e - 00 65 00 74 00 00 00 00 [.M.a.g.n.e.t....]

There are several strings of bytes which I read from openssl's output. I unpack them into an array. Demo code could be something like this:

$buf = "\x00M\x00a\x00g\x00n\x00e\x00t\x00\x00\x00\x00" x 4;
my(@unpacked) = unpack "a16" x 4, $buf;

So now I wonder how to convert the bytestrings in @unpacked to proper perl strings which I can print out without the \x00. Additionally I'd like to remove the padded zeroes.

I tried to use decode_utf8 and decode("utf16", ...) on them, but the former did not seem to have any effect, and the latter fails with, for example, "UTF-16:Unrecognised BOM 4d".

Does anyone have a hint as to what I'm doing wrong?


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: Unicode Puzzle
by ikegami (Pope) on Aug 12, 2010 at 21:27 UTC

    It appears to be UTF-16be or UCS-2be (no way to know from what you posted).

    UTF-16 has two possible byte orders, so telling decode just "UTF-16" is not enough unless there's a BOM to indicate byte order.
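    A minimal sketch of the BOM point, using a shortened sample string ("Ma") rather than the full data from the question. It assumes only the core Encode module. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two big-endian 16-bit code units: "M" and "a", with no BOM in front.
my $be = "\x00M\x00a";

# With a BOM (FE FF means big-endian), plain "UTF-16" can work out the byte order.
my $with_bom = decode('UTF-16', "\xFE\xFF" . $be);

# Without a BOM, plain "UTF-16" dies; eval captures the error message.
my $no_bom = eval { decode('UTF-16', $be) };
my $error  = $@;

# Naming the byte order explicitly removes the ambiguity.
my $explicit = decode('UTF-16BE', $be);

print "with BOM: $with_bom\n";
print "explicit: $explicit\n";
print "error:    $error";
```

    The captured error should resemble the "Unrecognised BOM 4d" message quoted in the question, because the decoder reads the first two bytes (00 4d) as a byte order mark.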

    The "padded with 00" bit probably refers to the two U+0000 at the end.
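    Putting this together with the demo code from the question, one possible approach is sketched below. It assumes the data really is big-endian and that the padding consists only of trailing U+0000 characters. The name @strings is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample data from the question: four 16-byte fields, each holding
# "Magnet" as big-endian 16-bit code units followed by two U+0000.
my $buf = "\x00M\x00a\x00g\x00n\x00e\x00t\x00\x00\x00\x00" x 4;
my @unpacked = unpack "a16" x 4, $buf;

my @strings = map {
    my $s = decode('UTF-16BE', $_);  # byte order given explicitly, so no BOM is needed
    $s =~ s/\x{0}+\z//;              # drop the trailing U+0000 padding
    $s;
} @unpacked;

print "$_\n" for @strings;
```

    The padding is stripped after decoding, so the regex works on characters rather than on bytes and cannot cut a 16-bit code unit in half.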

      Many thanks, ikegami! Both (utf16be and ucs2be) seem to work. The data I have does not yet help me decide which one is the "real" one.
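      A hedged sketch of where the two encodings can differ: characters outside the Basic Multilingual Plane, which UTF-16 stores as a surrogate pair and UCS-2 cannot represent. The sample bytes are invented for illustration and do not come from the openssl output. For text without surrogates, such as "Magnet", both decoders give the same result, which is why the sample data cannot settle the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+1F600 lies outside the BMP, so UTF-16 stores it as the
# surrogate pair D83D DE00 (big-endian bytes shown here).
my $bytes = "\xD8\x3D\xDE\x00";

my $utf16 = decode('UTF-16BE', $bytes);  # the pair is combined into one code point
my $ucs2  = decode('UCS-2BE',  $bytes);  # UCS-2 has no notion of surrogate pairs

printf "UTF-16BE: %d character(s), first is U+%04X\n", length $utf16, ord $utf16;
printf "UCS-2BE:  %d character(s), first is U+%04X\n", length $ucs2,  ord $ucs2;
```

      If the real data ever contains code units in the range D800 to DFFF, comparing the two results this way would show which decoder handles it sensibly.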


        If you can get the iconv library and its associated binaries for your platform, they're good tools for Unicode inspection and for both test and real conversions.
        the hardest line to type correctly is: stty erase ^H
Re: Unicode Puzzle
by moritz (Cardinal) on Aug 13, 2010 at 06:56 UTC
    ++ for including hexdump output in your question, and generally providing enough information to answer the question without huge guesswork.
    Perl 6 - links to (nearly) everything that is Perl 6.
