http://www.perlmonks.org?node_id=504295

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks!

I'm a bit lost with utf-8 conversion. For a FictionBook 2 eReader conversion script, I need to have "translations" for some UTF-8 characters to the appropriate eReader characters.

For this I took part of the table found at eReader.com and stored it as a UTF-8 file:

¡ ¡ ¡ \a161 Inverted exclamation
¢ ¢ ¢ \a162 Cent sign
£ £ £ \a163 Pound sign
: : skipped :
œ œ œ \a156 Small combined oe
Ÿ Ÿ Ÿ \a159 Large Y with diaeresis

Next I wanted to prefix each line with the four-digit hex Unicode code point of its first character, using a one-liner (split here for better readability):

perl -i.bak -pe '\
binmode STDIN,":utf8"; \
binmode STDOUT,":utf8"; \
if (/^([^[:ascii:]])/) { \
$_= sprintf("%04x",ord $1).$_ \
}' pml.txt
Unfortunately I seem to be missing something. I get data like this:
00c2¡ ¡ ¡ \a161 Inverted exclamation
00c2¢ ¢ ¢ \a162 Cent sign
00c2£ £ £ \a163 Pound sign
Getting 00c2 three times can't be right.

Do you see my mistake?


Update: Experimenting and reading perldoc perlrun, especially about -C led me to this version, which seems to work quite well:

perl -i.bar -CDS -pe ' \
if (/^([^[:ascii:]])/) { \
$_= sprintf("%04x",ord $1).$_ \
}' pml.txt

Update2: No... It still doesn't work


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Re: get UTF-8 character codes
by theorbtwo (Prior) on Oct 31, 2005 at 16:21 UTC

    Try running perl -i.bak -pe '42'.

    The implicit loop of -p reads from the same place <> does -- which is ARGV, not STDIN.



    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      What I understand now is that calling binmode on STDIN and STDOUT doesn't help.

      What I still don't get is what does help.

      I'm still lost...

      Update: Thanks to Errto for correcting me. OTOH: I did that, replacing STDIN by ARGV. It didn't help.

      But it seems my expectations were wrong: my first update above appears to have worked after all. I'm still investigating...


        It's STDIN that's causing the problem, not STDOUT. You need to be working on ARGV instead of STDIN. An easier way of dealing with it, especially since this is a one-liner anyway, is to add the following before -e:
        -Mopen=:utf8,:std
        See the documentation of the open pragma (perldoc open) for more.