Converting UTF-16 files to UTF-8

by demerphq (Chancellor)
on May 16, 2007 at 14:48 UTC (#615796)
demerphq has asked for the wisdom of the Perl Monks concerning the following question:

I would have thought the following (quick hack) script would work:

use strict;
use warnings;

my ($inf,$outf)= @ARGV;
$inf or die "Must have a file to process\n" ;
$outf or $outf= $inf.".utf8";

open my $in, "<:encoding(utf16)", $inf
    or die "Can't open '$inf':$!";
open my $out, ">:utf8", $outf
    or die "Can't write '$outf':$!";
local $/; # slurp mode!
print {$out} <$in> # text
    or die "Failed to convert file:$!";
close $in
    or die "Something weird happened closing '$inf': $!";
close $out
    or die "Failed to close '$outf', file is probably corrupted: $!";

Or even the more elegant one liner:

perl -pe "BEGIN {binmode STDIN, ':encoding(utf16)'; binmode STDOUT, ':utf8'}"

But it doesn't work. If I use an input file with a few (three) Ĕ in it (0x0114), saved in utf-16 by Ultraedit on win2k, I end up with a file with the octets FF FE 14 01 14 01 14 01, and after conversion the output file has the octets EF BB BF C2 BE 00 14 00 01 00 14 00 01 00 14 00 01, which is just wrong. Can anybody spot what the problem is, or is Perl's UTF-16 support borked?

Note that this was with Perl 5.8.6 from ActiveState.

Update: Turns out that this was all down to a display bug in Ultraedit. Thanks for the help, and sorry for wasting anybody's time.

---
$world=~s/war/peace/g

Re: Converting UTF-16 files to UTF-8
by ikegami (Pope) on May 16, 2007 at 15:11 UTC

    It works for me in 5.8.8. It should work in 5.8.6 too.

    >debug in
    File not found
    -e100 FF FE 01 14 01 14 01 14
    -rcx
    CX 0000
    :8
    -w
    Writing 00008 bytes
    -q

    >perl 615796.pl in out

    >debug out
    -rcx
    CX 0009
    :
    -d100 l9
    137A:0100  E1 90 81 E1 90 81 E1 90-81   .........
    -q

    By the way, I've had problems with using :encoding(...) and multi-byte character sets. I recommend :raw:encoding(...), but keep in mind it has the side effect of not adding/removing a CR that precedes a LF.

      Given that you are using debug, I assume this is on Windows... But which perl? An AS build?

      ---
      $world=~s/war/peace/g

        Yes

        This is perl, v5.8.8 built for MSWin32-x86-multi-thread
        (with 50 registered patches, see perl -V for more detail)

        Copyright 1987-2006, Larry Wall

        Binary build 820 [274739] provided by ActiveState http://www.ActiveState.com
        Built Jan 23 2007 15:57:46

        on WinXP Pro

      I recommend :raw:encoding(...), but keep in mind it has the side effect of not adding/removing a CR that precedes a LF.

      Maybe it's worth mentioning (though it has nothing to do with the OP's problem) that, in some cases, you can work around this CR/LF issue by re-adding the Windows-specific crlf PerlIO layer (which is removed by :raw) at a different position in the layer stack, e.g.

      :raw:encoding(utf16le):crlf:utf8

      as I described in more detail in this node.
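
For what it's worth, here is a minimal sketch of that layer stack in use, under a few assumptions: the temp file comes from File::Temp, the encoding name is spelled UTF-16LE (so no BOM is written or expected), and the variable names are made up for illustration. It writes one line through the stack, looks at the raw octets, then reads the line back through the same stack.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layer stack from the post above; ">" or "<" is prepended when opening.
my $layers = ':raw:encoding(UTF-16LE):crlf:utf8';

# Scratch file (hypothetical; any writable path would do).
my (undef, $path) = tempfile(UNLINK => 1);

# Write one line: the crlf layer turns "\n" into "\r\n" before
# the encoding layer turns the characters into UTF-16LE octets.
open my $out, ">$layers", $path or die "Can't write '$path': $!";
print {$out} "\x{0114}\n";
close $out or die "Can't close '$path': $!";

# Look at the octets that actually landed on disk.
my $bytes = do {
    open my $raw, '<:raw', $path or die "Can't read '$path': $!";
    local $/;                      # slurp mode, scoped to this block
    <$raw>;
};
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";

# Read it back through the same stack: "\r\n" becomes "\n" again.
open my $in, "<$layers", $path or die "Can't read '$path': $!";
my $line = <$in>;
close $in;
```

The trailing :utf8 is there so that the crlf layer on top is marked as carrying characters rather than bytes; without it, non-ASCII text may not survive the round trip.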

Re: Converting UTF-16 files to UTF-8
by Krambambuli (Deacon) on May 16, 2007 at 15:22 UTC
    Works for me too; perl v5.8.8 built for i386-linux-thread-multi on Fedora Core release 6 (Zod).
Re: Converting UTF-16 files to UTF-8
by graff (Chancellor) on May 16, 2007 at 21:14 UTC
    ... I use an input file with a few (three) Ĕ in it (0x0114), saved in utf-16 by Ultraedit on win2k I end up with a file with the octets FF FE 01 14 01 14 01 14 ...

    Um... If you're using ActiveState on win2k, and you have actually shown those 8 octets in their true "logical" (file sequential) order, then I'm puzzled about the data you have created using "Ultraedit".

    The Byte Order Mark (BOM, \x{FEFF}) appears to be written in little-endian order (as we would expect for wintel), but if the next six byte pairs are supposed to be interpreted as "\x{0114}", they would have to be treated as big-endian.

    What's up with that? I'm as mystified as you as to why your initial output has all those null bytes, but it looks like a case of "garbage in, garbage out". Try using perl to generate your test data instead:

    perl -e 'binmode STDOUT,":encoding(utf16)"; print "\x{0114}\n"x3'
    Redirect that to a file, or pipe it directly to your elegant one-liner, and see if that gives you better results.

    (update: My "data generator" one-liner was done on unix; for mswin, you need to change single-quotes to doubles and vice-versa... but then the "\x{0114}" thing breaks. Oh well -- use a bash shell or put the script in a file.)
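
A sketch of the "put the script in a file" route, which sidesteps the shell quoting problem entirely. The sub name write_test_data and the command-line handling are made up for illustration, and :raw is added (per the recommendation earlier in the thread) so that no CR is inserted on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: write $count lines, each a single U+0114,
# as UTF-16 with a leading BOM.
sub write_test_data {
    my ($path, $count) = @_;
    open my $fh, '>:raw:encoding(UTF-16)', $path
        or die "Can't write '$path': $!";
    print {$fh} "\x{0114}\n" x $count;
    close $fh
        or die "Can't close '$path': $!";
    return $path;
}

# Only runs when a file name is given on the command line.
my $target = shift @ARGV;
write_test_data($target, 3) if defined $target;
```

Saved as, say, gen.pl (the name is arbitrary), `perl gen.pl in` should produce a file that the conversion script in the original post can read.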

      then I'm puzzled about the data you have created

      Well, it seems you have a very good eye. :-) That was a typo on my part; it is actually FF FE 14 01 14 01 14 01 ...

      I'll update my original node.

      ---
      $world=~s/war/peace/g

        Even with the change, I still don't get your result with ActivePerl 5.8.8 build 820
        >debug in
        File not found
        -e100 FF FE 14 01 14 01 14 01
        -rcx
        CX 0000
        :8
        -w
        Writing 00008 bytes
        -q

        >perl 615796.pl in out

        >debug out
        -rcx
        CX 0006
        :
        -d100 l6
        137A:0100  C4 94 C4 94 C4 94
        -q
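
The two debug sessions in this thread can also be reproduced without files or a hex editor. This is only a sketch: it uses Encode's decode and encode on in-memory byte strings instead of PerlIO layers, and utf16_to_utf8 and hex_dump are helper names invented here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: BOM-prefixed UTF-16 octets in, UTF-8 octets out.
sub utf16_to_utf8 {
    my ($octets) = @_;
    my $text = decode('UTF-16', $octets);   # the BOM picks the byte order
    return encode('UTF-8', $text);
}

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

# The corrected input (three U+0114 after a little-endian BOM) ...
my $corrected = utf16_to_utf8("\xFF\xFE\x14\x01\x14\x01\x14\x01");
# ... and the input as first posted, which is really three U+1401.
my $typo      = utf16_to_utf8("\xFF\xFE\x01\x14\x01\x14\x01\x14");

print hex_dump($corrected), "\n";   # C4 94 C4 94 C4 94
print hex_dump($typo), "\n";        # E1 90 81 E1 90 81 E1 90 81
```

Both results match the debug dumps above, which suggests the conversion itself is fine for either input.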
Re: Converting UTF-16 files to UTF-8
by Errto (Vicar) on May 17, 2007 at 01:11 UTC

    In the Windows world the default "Unicode" encoding is UTF-16 little-endian, which I'm guessing is what UltraEdit saves. However, in Perl's Encode world, "utf16" without a byte order specified assumes big-endian by default, so you need to say "utf16le" if you mean the other thing.

    That said, it seems to me the presence of the BOM (which is mandatory) should mean, at least theoretically, that Perl could figure out for itself which one it is and it wouldn't go all kablooey. But I could be wrong.

    Update: It occurs to me that I may have this entirely backwards. But I think the general idea is at least close - that is, you're saving the file in one byte order and Perl's reading it in another.
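
To make the byte-order point concrete, here is a small sketch that decodes the same two octets under each explicit byte order. It assumes nothing beyond the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $pair = "\x14\x01";   # one 16-bit code unit, as it sits in the file

# Explicit byte orders: no BOM is looked for or consumed.
my $as_le = decode('UTF-16LE', $pair);
my $as_be = decode('UTF-16BE', $pair);

printf "little-endian: U+%04X\n", ord $as_le;   # U+0114
printf "big-endian:    U+%04X\n", ord $as_be;   # U+1401
```

So reading little-endian data as big-endian does not fail; it silently yields a different character, which is the kind of mismatch described above.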

      I don't know where this is documented (I'm sure it must be somewhere), but having played with it a bit, it seems that perl's "UTF-16" (no "BE" or "LE") means "There needs to be a BOM at the start".

      Perl will write a BOM on initial output to a file handle that is set for this encoding (and will use your machine's native byte order). On initial input, it will error out with "UTF-16:Unrecognized BOM xxxx" unless the first two bytes are either 0xFF 0xFE or 0xFE 0xFF. (And yes, if the first two bytes are one of those two pairings, it will use the given byte order.)
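
      That behaviour can be poked at in memory with a sketch like the one below, which uses Encode directly rather than a file handle (an assumption on my part that the layer and the module behave alike here). One caveat: as I read the Encode::Unicode documentation, a plain "UTF-16" encode emits big-endian with a BOM rather than the machine's native order, so the sketch only looks at whether a BOM is present.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Either BOM is accepted on input and selects the byte order.
my $from_le = decode('UTF-16', "\xFF\xFE\x14\x01");
my $from_be = decode('UTF-16', "\xFE\xFF\x01\x14");

# No BOM at all: decode dies, so trap the error message.
my $error = eval { decode('UTF-16', "\x14\x01"); 1 } ? '' : $@;

# On output a BOM is prepended to the encoded character.
my $encoded = encode('UTF-16', "\x{0114}");

printf "LE BOM -> U+%04X, BE BOM -> U+%04X\n", ord $from_le, ord $from_be;
print  "no BOM -> $error";
printf "encode -> %s\n", join ' ', map { sprintf '%02X', ord } split //, $encoded;
```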
