PerlMonks  

help needed in utf16

by uva (Sexton)
on Mar 27, 2006 at 13:40 UTC ( [id://539445]=perlquestion )

uva has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I tried to open a file as UTF-16 the same way I did with UTF-8, but it gives an error. Is it possible to open UTF-16 encoded files? Where can I get the modules for UTF-16?
Update:
The error is "UTF-16:Unrecognised BOM efbb at d:/input.txt"

Replies are listed 'Best First'.
Re: help needed in utf16
by Corion (Patriarch) on Mar 27, 2006 at 13:53 UTC

    Did you try a Google search? It pointed me to this mail, which indicates that you can open a file as UTF16 like this:

    open FH, "<:encoding(utf16)", $filename or die "Couldn't open '$filename' : $!";

    I've never used any of it myself, but I'm working from this line in the mail:

    ./perl -Ilib -we 'open(FH, "<:encoding(utf16)", "utf16");print <FH>'
Re: help needed in utf16
by graff (Chancellor) on Mar 27, 2006 at 23:36 UTC
    What is the error message that you get?

    update: After playing around with UTF-16 in perl, I have rearranged and modified the information below, and added some one-liners that I found instructive.

    If the data file comes from a "well-behaved" application, the first character will be a byte-order mark (BOM, "\x{feff}"), and using ":encoding(UTF-16)" on the file handle will always do the right thing.

    But if you use "UTF-16" and there is no BOM in the data, perl will complain, as shown below. (The quoting in the following one-liner examples assumes a "bash"-style shell, and I use unix "od" to view hex and character dumps of the output.)

        # example 1: see what perl produces for "UTF-16" output:

        $ perl -e 'binmode STDOUT, ":encoding(UTF-16)"; print "abc\n"' | od -txC -a
        0000000  fe  ff  00  61  00  62  00  63  00  0a
                 fe  ff nul   a nul   b nul   c nul  nl
        0000012

        # perl is well-behaved: it attaches a BOM when writing UTF-16 to a file handle
        # example 1 was done on a mac powerbook (big-endian), but the result should
        # not depend on the machine: per the Encode::Unicode docs, plain "UTF-16"
        # output is big-endian with a BOM prepended; to get little-endian output
        # you have to ask for "UTF-16LE" explicitly (and then no BOM is written)

        # example 2: see how perl reads "UTF-16" input:

        $ perl -e 'binmode STDOUT, ":encoding(UTF-16)"; print "abc\n"' |
          perl -e 'binmode STDIN, ":encoding(UTF-16)"; $_ = <>; print' | od -txC -a
        0000000  61  62  63  0a
                  a   b   c  nl
        0000004

        # when reading UTF-16, perl removes the BOM from input and
        # converts data internally to utf8 (in this example, the
        # result is just ascii, because there were no wide characters)

        # example 3: what perl does when byte order is specified in the encoding:

        $ perl -e 'binmode STDOUT, ":encoding(UTF-16BE)"; print "abc\n"' | od -txC -a
        0000000  00  61  00  62  00  63  00  0a
                nul   a nul   b nul   c nul  nl
        0000010

        # when byte order is specified, perl does not write a BOM

        # example 4: what perl does when reading data with no BOM:

        $ perl -e 'binmode STDOUT, ":encoding(UTF-16BE)"; print "abc\n"' |
          perl -e 'binmode STDIN, ":encoding(UTF-16)"; $_ = <>; print'
        UTF-16:Unrecognised BOM 61 at -e line 1.

        # if the reading script set "UTF-16BE" on STDIN, to match how
        # it was written, it would work correctly.
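    For anyone without a bash-style shell or "od", the same BOM behaviour can be checked inside a single script using the core Encode module. This is only a sketch: the hex_dump helper is a made-up name for illustration. It deliberately leaves out the "no BOM on input" case, because as far as I know newer Encode releases are documented as falling back to big-endian there instead of dying.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $text = "abc\n";

# Plain "UTF-16": a BOM is prepended on output.
my $with_bom = encode( 'UTF-16', $text );

# Explicit byte order: no BOM is written.
my $no_bom = encode( 'UTF-16BE', $text );

print 'UTF-16   : ', hex_dump($with_bom), "\n";
print 'UTF-16BE : ', hex_dump($no_bom),   "\n";

# Reading back: plain "UTF-16" consumes the BOM, and an explicit
# byte order decodes the BOM-less bytes.
print "round trip via UTF-16 ok\n"   if decode( 'UTF-16',   $with_bom ) eq $text;
print "round trip via UTF-16BE ok\n" if decode( 'UTF-16BE', $no_bom )   eq $text;
```

    The first line of output should match the hex dump in example 1, and the second the one in example 3.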

    SO: with utf16 data that has no BOM, it is often useful (and sometimes necessary) to look at a hex dump of the data to make sure you know what the byte order is, because you have to tell perl which byte order to use.

    Any ASCII characters in your data (e.g. spaces, tabs, carriage returns, line-feeds, alphanumerics, etc) will have a null byte as the "high byte" of the 16-bit character value; if the null byte shows up at an even-numbered byte offset (where the first byte of the file is at offset "0"), the data is "big-endian", and you need to specify "UTF-16BE" as the encoding when you open the file.

    On the other hand, if the null bytes show up at odd byte offsets, the data is little-endian, and you need to use "UTF-16LE" as the encoding.
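    The null-byte rule of thumb above can be turned into a rough check. This is a heuristic sketch only: guess_utf16_order is a hypothetical helper name, and it is only meaningful when the data contains a fair amount of ASCII-range text.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the UTF-16 byte order of BOM-less data
# by counting null bytes at even and odd byte offsets.
sub guess_utf16_order {
    my ($bytes) = @_;
    my ( $even_nulls, $odd_nulls ) = ( 0, 0 );
    for my $i ( 0 .. length($bytes) - 1 ) {
        next unless substr( $bytes, $i, 1 ) eq "\0";
        if   ( $i % 2 == 0 ) { $even_nulls++ }
        else                 { $odd_nulls++ }
    }
    return 'UTF-16BE' if $even_nulls > $odd_nulls;    # nulls come first
    return 'UTF-16LE' if $odd_nulls > $even_nulls;    # nulls come second
    return;                                           # no clear signal
}

# "abc\n" written by hand in each byte order
my $be_sample = "\0a\0b\0c\0\n";
my $le_sample = "a\0b\0c\0\n\0";

print guess_utf16_order($be_sample) // 'unknown', "\n";
print guess_utf16_order($le_sample) // 'unknown', "\n";
```

    In practice you would feed it the first few hundred bytes read from the file in raw mode, and still eyeball a hex dump when the answer comes back undefined.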

    There are CPAN modules for the BOM, but you can also check it yourself:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $encoding = 'UTF-16';
        my $Usage = "Usage: $0 [-BE|-LE] file.u16\n";

        if ( @ARGV and $ARGV[0] =~ /^-([BL]E)$/ ) {
            $encoding .= $1;
            shift;
        }
        die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );
        my $filename = pop @ARGV;

        # if user didn't specify byte order, let's check the input file
        if ( $encoding eq 'UTF-16' ) {
            my $first_short;
            open( F, "<", $filename ) or die "$filename: $!";
            binmode F;    # read raw bytes, no newline translation
            my $n = sysread( F, $first_short, 2 );
            close F;
            die "sysread failed on $filename" unless ( defined $n and $n == 2 );
            # these are byte strings, so compare with "eq", not "=="
            if ( $first_short eq pack( 'S', 0xfeff ) or
                 $first_short eq pack( 'S', 0xfffe ) ) {
                # it's a BOM, and using ":encoding(UTF-16)" is fine
            }
            else {
                die "$filename has no BOM; please specify byte order\n$Usage";
            }
        }
        open( F, "<:encoding($encoding)", $filename ) or die "$filename: $!";
        # ... and go to work...
    (this sample code has been heavily updated relative to initial posting, to include a usage statement, handling of an appropriate command-line option for byte order, proper use of "pack" to test for the BOM value, and proper handling when BOM is present or absent.)
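    One more thing worth checking: the "efbb" in the original error message matches the first two bytes of the UTF-8 BOM (EF BB BF), so the file in question may well be UTF-8 rather than UTF-16. A minimal sketch that extends the BOM check to cover that case follows. The names sniff_bom and sniff_file are hypothetical, and UTF-32 BOMs are ignored here (a UTF-32LE BOM also starts with FF FE).

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a file to an encoding name.
sub sniff_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return;    # no BOM recognised
}

# Hypothetical helper: read the first three raw bytes of a file and sniff them.
sub sniff_file {
    my ($path) = @_;
    open( my $fh, '<:raw', $path ) or die "$path: $!";
    my $got = read( $fh, my $head, 3 );
    close $fh;
    die "read failed on $path: $!" unless defined $got;
    return sniff_bom($head);
}

if (@ARGV) {
    my $guess = sniff_file( $ARGV[0] );
    print defined $guess ? "$ARGV[0]: BOM says $guess\n" : "$ARGV[0]: no BOM found\n";
}
```

    If this reports a UTF-16 BOM, plain ":encoding(UTF-16)" should be fine; if it reports UTF-8, opening the file as UTF-16 is the wrong approach altogether.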
Re: help needed in utf16
by SamCG (Hermit) on Mar 27, 2006 at 18:55 UTC


    I've successfully used:
        open FH, $file_name or die "could not open $file_name: $!\n";
        binmode FH, ":encoding(UTF-16)";
    though I'd also point out that it's not recommended to use bare filehandles anymore. However, I'm not sure whether the missing binmode call is actually your problem, since you don't describe your error. What I saw before using binmode was that the file would appear broken -- instead of "this\tis\tthe\tfile" (in a tab-delimited file), I'd see "t h i s i s t h e f i l e" (or something close to that).
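    For reference, the same idea with a lexical filehandle and three-argument open might look like the sketch below. It assumes the file starts with a BOM; read_utf16_lines is a hypothetical helper name, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: read a UTF-16 file (with BOM) into a list of lines.
sub read_utf16_lines {
    my ($file_name) = @_;
    # Lexical filehandle, three-argument open, encoding layer set at open time
    open( my $fh, '<:encoding(UTF-16)', $file_name )
        or die "could not open $file_name: $!\n";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

if (@ARGV) {
    # Encode on the way out so wide characters print without warnings
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for read_utf16_lines( $ARGV[0] );
}
```

    Setting the layer in the open call does the same job as the separate binmode, and the handle is closed automatically if it goes out of scope early.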
