http://www.perlmonks.org?node_id=579275



Input looks like:
日曜日
月曜日
火曜日
水曜日
木曜日
金曜日
土曜日
saved in WordPad as a Unicode text document.

Mostly at the moment I'm thrashing around trying to get a handle on how to do this. I'm trying to use the Unicode::String classes for input/output, but everything I've tried produces files of different sizes with mojibake in them.

I've been beating on this for several hours. At this point I'd be happy with just reading the file into a list and writing it out again, with both files having the same size and the same contents. Of course I don't mean just opening in binmode and copying bytes from here to there. I need to use regexes to manipulate the contents, but first things first.

Re^3: unicode in windows
by ikegami (Patriarch) on Oct 19, 2006 at 06:10 UTC
    What have you tried? The following works for me:
    use strict;
    use warnings;

    open(my $fh_in, '<:raw:encoding(utf16le)', 'src.txt')
        or die("Unable to open src.txt: $!\n");
    open(my $fh_out, '>:raw:encoding(utf16le)', 'dst.txt')
        or die("Unable to create dst.txt: $!\n");

    while (<$fh_in>) {
        # U+706B and U+6C34 are the first characters of lines 3 and 4 of the sample
        if (/[\x{706B}\x{6C34}]/) {
            print("Found one at line $.!\n");
        }
        print $fh_out $_;
    }

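    The :raw prevents the CRLF->LF conversion when reading and the LF->CRLF conversion when writing. That conversion only works with single-byte, ASCII-based (i.e. LF=0xA, CR=0xD) encodings. In UTF-16 every character takes at least two bytes, so a byte-level translation corrupts the data.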
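    To see why a byte-level newline translation breaks UTF-16, here is a small sketch. It only simulates the LF->CRLF step with a substitution on the encoded bytes, which is an assumption made for illustration. It does not exercise the real :crlf layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One line of text: U+706B followed by a newline.
my $text = "\x{706B}\n";

# What an :encoding(UTF-16LE) layer hands to the layer below it:
# two bytes per character, so LF becomes 0A 00.
my $bytes = encode('UTF-16LE', $text);

# Simulated byte-level LF -> CRLF translation applied AFTER encoding.
# This stands in for the newline conversion; it is not the real layer.
(my $mangled = $bytes) =~ s/\x0A/\x0D\x0A/g;

printf "encoded: %s (%d bytes)\n",
    uc(join ' ', unpack('(H2)*', $bytes)), length $bytes;
printf "mangled: %s (%d bytes)\n",
    uc(join ' ', unpack('(H2)*', $mangled)), length $mangled;
```

    The mangled string has an odd number of bytes, so it can no longer be split into 16-bit code units. That is the kind of damage that shows up as mojibake and changed file sizes.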

    Use :raw:encoding(utf16le) if the file was saved using encoding "Unicode"
    Use :raw:encoding(utf16be) if the file was saved using encoding "Unicode big endian"
    Use :raw:utf8 if the file was saved using encoding "UTF-8"
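    For the "same size, same contents" check asked about above, a self-contained round trip might look like the sketch below. The file names, the four sample lines, and the leading byte order mark are assumptions for illustration. A real file saved from WordPad may differ.

```perl
use strict;
use warnings;

# Hypothetical file names for this sketch.
my ($src, $dst) = ('src.txt', 'dst.txt');

# Sample lines written with escapes so the script itself stays ASCII.
my @days = (
    "\x{65E5}\x{66DC}\x{65E5}",
    "\x{6708}\x{66DC}\x{65E5}",
    "\x{706B}\x{66DC}\x{65E5}",
    "\x{6C34}\x{66DC}\x{65E5}",
);

# Create a UTF-16LE sample with a BOM and explicit CRLF line endings.
open(my $fh, '>:raw:encoding(UTF-16LE)', $src)
    or die("Unable to create $src: $!\n");
print {$fh} "\x{FEFF}";
print {$fh} "$_\r\n" for @days;
close($fh) or die("Unable to close $src: $!\n");

# Copy it line by line through the decoding/encoding layers,
# recording which lines match a regex on decoded characters.
open(my $fh_in, '<:raw:encoding(UTF-16LE)', $src)
    or die("Unable to open $src: $!\n");
open(my $fh_out, '>:raw:encoding(UTF-16LE)', $dst)
    or die("Unable to create $dst: $!\n");

my @hits;
while (my $line = <$fh_in>) {
    push @hits, $. if $line =~ /[\x{706B}\x{6C34}]/;
    print {$fh_out} $line;
}
close($fh_in);
close($fh_out) or die("Unable to close $dst: $!\n");

printf "%s: %d bytes, %s: %d bytes, matches on lines: %s\n",
    $src, (-s $src), $dst, (-s $dst), "@hits";
```

    Because the explicit-endian UTF-16LE encoding does not consume a byte order mark, the BOM arrives as U+FEFF at the start of the first line and is copied through unchanged. If a pattern is anchored with ^, removing it first with s/^\x{FEFF}// on line 1 is one option, at the cost of the output being two bytes shorter.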

    Update: Added example regexp to code.