http://www.perlmonks.org?node_id=829346



Thanks, almut -

This solved part of the problem. The file is now read in correctly and the expected regex matches occur. However, the output is still causing problems because the non-ASCII characters in it are not represented correctly. I have tried the following two approaches:

1) Printing to the DOS console and redirecting the output from there into a file. The result looks fine, except that the special characters are represented as EF, FC, etc.

2) Printing to a UTF-16 encoded file with the following code:
#! /usr/bin/perl -w
use strict;
use locale;

open INPUT, "<:encoding(UTF-16LE)", $ARGV[0];
open OUTPUT, ">:encoding(UTF-16LE)", "./Output_UTF-16";

while ( <INPUT> ) {
    # long list of regex-based replacements
    print OUTPUT $_;
}
The result was an output file which represented all special characters correctly but contained a line of empty boxes in every second line.

Can you please advise what I need to do to fix both output variants?

Thanks again for your help!
Pat

Re^3: Text File Encoding under Windows
by almut (Canon) on Mar 18, 2010 at 09:35 UTC

    What is the desired (or required) output encoding, i.e. which program are you using to view or further process the output? (can it handle UTF-16?)

    What special characters are involved; may they also be represented in a non-unicode legacy encoding such as ISO-8859-1 (ISO-Latin1) or Windows CP1252?

    Maybe just try other output encodings:

    open OUTPUT, ">:encoding(UTF-8)", ...
    open OUTPUT, ">:encoding(CP1252)", ...
    open OUTPUT, ">:encoding(ISO-8859-1)"...
    ...

    The latter two should be used in combination with :encoding(UTF-16) (rather than UTF-16LE) on the input side, because that layer swallows the BOM, which you don't want in non-Unicode output (in the case of UTF-8 the BOM is optional, so you can decide for yourself).
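
A minimal sketch of the suggestion above, not a drop-in replacement for the original script. It assumes the input is UTF-16LE with a leading BOM; the file names, the temporary directory and the sample text are hypothetical and exist only so the example can run on its own. It reads through :encoding(UTF-16), writes through :encoding(UTF-8), and then dumps the output bytes so you can check that no BOM was carried over.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical file names, created in a temporary directory for the demo.
my $dir      = tempdir( CLEANUP => 1 );
my $in_file  = "$dir/input_utf16.txt";
my $out_file = "$dir/output_utf8.txt";

# Build a sample input: UTF-16LE bytes with a leading BOM (U+FEFF).
open my $sample, '>:raw', $in_file or die "Cannot write $in_file: $!";
print {$sample} encode( 'UTF-16LE', "\x{FEFF}Gr\x{FC}\x{DF}e\nna\x{EF}ve\n" );
close $sample or die "Cannot close $in_file: $!";

# UTF-16 (no LE/BE suffix) uses the BOM to detect byte order and drops it.
open my $input,  '<:encoding(UTF-16)', $in_file  or die "Cannot read $in_file: $!";
open my $output, '>:encoding(UTF-8)',  $out_file or die "Cannot write $out_file: $!";

while ( my $line = <$input> ) {
    # regex-based replacements would go here
    print {$output} $line;
}

close $input;
close $output or die "Cannot close $out_file: $!";

# Read the result back as raw bytes to inspect what was actually written.
open my $check, '<:raw', $out_file or die "Cannot read $out_file: $!";
my $bytes = do { local $/; <$check> };
close $check;

# Print the output as space-separated hex bytes.
print join( ' ', map { sprintf '%02X', ord($_) } split //, $bytes ), "\n";
```

Swapping the output layer for ">:encoding(CP1252)" or ">:encoding(ISO-8859-1)" tries the other two encodings with the same input handling.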

    P.S.: can you view the original input file correctly with the same program that's showing the empty boxes with the output file?  (btw, do you really mean "line" in "a line of empty boxes in every second line", or rather character "column"? — the former would be kinda strange...)

      Thanks, almut -

      This was very helpful! I found that UTF-8 output encoding in fact worked fine and all the special characters displayed correctly. The application operating on the modified files (whose required encoding I did not know) accepted the input thus created.

      Thanks again for your help! Problem resolved.

      Cheers -

      Pat