Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows

in reply to Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows
in thread Chicanery Needed to Handle Unicode Text on Microsoft Windows

As far as I can see, the only reason you need :crlf is because you've specifically added the UNIX line ending (\n) to your output.

:crlf is needed here to get the same platform-independent line-ending handling of plain text files Perl has always supported. Without it, line-ending handling is badly broken: chomp removes only the LF half of the CRLF pair and leaves the CR on the end of every line.

D:\>cat Demo.pl
#!perl

use strict;
use warnings;

open my $input_fh, '<:raw:perlio:encoding(UTF-16LE)', 'Input.txt'
    or die "Can't open Input.txt: $!";

while (my $line = <$input_fh>) {
    chomp $line;
    print "There's an unexpected/unwanted CR at the end of the line\n"
        if $line =~ m/\r$/;
}

D:\>file Input.txt
Input.txt:      Text file, Unicode little endian format

D:\>cat Input.txt
We the People of the United States, in Order to form a more perfect
Union, establish Justice, insure domestic Tranquility, provide for
the common defence, promote the general Welfare, and secure the
Blessings of Liberty to ourselves and our Posterity, do ordain and
establish this Constitution for the United States of America.

D:\>perl Demo.pl Input.txt
There's an unexpected/unwanted CR at the end of the line
There's an unexpected/unwanted CR at the end of the line
There's an unexpected/unwanted CR at the end of the line
There's an unexpected/unwanted CR at the end of the line
There's an unexpected/unwanted CR at the end of the line

D:\>
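For comparison, here is a minimal sketch of the same check run twice, once without and once with :crlf on top of the encoding layer. It builds its own small UTF-16LE file in a temporary location rather than relying on Input.txt, and count_trailing_cr is a hypothetical helper name of mine, not something from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: count the lines that still end in CR after chomp
# when the file is read through the given layer string.
sub count_trailing_cr {
    my ($layers, $path) = @_;
    open my $fh, "<$layers", $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $count++ if $line =~ /\r\z/;
    }
    close $fh;
    return $count;
}

# Build a two-line UTF-16LE file with CRLF line endings, written as raw bytes.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} encode('UTF-16LE', "first line\r\nsecond line\r\n");
close $tmp_fh or die "Cannot close $tmp_path: $!";

printf "without :crlf: %d line(s) end in CR\n",
    count_trailing_cr(':raw:perlio:encoding(UTF-16LE)', $tmp_path);
printf "with :crlf:    %d line(s) end in CR\n",
    count_trailing_cr(':raw:perlio:encoding(UTF-16LE):crlf', $tmp_path);
```

If the layers behave as described above, the first count should be 2 and the second 0, because :crlf sits above the decoder and sees decoded characters rather than UTF-16 bytes.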

And as Anonymous Monk has already pointed out, \n is the express mechanism in Perl intended to make line-ending handling platform-independent. It is defined to mean not the LF-only Unix line ending, but whatever character or character combination terminates lines of plain text files on the platform in use.
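A small sketch of what that looks like in practice: inside the program "\n" is a single character, and it is the I/O layer that turns it into CRLF on the way out. The :crlf layer is applied explicitly here so the result does not depend on the platform's default layers; bytes_written_via_crlf is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write text through a :crlf layer, then read the
# file back in raw mode and return the bytes that actually reached disk.
sub bytes_written_via_crlf {
    my ($text) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh, ':crlf';    # request CRLF translation explicitly
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";

    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

my $text  = "Hello, world\n";
my $bytes = bytes_written_via_crlf($text);

printf "in the program: %d characters\n", length($text);
printf "on disk:        %d bytes, ending in %s\n",
    length($bytes),
    join ' ', map { sprintf '%02X', ord } split //, substr($bytes, -2);
```

The string is 13 characters long in the program and 14 bytes long on disk, with the last two bytes being 0D 0A.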

It would be better to use the platform-independent $/.

No, it wouldn't. And even if it were better, how would someone new to Perl ever figure that out? I've been programming in Perl for years and I've never once seen $/ used in place of the usual and ordinary \n. chomp()-ing and "...\n"-ing are the long-lived and ubiquitous standard idioms.

#!perl
print "Hello, world\n";
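One further consideration, offered as a sketch rather than a definitive argument: $/ is the input record separator, and code routinely localizes it (for example to undef, to slurp a file). Output that depends on $/ silently changes when that happens, while "\n" does not. The two helper names below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: terminate a line with $/ versus with "\n".
sub terminate_with_irs     { my ($text) = @_; return $text . ($/ // ''); }
sub terminate_with_newline { my ($text) = @_; return "$text\n"; }

# By default $/ is "\n", so the two agree.
print "default: ",
    (terminate_with_irs('x') eq terminate_with_newline('x') ? "same" : "different"),
    "\n";

{
    local $/;    # slurp mode: $/ is undef inside this block
    print "slurp mode: ",
        (terminate_with_irs('x') eq terminate_with_newline('x') ? "same" : "different"),
        "\n";
}
```

The first comparison reports "same" and the second "different", since the $/-based line loses its terminator while $/ is undef.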

Except for ASCII files, binmode($file_handle) was required on MSWin32 systems. :raw performs the same function so, while perhaps appearing to add to the chicanery, it certainly reduces the amount of code.

But this is the whole point. The file named Input.txt is not a binary file; it's a plain text file. All the Unicode files I want to manipulate on Microsoft Windows using Perl, the text-processing scripting language, are plain text files. binmode() and :raw are lies. Chicanery.

In my humble opinion, this should work on a Unicode UTF-16 file with a byte order mark.

#!perl

use strict;
use warnings;

open my $input_fh,  '<', 'Input.txt'  or die "Can't open Input.txt: $!";
open my $output_fh, '>', 'Output.txt' or die "Can't open Output.txt: $!";

while (my $line = <$input_fh>) {
    chomp $line;
    print $output_fh "$line\n";
}

It seems perfectly reasonable to me to expect the scripting language to determine the character encoding of the file all by its little lonesome — it only has to read the first two bytes of the file — and just to do the right thing.
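To show how little is involved, here is a minimal sketch of that two-byte check done by hand, followed by an open that uses the result. The helper names sniff_utf16_bom, open_utf16_text and read_utf16_lines are hypothetical, and the sketch assumes the file really is UTF-16 with a byte order mark; note that a UTF-32LE file also begins with FF FE, so a two-byte test alone cannot tell the two apart.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: map the first two bytes of a file to an encoding name.
# Returns nothing if no UTF-16 byte order mark is found.
sub sniff_utf16_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $bom, 2;
    close $fh;
    return unless defined $got && $got == 2;
    return 'UTF-16LE' if $bom eq "\xFF\xFE";
    return 'UTF-16BE' if $bom eq "\xFE\xFF";
    return;
}

# Hypothetical helper: open a BOM-marked UTF-16 text file for line reading.
sub open_utf16_text {
    my ($path) = @_;
    my $encoding = sniff_utf16_bom($path)
        or die "No UTF-16 byte order mark found in $path\n";
    open my $fh, "<:raw:perlio:encoding($encoding):crlf", $path
        or die "Cannot open $path: $!";
    read $fh, my $bom_char, 1;    # skip the decoded BOM character (U+FEFF)
    return $fh;
}

# Hypothetical helper: return the chomped lines of such a file.
sub read_utf16_lines {
    my ($path) = @_;
    my $fh = open_utf16_text($path);
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

# Demonstration on a temporary little-endian file with CRLF line endings.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xFF\xFE" . encode('UTF-16LE', "one\r\ntwo\r\n");
close $tmp_fh or die "Cannot close $tmp_path: $!";

print "detected: ", sniff_utf16_bom($tmp_path), "\n";
print "line: [$_]\n" for read_utf16_lines($tmp_path);
```

With this in place the read loop itself is the ordinary chomp-and-print idiom; all the layer handling is confined to open_utf16_text.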
