Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows

by Jim (Curate)
on Oct 30, 2010 at 18:36 UTC ( #868492=note: print w/ replies, xml ) Need Help??


in reply to Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows
in thread Chicanery Needed to Handle Unicode Text on Microsoft Windows

As far as I can see, the only reason you need :crlf is because you've specifically added the UNIX line ending (\n) to your output.

:crlf is needed here to get the same platform-independent line-ending handling of plain text files Perl has always supported. Without it, the line-ending handling is badly broken. Half of the line-ending character pair CRLF is missed.

D:\>cat Demo.pl #!perl use strict; use warnings; open my $input_fh, '<:raw:perlio:encoding(UTF-16LE)', 'Input.txt'; while (my $line = <$input_fh>) { chomp $line; print "There's an unexpected/unwanted CR at the end of the line\n" if $line =~ m/\r$/; } D:\>file Input.txt Input.txt: Text file, Unicode little endian format D:\>cat Input.txt We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America. D:\>perl Demo.pl Input.txt There's an unexpected/unwanted CR at the end of the line There's an unexpected/unwanted CR at the end of the line There's an unexpected/unwanted CR at the end of the line There's an unexpected/unwanted CR at the end of the line There's an unexpected/unwanted CR at the end of the line D:\>

And as Anonymous Monk has already pointed out, \n is the express mechanism in Perl intended to make line-ending handling platform-independent. It is defined not to mean the LF-only Unix line-ending, but rather to mean whatever the line-ending character or character combination terminates lines of plain text files on the platform in use.

It would be better to use the platform-independent $/.

No it wouldn't. And even if it were better, how would someone new to Perl ever figure that out. I've been programming Perl for years and I've never once seen $/ used in place of the usual and ordinary \n. chomp()-ing and "...\n"-ing are the long-lived and ubiquitous standard idioms.

#!perl print "Hello, world\n";
Except for ASCII files, binmode($file_handle) was required on MSWin32 systems. :raw performs the same function so, while perhaps appearing to add to the chicanery, it certainly reduces the amount of code.

But this is the whole point. The file named Input.txt is not a binary file; it's a plain text file. All the Unicode files I want to manipulate on Microsoft Windows using Perl, the text-processing scripting language, are plain text files. binmode() and :raw are lies. Chicanery.

In my humble opinion, this should work on a Unicode UTF-16 file with a byte order mark.

#!perl use strict; use warnings; open my $input_fh, '<', 'Input.txt'; open my $output_fh, '>', 'Output.txt'; while (my $line = <$input_fh>) { chomp $line; print $output_fh "$line\n"; }

It seems perfectly reasonable to me to expect the scripting language to determine the character encoding of the file all by its little lonesome it only has to read the first two bytes of the file and just to do the right thing.


Comment on Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
Select or Download Code
Re^3: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by kcott (Abbot) on Oct 31, 2010 at 02:07 UTC

    The documentation on :raw says that CRLF conversion is turned off. It appears that \n in the print statement is represented as CRLF before the arguments to print enter the output stream so \n can be used as normal.

    Changing $/ to \n in my tests (not surprisingly) produces the same results.

    -- Ken

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://868492]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (5)
As of 2014-12-28 14:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (181 votes), past polls