http://www.perlmonks.org?node_id=11138556


in reply to Re^2: Dealing with files with differing line endings
in thread Dealing with files with differing line endings

As a practical matter, I am sure that you are right. However, it is important to know that there are corner cases. Consider the following contrived example.
use strict;
use warnings;
use Test::More tests=>1;

my $file = \do{ "This \n is not the end of a line on windows\r\n" };
open my $fh1, '<:raw', $file;
my $chars_read = length(<$fh1>);
close $fh1;
my $chars_expected=47;
is( $chars_read, $chars_expected, 'record length' );

OUTPUT:

1..1
not ok 1 - record length
#   Failed test 'record length'
#   at nl.pl line 15.
#          got: '6'
#     expected: '47'
# Looks like you failed 1 test of 1.

Unfortunately, my solution (use :crlf instead of :raw) does not work either.

Bill

Re^4: Dealing with files with differing line endings
by Marshall (Canon) on Nov 11, 2021 at 21:19 UTC
    On Windows, "This \n is not the end of a line on windows\r\n" will be written as:
    "This <CR><LF> is not the end of a line on windows<CR><CR><LF>". On output, on a Windows platform, each "\n" is translated into two characters <CR><LF>.

    If you read with the standard Windows I/O layer, which I think is :crlf, the first <CR><LF> will be translated to just <LF>, and the <CR><CR><LF> will become <CR><LF>. So, yes, the first "\n" in your example is indeed recognized as an end of line on Windows. Length 6 is correct (4+1+1): "This" = 4, a space = 1, <LF> = 1 character.

    I've never used the :raw layer and there appears to be some weirdness here. I'm not sure what "read a "line" with :raw" actually means? It looks to me like it removed the <CR> just like the normal layer would. When reading binary data, I have always set binmode and read up to a requested number of bytes into a buffer without the concept of a "line ending".

    If you change your example to have the first "\n" be a "\r", then indeed that single <CR> will not be recognized as an "end of line" by Windows.

    When I came across an ancient Mac <CR> line terminated file, the problem was apparent because Windows read the whole file as one line. I just talked with the user, explained the issue and a simple editor setting solved the problem.

    I never did come up with a "clean, high performance" way to handle any variation of Mac, Unix and Windows line endings easily. I just don't support old Mac and that is good enough for my users.

      I've never used the :raw layer and there appears to be some weirdness here. I'm not sure what "read a "line" with :raw" actually means? It looks to me like it removed the <CR> just like the normal layer would. When reading binary data, I have always set binmode and read up to a requested number of bytes into a buffer without the concept of a "line ending".

      Setting just the :raw layer should be equivalent to setting binmode (if not, either the documentation in PerlIO or the implementation is broken), and should result in a completely unmodified byte stream between perl and whatever is at the other end of the handle (file, socket, pipe, ...). readline will then behave as on Unix: neither \r nor \n is translated in any way when writing, and neither <CR> nor <LF> is translated in any way when reading. As a result, perl treats <LF> as \n when reading, so text with <CR><LF> line endings is read as ending with \r\n; chomp with the default $/ of "\n" will chew off just the \n and leave the line with a trailing \r (read as <CR>).

      See also "newlines" in perlport, and ":raw" in PerlIO.
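To make the trailing \r visible, here is a minimal sketch using an in-memory file opened with :raw. The sample data and the <CR> marker used for display are my own choices for illustration, not anything from the original post.

```perl
use strict;
use warnings;

# In-memory "file": one CRLF-terminated line, then one LF-terminated line.
my $data = "dos line\r\nunix line\n";

open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
my @lines = <$fh>;      # with :raw, readline still splits on "\n" (the default $/)
close $fh;

for my $line (@lines) {
    chomp $line;                            # removes only the trailing "\n"
    (my $visible = $line) =~ s/\r/<CR>/g;   # make any leftover CR printable
    print "$visible\n";
}
```

The first line should still carry its \r after chomp, the second should not.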

      I never did come up with a "clean, high performance" way to handle any variation of Mac, Unix and Windows line endings easily. I just don't support old Mac and that is good enough for my users.

      $text=~s/\r\n/\n/g; $text=~s/\r/\n/g; should convert CRLF (DOS) to just LF (Unix), then bare CR (Mac) to just LF, resulting in $text having \n exactly at every line ending, no matter if the input had Mac, DOS, or Unix line endings. In a file with mixed line endings (which is IMHO a broken file), this may accidentally remove a few empty lines.
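A small sketch of those two substitutions wrapped in a sub, applied to one sample per line-ending style. The name normalize_newlines and the sample strings are hypothetical, and the sketch assumes $text was read without any translating layer (for example via :raw).

```perl
use strict;
use warnings;

# Hypothetical helper; the two substitutions are the ones quoted above.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;   # CRLF (DOS/Windows) -> LF
    $text =~ s/\r/\n/g;     # any remaining bare CR (old Mac) -> LF
    return $text;
}

for my $sample ("unix\nfile\n", "dos\r\nfile\r\n", "mac\rfile\r") {
    my @lines = split /\n/, normalize_newlines($sample);
    print scalar(@lines), " lines: @lines\n";
}
```

The order matters: running the bare-CR substitution first would turn every CRLF into two line feeds.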

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        I experimented more with this (code attached below). Answering my own question: when reading a line using "<:raw", Perl is looking for a <LF> to determine the "end of line". This is the same thing that it does with the :crlf layer. The difference is that with :raw, the <CR> (if any) immediately before the <LF> is not removed.

        The terminology does get confusing because "\n" as written in Perl on Windows sometimes means <CR><LF> and sometimes it means only <LF>.

        With the normal I/O layer, the <CR> in <CR><LF> will be removed before your Perl code ever sees the line. chomp() only operates on <LF>, not <CR><LF>.

        Running two regexes as you suggest is not necessary, because the standard I/O layer already does the first one, $text=~s/\r\n/\n/g; (remove any <CR> that immediately precedes a <LF>). Translating the remaining <CR> to <LF> would then get the multiple lines contained within the input string into "normal line format".

        So the rub here is that there is no easy way to say "give me a line" regardless of old Mac, Unix or Windows line endings. $/, the input record separator, is a string, not a regex. When you attempt to read a line from a file with <CR> terminated lines, you will get the entire file, not just one line, because readline is looking for <LF>. Having in effect slurped the entire file into one string variable, you can indeed split it up into "real lines". However, we have now altered the program flow from reading a line at a time to reading the whole file into a buffer, modifying that buffer (perhaps with tr instead of a regex) and then reading that buffer a line at a time.
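For illustration, the slurp-then-split flow described above might look like the sketch below. The helper name read_lines_any_eol is hypothetical, the handle is assumed to be opened with :raw, and the whole input ends up in memory, which is exactly the cost mentioned.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a raw handle, then split on CRLF, bare CR, or LF.
sub read_lines_any_eol {
    my ($fh) = @_;
    local $/;                       # undef $/ puts readline into slurp mode
    my $content = <$fh>;
    return () unless defined $content;
    # Try \r\n first so a CRLF pair counts as a single terminator.
    # Note: split drops trailing empty fields, so blank lines at EOF are lost.
    return split /\r\n|\r|\n/, $content;
}

my $data = "one\rtwo\r\nthree\n";   # deliberately mixed endings
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
my @lines = read_lines_any_eol($fh);
close $fh;

print join('|', @lines), "\n";
```

Since split takes a regex while $/ does not, this is the simplest way to accept all three endings, at the price of giving up line-at-a-time reading.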

        Anyway, I did not see the need to burden the 99.99999% code with special stuff for this ancient Mac. There are also some memory issues with reading entire files into memory to process them when line by line processing is desired. It would also be possible to read part of the file, determine that \r should be the input record separator, then back up and use that. But that is "complicated".
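The "read part of the file, then back up" idea could be sketched roughly as follows. The helper name guess_eol and the 4096-byte peek size are assumptions of mine, the handle must be seekable and opened with :raw, and a <CR> that happens to be the last byte of the peeked chunk would be misjudged as a Mac ending.

```perl
use strict;
use warnings;

# Hypothetical helper: peek at the start of a raw, seekable handle and
# return the first line terminator found ("\r\n", "\r" or "\n").
sub guess_eol {
    my ($fh, $peek_bytes) = @_;
    $peek_bytes ||= 4096;           # assumed peek size, not a tuned value
    my $pos = tell $fh;
    defined read($fh, my $buf, $peek_bytes) or die "Cannot read: $!";
    seek $fh, $pos, 0 or die "Cannot seek: $!";     # back up to where we started
    return $1 if $buf =~ /(\r\n|\r|\n)/;
    return "\n";                    # no terminator seen: fall back to LF
}

my $data = "alpha\rbeta\rgamma\r";  # old Mac style sample data
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";

my @lines;
{
    local $/ = guess_eol($fh);      # readline and chomp both honour $/
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, $line;
    }
}
close $fh;

print join('|', @lines), "\n";
```

This keeps line-by-line processing and bounded memory use, but it does nothing for files with mixed endings.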

        I'm not working with Unix at the moment. But from memory, Perl code to read files line by line between Unix and Windows is the same. When reading a Windows file on Unix, the I/O layer zaps the <CR> and I never see it. When Windows reads a Unix file, it doesn't care that the <CR> isn't there. When writing a line on Unix, Perl writes a <LF> for "\n". When writing a line on Windows, Perl writes a <CR><LF> for "\n".

        Mixed line ending files can happen. When I was working on Unix, my environment allowed me to click on a remote Unix file and edit it with my local Windows editor. Only the lines that I modified wound up with <CR><LF> endings. My editor preserved the existing <LF> terminated lines. Perl and GNU C didn't have an issue with this and I didn't really worry about it. LPR was fussy. I had some simple Perl thing that read a line, chomped it, then printed the line with "\n" (which on output is platform specific). Now that I think about it, it could be that chomp() was unnecessary: the read of the <CR><LF> line would have zapped the <CR> already, and there would be no need to remove the <LF> only to add it back in.

        Unix and Windows have <LF> in common and that works well. Ancient Mac with <CR> is a "weird duck".

        use strict;
        use warnings;

        ### setup input file   <CR>=0d <LF>=0a
        open (my $fh,'>',"testfilein.txt") or die "$!";
        print $fh "bbb \naaa \r\n";   #62626220 0d0a 61616120 0d0d0a
        close $fh;                    #note spaces are for human reading

        open ($fh,'>',"testfileout.txt") or die "$!";
        binmode $fh;

        #read from input with std <>, write binary to output file
        open (my $fh2, '<', "testfilein.txt") or die "$!";
        while (my $line = <$fh2>)
        {
            print length($line),'_', $line, '|';
            print $fh $line;          #62626220 0a 61616120 0d0a
        }
        close $fh2;
        close $fh;

        print "*** run two...\n";
        # use same read file
        # this time use :raw layer for reading
        # A line ends in <LF> like above, but the <CR> before it
        # (if any) is not removed.
        open ($fh,'>',"testfileout2.txt") or die "$!";
        binmode $fh;

        open ($fh2, '<:raw', "testfilein.txt") or die "$!";
        while (my $line = <$fh2>)
        {
            print length($line),'_', $line, '|';
            print $fh $line;          #62626220 0d0a 61616120 0d0d0a
        }
        __END__
        5_bbb             "bbb "+LF      4+1=5
        |6_aaa            "aaa "+CRLF    4+2=6
        |*** run two...
        6_bbb             "bbb "+CRLF    4+2=6
        |7_aaa            "aaa "+CRCRLF  4+3=7