in reply to Dealing with files with differing line endings

A general solution is impossible: any file can contain ordinary text characters that another OS would interpret as line separators. You may be able to assume that this will never happen with your data. Your idea of slurping the entire file (in binmode) into a string is probably the safest. Use anything you know about the file (line length, number of lines, words that only occur at the start or end of a line, etc.) to determine which kind of file it is. Then open the string as an in-memory file with the appropriate I/O layer, and use the <> operator exactly as you normally would.
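A minimal sketch of that approach, under stated assumptions: the helper names (slurp_raw, guess_eol, read_lines) are hypothetical, and the detection rule here is a simple majority count of terminators rather than the file-specific knowledge suggested above. Since there is no standard layer for <CR>-only files, the sketch sets $/ to the detected terminator instead of choosing a layer.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a whole file without any translation.
sub slurp_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # undef $/ means "read everything"
    my $data = <$in>;
    close $in;
    return $data;
}

# Hypothetical helper: guess the dominant line ending by counting
# CRLF pairs, lone CRs, and lone LFs. This is a heuristic only.
sub guess_eol {
    my ($data) = @_;
    my $crlf = () = $data =~ /\r\n/g;
    my $cr   = () = $data =~ /\r(?!\n)/g;
    my $lf   = () = $data =~ /(?<!\r)\n/g;
    return "\r\n" if $crlf > 0 && $crlf >= $cr && $crlf >= $lf;
    return "\r"   if $cr > $lf;
    return "\n";
}

# Hypothetical helper: open the string as an in-memory file and read
# it line by line with $/ set to the guessed terminator.
sub read_lines {
    my ($data) = @_;
    my $eol = guess_eol($data);
    open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
    local $/ = $eol;
    my @lines = <$fh>;
    close $fh;
    chomp @lines;                  # chomp strips the current value of $/
    return @lines;
}
```

Typical use would be something like `my @lines = read_lines(slurp_raw($path));`. A file with mixed endings will still be split on whichever terminator wins the count, so the heuristic only fits data where one convention dominates.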

Re^2: Dealing with files with differing line endings
by Marshall (Canon) on Nov 06, 2021 at 23:20 UTC
    We may be overthinking this. ikegami's solution should be fine. The exception is the classic Mac OS, which uses a lone <CR> instead of <CR><LF> or <LF> for line endings. One of my users was using an old Mac to edit one of my config files and reported that my config file "didn't work". I talked with him and told him to set his text editor to "write DOS compatible files", and that ended the problem. Modern Macs use <LF>. Unless there is a specific strange requirement, writing code to handle the ancient Mac format is not worth the effort.
      As a practical matter, I am sure that you are right. However, it is important to know that there are corner cases. Consider the following contrived example.
      use strict;
      use warnings;
      use Test::More tests => 1;

      my $file = \do { "This \n is not the end of a line on windows\r\n" };
      open my $fh1, '<:raw', $file;
      my $chars_read = length(<$fh1>);
      close $fh1;
      my $chars_expected = 47;
      is( $chars_read, $chars_expected, 'record length' );


      1..1
      not ok 1 - record length
      # Failed test 'record length'
      # at line 15.
      # got: '6'
      # expected: '47'
      # Looks like you failed 1 test of 1.

      Unfortunately, my solution (use :crlf instead of :raw) does not work either.

        On Windows, "This \n is not the end of a line on windows\r\n" will be written as:
        "This <CR><LF> is not the end of a line on windows<CR><CR><LF>". On output, on a Windows platform, each "\n" is translated into two characters <CR><LF>.

        If you read with the standard Windows I/O layer, which I think is :crlf, the first <CR><LF> will be translated to just <LF>. The <CR><CR><LF> will result in <CR><LF>. So, yes, the first "\n" in your example is indeed recognized as "end of line on Windows". Length 6 is correct (4+1+1): "This" = 4, a space = 1, <LF> = 1 characters.
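The round trip described above can be tried on any platform with in-memory files, as a rough sketch. The helper names (crlf_encode, crlf_first_line) are hypothetical, and the sketch assumes that pushing :crlf explicitly onto an in-memory handle behaves like the default text layer on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: write text through the :crlf layer and return
# the bytes that would end up on disk.
sub crlf_encode {
    my ($text) = @_;
    open my $out, '>:crlf', \my $bytes or die "Cannot open for write: $!";
    print {$out} $text;
    close $out;
    return $bytes;
}

# Hypothetical helper: read the first "line" back through :crlf.
sub crlf_first_line {
    my ($bytes) = @_;
    open my $in, '<:crlf', \$bytes or die "Cannot open for read: $!";
    my $line = <$in>;
    close $in;
    return $line;
}

my $bytes = crlf_encode("This \n is not the end of a line on windows\r\n");
my $first = crlf_first_line($bytes);
printf "first record is %d characters long\n", length $first;
```

Dumping $bytes with something like `(my $shown = $bytes) =~ s/\r/<CR>/g;` makes the doubled <CR> at the end of the second record visible.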

        I've never used the :raw layer and there appears to be some weirdness here. I'm not sure what reading a "line" with :raw actually means. It looks to me like it removed the <CR> just like the normal layer would. When reading binary data, I have always set binmode and read up to a requested number of bytes into a buffer, without the concept of a "line ending".
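For comparison, the binmode-and-buffer style mentioned above might look like the following sketch. The helper name read_chunks and the chunk size are illustrative assumptions, not anything from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: read a handle in fixed-size chunks. No line
# ending is ever interpreted; <CR> and <LF> are just bytes.
sub read_chunks {
    my ($fh, $size) = @_;
    binmode $fh;
    my @chunks;
    while (1) {
        my $n = read($fh, my $buf, $size);
        die "read failed: $!" unless defined $n;
        last if $n == 0;           # 0 means end of file
        push @chunks, $buf;
    }
    return @chunks;
}

my $data = "ab\r\ncd";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @chunks = read_chunks($fh, 4);
close $fh;
printf "%d chunks read\n", scalar @chunks;
```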

        If you change your example to have the first "\n" be a "\r", then indeed that single <CR> will not be recognized as an "end of line" by Windows.

        When I came across an ancient Mac <CR> line terminated file, the problem was apparent because Windows read the whole file as one line. I just talked with the user, explained the issue and a simple editor setting solved the problem.

        I never did come up with a "clean, high performance" way to handle any variation of Mac, Unix and Windows line endings easily. I just don't support old Mac and that is good enough for my users.