PerlMonks  

Re^2: Dealing with files with differing line endings

by Marshall (Canon)
on Nov 06, 2021 at 23:20 UTC (#11138532)


in reply to Re: Dealing with files with differing line endings
in thread Dealing with files with differing line endings

We may be overthinking this. ikegami's solution should be fine. The one exception is the ancient Mac convention, which uses a bare <CR> for line endings instead of <CR><LF> or <LF>. One of my users was editing one of my config files on an old Mac and reported that the config file "didn't work". I talked with him and told him to set his text editor to "write DOS compatible files", and that ended the problem. Modern Macs use <LF>. Unless there is a specific, unusual requirement, writing code to handle ancient Mac files is not worth the effort.
Re^3: Dealing with files with differing line endings
by BillKSmith (Monsignor) on Nov 07, 2021 at 21:22 UTC
    As a practical matter, I am sure that you are right. However, it is important to know that there are corner cases. Consider the following contrived example.
    use strict;
    use warnings;
    use Test::More tests=>1;
    my $file = \do{ "This \n is not the end of a line on windows\r\n" };
    open my $fh1, '<:raw', $file;
    my $chars_read = length(<$fh1>);
    close $fh1;
    my $chars_expected=47;
    is( $chars_read, $chars_expected, 'record length' );

    OUTPUT:

    1..1
    not ok 1 - record length
    #   Failed test 'record length'
    #   at nl.pl line 15.
    #          got: '6'
    #     expected: '47'
    # Looks like you failed 1 test of 1.

    Unfortunately, my solution (use :crlf instead of :raw) does not work either.

    Bill
      On Windows, "This \n is not the end of a line on windows\r\n" will be written as "This <CR><LF> is not the end of a line on windows<CR><CR><LF>". On output, on a Windows platform, each "\n" is translated into the two characters <CR><LF>.

      If you read with the standard Windows I/O layer, which I think is :crlf, the first <CR><LF> will be translated to just <LF>, and the <CR><CR><LF> will come back as <CR><LF>. So, yes, the first "\n" in your example is indeed recognized as an end of line on Windows. Length 6 is correct (4+1+1): "This" is 4 characters, the space is 1, and <LF> is 1.
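      The translation described above can be checked on any platform by asking for the :crlf layer explicitly. This is only a minimal sketch: the scratch file comes from File::Temp, the variable names are made up for illustration, and it assumes that an explicit :crlf layer behaves the way Windows text mode does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "This \n is not the end of a line on windows\r\n";

# Hypothetical scratch file; File::Temp removes it when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write through :crlf so that every "\n" goes out as <CR><LF>.
open my $out, '>:crlf', $path or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the bytes back untouched to see what actually landed on disk.
open my $raw, '<:raw', $path or die "open raw: $!";
my $on_disk = do { local $/; <$raw> };
close $raw;

# Read line by line through :crlf, as a text-mode read would.
open my $in, '<:crlf', $path or die "open crlf: $!";
my $first_line  = <$in>;
my $second_line = <$in>;
close $in;

# Make the control characters visible before printing.
(my $visible = $on_disk) =~ s/\r/<CR>/g;
$visible =~ s/\n/<LF>/g;
print "on disk: $visible\n";
print "first line length: ", length($first_line), "\n";
```

      Comparing $on_disk with $second_line shows where the extra <CR> comes from: the "\r" that was already in the string is kept, and the layer adds another one in front of the <LF> on the way out.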

      I've never used the :raw layer and there appears to be some weirdness here. I'm not sure what "read a "line" with :raw" actually means? It looks to me like it removed the <CR> just like the normal layer would. When reading binary data, I have always set binmode and read up to a requested number of bytes into a buffer without the concept of a "line ending".

      If you change your example to have the first "\n" be a "\r", then indeed that single <CR> will not be recognized as an "end of line" by Windows.

      When I came across an ancient Mac <CR> line terminated file, the problem was apparent because Windows read the whole file as one line. I just talked with the user, explained the issue and a simple editor setting solved the problem.
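      The "whole file as one line" symptom is easy to reproduce, and for the occasional old Mac file the input record separator $/ can be pointed at <CR>. A rough sketch, assuming the data is known to be purely <CR>-terminated; read_records is a hypothetical helper and the sample data is invented:

```perl
use strict;
use warnings;

# Invented sample data with old-Mac-style <CR> line endings.
my $mac_text = "alpha\rbeta\rgamma\r";

# read_records is a hypothetical helper: it reads every record from an
# in-memory copy of the data using the given separator and strips the
# separator from each record.
sub read_records {
    my ($data, $separator) = @_;
    local $/ = $separator;               # restored when the sub returns
    open my $fh, '<:raw', \$data or die "open: $!";
    my @records = <$fh>;
    close $fh;
    chomp @records;                      # chomp removes the current $/
    return @records;
}

my @with_lf = read_records($mac_text, "\n");   # the usual separator
my @with_cr = read_records($mac_text, "\r");   # old Mac separator

print "records with LF separator: ", scalar(@with_lf), "\n";
print "records with CR separator: ", scalar(@with_cr), "\n";
```

      This only helps when the file format is known in advance; it does nothing for files that mix conventions.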

      I never did come up with a "clean, high performance" way to handle any variation of Mac, Unix and Windows line endings easily. I just don't support old Mac and that is good enough for my users.

        I've never used the :raw layer and there appears to be some weirdness here. I'm not sure what "read a "line" with :raw" actually means? It looks to me like it removed the <CR> just like the normal layer would. When reading binary data, I have always set binmode and read up to a requested number of bytes into a buffer without the concept of a "line ending".

        Setting just the :raw layer should be equivalent to calling binmode (if not, either the documentation in PerlIO or the implementation is broken), resulting in a completely unmodified byte stream between perl and whatever is at the other end of the handle (file, socket, pipe, ...). readline will then behave as on Unix: neither \r nor \n is translated in any way when writing, and neither <CR> nor <LF> is translated in any way when reading. Perl therefore treats <LF> as \n when reading, so text with <CR><LF> line endings is read as ending in \r\n; chomp with the default $/ of "\n" will chew off just the \n and leave the line with a trailing \r (read as <CR>).
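        A small illustration of the trailing \r, using an in-memory filehandle so that no real file is needed. The sample data and variable names are invented for the example:

```perl
use strict;
use warnings;

# Invented sample data with DOS-style <CR><LF> line endings.
my $data = "first\r\nsecond\r\n";

# :raw means no translation at all: readline still splits on "\n",
# but the "\r" in front of it stays in the record.
open my $fh, '<:raw', \$data or die "open: $!";
my $line = <$fh>;
close $fh;

# chomp with the default $/ of "\n" removes only the "\n".
chomp(my $chomped = $line);

# One way to strip either ending: an optional "\r", then "\n", at the end.
(my $clean = $line) =~ s/\r?\n\z//;

printf "after chomp: %d characters\n", length $chomped;
printf "after s///:  %d characters\n", length $clean;
```

        The leftover \r is invisible when printed to a terminal, which is why this kind of bug usually shows up as string comparisons or hash lookups that mysteriously fail.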

        See also "newlines" in perlport, and ":raw" in PerlIO.

        I never did come up with a "clean, high performance" way to handle any variation of Mac, Unix and Windows line endings easily. I just don't support old Mac and that is good enough for my users.

        $text=~s/\r\n/\n/g; $text=~s/\r/\n/g; should convert CRLF (DOS) to just LF (Unix), then bare CR (Mac) to just LF, so that $text has \n at exactly every line ending, no matter whether the input had Mac, DOS, or Unix line endings. In a file with mixed line endings (which is IMHO a broken file), this may accidentally remove a few empty lines.
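        The two substitutions can be wrapped in a small function to try them against each convention, including the mixed case that loses an empty line. normalize_eol is a hypothetical name and the sample strings are made up; treat this as a sketch rather than a tested library routine:

```perl
use strict;
use warnings;

# normalize_eol is a hypothetical helper name, not something from the thread.
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;   # CRLF (DOS/Windows) -> LF
    $text =~ s/\r/\n/g;     # any remaining bare CR (old Mac) -> LF
    return $text;
}

my $dos  = normalize_eol("one\r\ntwo\r\n");
my $mac  = normalize_eol("one\rtwo\r");
my $unix = normalize_eol("one\ntwo\n");

# Mixed endings: a Mac-style line followed by an empty Unix-style line.
# The <CR> and the <LF> end up adjacent and are taken for a single CRLF,
# so the empty line disappears.
my $mixed = normalize_eol("one\r" . "\n" . "two\n");

my @lines = split /\n/, $dos;
print scalar(@lines), " lines: @lines\n";
```

        The order of the two substitutions matters: replacing bare \r first would turn every CRLF into two line endings.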

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
