
Re^5: Dealing with files with differing line endings

by afoken (Canon)
on Nov 12, 2021 at 13:54 UTC

in reply to Re^4: Dealing with files with differing line endings
in thread Dealing with files with differing line endings

I've never used the :raw layer and there appears to be some weirdness here. I'm not sure what "read a "line" with :raw" actually means? It looks to me like it removed the <CR> just like the normal layer would. When reading binary data, I have always set binmode and read up to a requested number of bytes into a buffer without the concept of a "line ending".

Setting just the :raw layer should be equivalent to calling binmode (if not, either the documentation in PerlIO or the implementation is broken), resulting in a completely unmodified byte stream between perl and whatever is at the other end of the handle (file, socket, pipe, ...). readline will then behave as on Unix: neither \r nor \n is translated in any way when writing, and neither <CR> nor <LF> is translated in any way when reading. Perl therefore treats <LF> as \n when reading, so text with <CR><LF> line endings is read as ending in \r\n. chomp with the default $/ of "\n" will chew off just the \n and leave the line with a trailing \r (read as <CR>).
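A minimal sketch of that behaviour, assuming an in-memory file handle is an acceptable stand-in for a real file (the sample data and variable names are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample data: one Unix line and one DOS line.
my $bytes = "unix line\ndos line\r\n";

# Read through an in-memory handle with the :raw layer,
# so no byte is translated on the way in.
open(my $fh, '<:raw', \$bytes) or die "Cannot open in-memory handle: $!";

my @lines;
while (my $line = <$fh>) {    # readline splits on $/, which defaults to "\n"
    chomp $line;              # removes the trailing "\n" only
    push @lines, $line;
}
close $fh;

# Print each line with any leftover carriage return made visible.
for my $line (@lines) {
    (my $visible = $line) =~ s/\r/<CR>/g;
    print "[$visible]\n";
}
```

The second line keeps its trailing \r after chomp, which is exactly the leftover <CR> described above.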

See also "newlines" in perlport, and ":raw" in PerlIO.

I never did come up with a "clean, high performance" way to handle any variation of Mac, Unix and Windows line endings easily. I just don't support old Mac and that is good enough for my users.

$text=~s/\r\n/\n/g; $text=~s/\r/\n/g; should first convert CRLF (DOS) to just LF (Unix), then any remaining bare CR (Mac) to just LF, resulting in $text having \n at exactly every line ending, no matter if the input had Mac, DOS, or Unix line endings. In a file with mixed line endings (which is IMHO a broken file), this may accidentally remove a few empty lines, because a Mac-style CR directly followed by a Unix-style LF looks like a single CRLF.
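As a small sketch, the two substitutions can be wrapped in a helper and tried on each style. The name normalize_newlines and the sample strings are assumptions for illustration, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions from the post.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;    # CRLF (DOS) -> LF
    $text =~ s/\r/\n/g;      # remaining bare CR (old Mac) -> LF
    return $text;
}

my $dos  = normalize_newlines("one\r\ntwo\r\n");
my $mac  = normalize_newlines("one\rtwo\r");
my $unix = normalize_newlines("one\ntwo\n");

# Mixed file: a Mac-style "one\r" followed by an empty Unix-style line "\n".
# The CR and LF are adjacent, so they collapse into one line ending
# and the empty line disappears.
my $mixed = normalize_newlines("one\r" . "\n" . "two\n");

for my $result ($dos, $mac, $unix, $mixed) {
    print $result eq "one\ntwo\n" ? "normalized\n" : "different\n";
}
```

All four inputs end up as the same string, which also shows the empty line lost in the mixed case.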


Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re^6: Dealing with files with differing line endings
by Marshall (Canon) on Nov 15, 2021 at 20:33 UTC
    I experimented more with this (code attached below). Answering my own question: when reading a line using "<:raw", Perl is looking for a <LF> to determine the "end of line". This is the same thing that it does with the :crlf layer. The difference is that with :raw, the <CR> (if any) immediately before the <LF> is not removed.

    The terminology does get confusing because "\n" as written in Perl on Windows sometimes means <CR><LF> and sometimes it means only <LF>.

    With the normal I/O layer, the <CR> in <CR><LF> will be removed before your Perl code ever sees the line. chomp() only operates on <LF>, not <CR><LF>.

    Running two regexes as you suggest is not necessary, because the standard I/O layer already does this part: $text=~s/\r\n/\n/g; (remove any <CR> that immediately precedes a <LF>). Translating <CR> to <LF> would get the multiple lines contained within the input string into "normal line format".

    So the rub here is that there is no easy way to say "give me a line" regardless of whether the file is old Mac, Unix or Windows. $/, the input record separator, is a string, not a regex. When you attempt to read a line from a file with <CR> terminated lines, you will get the entire file, not just one line, because readline is looking for <LF>. Having in effect slurped the entire file into one string variable, you can indeed split it up into "real lines". However, we have now altered the program flow from reading a line at a time to reading the whole file into a buffer, modifying that buffer (perhaps with tr instead of a regex) and then reading that buffer a line at a time.

    Anyway, I did not see the need to burden 99.99999% of the code with special stuff for this ancient Mac. There are also some memory issues with reading entire files into memory to process them when line by line processing is desired. It would also be possible to read part of the file, determine that \r should be the input record separator, then back up and use that. But that is "complicated".
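The "read part of the file, then back up" idea could be sketched roughly as below. This is only an illustration under stated assumptions: guess_separator is a made-up helper name, the handle must be seekable and opened with :raw, and the first line ending found in the peeked chunk is taken as representative of the whole file.

```perl
use strict;
use warnings;

# Hypothetical helper: peek at the next bytes of an open handle, guess the
# line ending from the first one found, then seek back to where we started.
# Limitation: if the peeked chunk happens to end between the CR and the LF
# of a CRLF pair, this guesses "\r".
sub guess_separator {
    my ($fh, $peek_bytes) = @_;
    $peek_bytes ||= 4096;

    my $start = tell($fh);
    my $chunk = '';
    defined(read($fh, $chunk, $peek_bytes)) or die "read failed: $!";
    seek($fh, $start, 0) or die "seek failed: $!";

    return $1 if $chunk =~ /(\r\n?|\n)/;    # first of CRLF, bare CR, or LF
    return "\n";                            # nothing found: keep the default
}

# Hypothetical sample: three old-Mac style lines in an in-memory file.
my $data = "first\rsecond\rthird\r";
open(my $fh, '<:raw', \$data) or die "Cannot open in-memory handle: $!";

my @lines;
{
    local $/ = guess_separator($fh);    # readline now splits on the guess
    while (my $line = <$fh>) {
        chomp $line;                    # chomp removes the current $/
        push @lines, $line;
    }
}
close $fh;

print scalar(@lines), " lines read\n";
```

This keeps the line-at-a-time program flow and only ever holds the peeked chunk in memory, at the price of trusting the first line ending.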

    I'm not working with Unix at the moment. But from memory, Perl code to read files line by line between Unix and Windows is the same. When reading a Windows file on Unix, the I/O layer zaps the <CR> and I never see it. When Windows reads a Unix file, it doesn't care that the <CR> isn't there. When writing a line on Unix, Perl writes a <LF> for "\n". When writing a line on Windows, Perl writes a <CR><LF> for "\n".

    Mixed line ending files can happen. When I was working on Unix, my environment allowed me to click on a remote Unix file and edit it with my local Windows editor. Only the lines that I modified wound up with <CR><LF> endings. My editor preserved the existing <LF> terminated lines. Perl and GNU C didn't have an issue with this and I didn't really worry about it. LPR was fussy. I had some simple Perl thing that read a line, chomped it, then printed the line with "\n" (which on output is platform specific). Now that I think about it, it could be that chomp() was unnecessary: the read of the <CR><LF> line would have zapped the <CR> already. There would be no need to remove the <LF> only to add it back in.

    Unix and Windows have <LF> in common and that works well. Ancient Mac with <CR> is a "weird duck".

    use strict;
    use warnings;

    ### setup input file <CR>=0d <LF>=0a
    open (my $fh,'>',"testfilein.txt") or die "$!";
    print $fh "bbb \naaa \r\n";   #62626220 0d0a 61616120 0d0d0a
    close $fh;                    #note spaces are for human reading

    open ($fh,'>',"testfileout.txt") or die "$!";
    binmode $fh;

    #read from input with std <>, write binary to output file
    open (my $fh2, '<', "testfilein.txt") or die "$!";
    while (my $line = <$fh2>)
    {
        print length($line),'_', $line, '|';
        print $fh $line;          #62626220 0a 61616120 0d0a
    }
    close $fh2;
    close $fh;

    print "*** run two...\n";
    # use same read file
    # this time use :raw layer for reading
    # A line ends in <LF> like above, but the <CR> before it
    # (if any) is not removed.
    open ($fh,'>',"testfileout2.txt") or die "$!";
    binmode $fh;
    open ($fh2, '<:raw', "testfilein.txt") or die "$!";
    while (my $line = <$fh2>)
    {
        print length($line),'_', $line, '|';
        print $fh $line;          #62626220 0d0a 61616120 0d0d0a
    }
    __END__
    5_bbb        "bbb "+LF 4+1=5
    |6_aaa       "aaa "+CRLF 4+2=6
    |*** run two...
    6_bbb        "bbb "+CRLF 4+2=6
    |7_aaa       "aaa "+CRCRLF 4+3=7
      when reading a line using "<:raw", Perl is looking for a LF to determine the "end of line"

      It's looking for $/.

      chomp() only operates on LF, not CRLF.

      chomp operates on whatever $/ is set to, including if that's set to CRLF for whatever unusual reason.
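A short sketch of that point, using made-up sample strings: the same "text\r\n" is chomped once with the default $/ and once with $/ localized to CRLF.

```perl
use strict;
use warnings;

my $default = "text\r\n";
chomp $default;              # $/ is "\n" here: only the LF is removed

my $crlf = "text\r\n";
{
    local $/ = "\r\n";       # record separator is now the two-character CRLF
    chomp $crlf;             # removes "\r\n" as a unit
}

printf "default \$/: %d chars left, CRLF \$/: %d chars left\n",
    length($default), length($crlf);
```

With the default separator the \r survives (5 characters left); with $/ set to "\r\n" both characters go (4 left).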

      When reading a Windows file on Unix, the I/O layer zaps the CR and I never see it.

      Only if you explicitly specify the :crlf layer, which your code doesn't do.

      The terminology does get confusing because "\n" as written in Perl on Windows sometimes means CRLF and sometimes it means only LF.

      For about the millionth time: No. Maybe it's finally time to read Newlines in perlport?

        Only if you explicitly specify the :crlf layer, which your code doesn't do.

        Strawberry perl.exe adds the :crlf layer unless you tell it otherwise.

        C:\usr\local\share>perl -MConfig -MPerlIO -le "print for PerlIO::get_layers(STDIN), '-'x10, $Config{myuname}"
        unix
        crlf
        ----------
        Win32 strawberry-perl #1 Thu May 23 12:20:46 2019 x64
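The same check can be made from inside a script for handles opened in different modes. This is a sketch only: it writes a throwaway temp file, and what the plain '<' mode reports depends on the platform and perl build, so no particular output is claimed for it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one CRLF-terminated line to a temp file, byte for byte.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "aaa\r\n";
close $tmp or die "Cannot close $path: $!";

my (%layers, %length);
for my $mode ('<', '<:raw', '<:crlf') {
    open(my $fh, $mode, $path) or die "Cannot open $path with $mode: $!";
    $layers{$mode} = [ PerlIO::get_layers($fh) ];   # layers on this handle
    my $line = <$fh>;
    $length{$mode} = length $line;                   # 5 if <CR> kept, 4 if removed
    close $fh;
    printf "%-7s layers=(%s) length=%d\n",
        $mode, join(' ', @{ $layers{$mode} }), $length{$mode};
}
```

With an explicit :crlf layer the <CR> is removed on any platform, with :raw it is kept, and the plain '<' line shows what the perl in use does by default.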

        (I haven't used ActiveState since the early aughts, so I cannot tell you how the other major Windows port of perl behaves. My vague recollection is that I had never heard of I/O layers back then, but that newlines just worked right, as they do with modern Strawberry, so I am assuming it also sets :crlf for you.)

        update: hmm, you even knew it was on by default on Windows in Re^9: How do I display only matches (from the other conversation you alluded to), so I have to assume I've missed something in the context of this thread. I don't see anything in the posted code that would override that (other than the :raw open, of course)... so I'm more confused than when I first posted this. :-( Maybe I've had too hard of a day, and I should stop trying at this point. Time to go home! :-)
