|laziness, impatience, and hubris|
Re^6: Dealing with files with differing line endingsby Marshall (Canon)
|on Nov 15, 2021 at 20:33 UTC||Need Help??|
I experimented more with this (code attached below). Answering my own question - when reading a line using "<:raw", Perl is looking for a <LF> to determine the "end of line". This is the same thing that it does with the :CRLF layer. The difference is that with :raw, the CR (if any) immediately before the LF is not removed.
The terminology does get confusing because "\n" as written in Perl on Windows sometimes means <CR><LF> and sometimes it means only <LF>.
With the normal I/O layer, the <CR> in <CR><LF> will be removed before your Perl code ever sees the line. chomp() only operates on <LF>, not <CR><LF>.
Running two regexes as you suggest is not necessary, the standard I/O layer does this part: $text=~s/\r\n/\n/g; (remove any <CR> that immediately precedes a <LF>). Translating <CR> to <LF> would get the multiple lines contained within the input string into "normal line format".
So the rub here is that there is no easy way to say "give me a line" no matter old Mac,unix or windows. $/, the input record separator, is a string, not a regex. When you attempt to read a line from a file with <CR> terminated lines, you will get the entire file, not just one line because readline is looking for <LF>. Now having in effect slurped the entire file into one string variable, you can indeed split it up into "real lines". However now we have altered the program flow from reading a line at a time to reading the whole file into a buffer, modifying that buffer (perhaps with tr instead of regex) and then reading that buffer a line at a time.
Anyway, I did not see the need to burden the 99.99999% code with special stuff for this ancient Mac. There are also some memory issues with reading entire files into memory to process them when line by line processing is desired. It would also be possible to read part of the file, determine that \r should be the input record separator, then back up and use that. But that is "complicated".
I'm not working with Unix at the moment. But from memory, Perl code to read files line by line between Unix and Windows is the same. When reading a Windows file on Unix, the I/O layer zaps the <CR> and I never see it. When Windows reads a Unix file, it doesn't care that the <CR> isn't there. When writing a line on Unix, Perl writes a <LF> for "\n". When writing a line on Windows, Perl writes a <CR><LF> for "\n".
Mixed line ending files can happen. When I was working on Unix, my environment allowed me to click on a remote Unix file and edit it with my local Windows editor. Only the lines that I modified wound up with <CR><LF> endings. My editor preserved the exiting <LF> terminated lines. Perl and GNU C didn't have an issue with this and I didn't really worry about it. LPR was fussy. I had some simple Perl thing that read a line, chomped it, then printed line with "\n" (which on output is platform specific). Now that I think about it, it could be that chomp() was unnecessary, the read of the <CR><LF> line would have zapped the <CR> already. There would be no need to remove the <LF> only to add it back in.
Unix and Windows have <LF> in common and that works well. Ancient Mac with <CR> is a "weird duck".