I experimented more with this (code attached below). Answering my own question: when reading a line using "<:raw", Perl looks for a <LF> to determine the end of the line. This is the same thing it does with the :crlf layer. The difference is that with :raw, the <CR> (if any) immediately before the <LF> is not removed.

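The terminology does get confusing because "\n" as written in Perl on Windows sometimes means <CR><LF> and sometimes only <LF>: inside the program it is a single <LF> character, and it becomes <CR><LF> only on disk, when the :crlf I/O layer translates it on output.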

With the normal I/O layer on Windows, the <CR> in <CR><LF> is removed before your Perl code ever sees the line. chomp() removes only a trailing $/ (by default "\n", i.e. <LF>), not <CR><LF>.
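To make the chomp() point concrete, here is a small sketch. The string stands in for a line that was read without the :crlf layer, and strip_eol is a hypothetical helper name used only for illustration, not something from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove one trailing line
# ending, whether it is <CR><LF>, <LF>, or a lone <CR>.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\r\n|\n|\r)\z//;
    return $line;
}

my $line = "aaa \r\n";          # a <CR><LF> line as a :raw read returns it

chomp(my $chomped = $line);     # default $/ is "\n", so only <LF> goes
my $stripped = strip_eol($line);

printf "chomp:     %d chars left\n", length($chomped);   # 5, <CR> still there
printf "strip_eol: %d chars left\n", length($stripped);  # 4
```

The substitution is anchored with \z so it touches only the very end of the string and leaves any <CR> in the middle of the line alone.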

Running two regexes as you suggest is not necessary. The standard I/O layer on Windows already does this part: $text=~s/\r\n/\n/g; (remove any <CR> that immediately precedes a <LF>). Translating the remaining <CR> to <LF> would get the multiple lines contained within the input string into "normal line format".

So the rub here is that there is no easy way to say "give me a line" regardless of whether the file comes from an old Mac, Unix or Windows. $/, the input record separator, is a string, not a regex. When you attempt to read a line from a file with <CR>-terminated lines, you get the entire file, not just one line, because readline is looking for <LF>. Having in effect slurped the entire file into one string variable, you can indeed split it up into "real lines". However, that changes the program flow from reading a line at a time to reading the whole file into a buffer, modifying that buffer (perhaps with tr instead of a regex) and then reading that buffer a line at a time.
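The slurp-then-split flow described above could look roughly like this. It is a sketch only: lines_from_any is a hypothetical helper, the data is an in-memory stand-in for an old Mac file, and a single substitution is used in place of tr.

```perl
use strict;
use warnings;

# Hypothetical helper: take slurped file content with any mix of
# <CR><LF>, <LF> or lone <CR> endings and return the chomped lines.
sub lines_from_any {
    my ($raw) = @_;
    (my $text = $raw) =~ s/\r\n?/\n/g;       # <CR><LF> or lone <CR> -> <LF>
    open(my $mem, '<', \$text) or die "$!";  # in-memory filehandle
    chomp(my @lines = <$mem>);
    close $mem;
    return @lines;
}

my $old_mac = "one\rtwo\rthree\r";

# With the default $/ ("\n") the whole content comes back as one record.
open(my $fh, '<', \$old_mac) or die "$!";
my @records = <$fh>;
close $fh;
print scalar(@records), " record(s) with default \$/\n";

my @lines = lines_from_any($old_mac);
print scalar(@lines), " line(s) after normalizing: @lines\n";
```

Reading the normalized buffer through an in-memory filehandle keeps the rest of the program in its usual line-at-a-time shape, at the cost of holding the whole file in memory.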

Anyway, I did not see the need to burden 99.99999% of the code with special handling for this ancient Mac. There are also memory issues with reading entire files into memory when line-by-line processing is desired. It would also be possible to read part of the file, determine that \r should be the input record separator, then back up and use that. But that is "complicated".
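For what it is worth, the "read part of the file, then back up" idea might be sketched like this. guess_eol and the 4096-byte peek size are assumptions for illustration. One known weakness: if the peek happens to end exactly between a <CR> and its <LF>, the guess would be wrong.

```perl
use strict;
use warnings;

# Hypothetical helper: peek at the start of a :raw handle, guess the
# line terminator from the first one found, then rewind so the caller
# can read from the top. Falls back to "\n" if no terminator is seen.
sub guess_eol {
    my ($fh, $peek) = @_;
    $peek ||= 4096;
    my $got = read($fh, my $buf, $peek);
    die "read failed: $!" unless defined $got;
    seek($fh, 0, 0) or die "seek failed: $!";
    return $buf =~ /(\r\n|\r|\n)/ ? $1 : "\n";
}

# Demo on an in-memory "old Mac" file; a real file opened with
# '<:raw' would be handled the same way.
my $data = "one\rtwo\rthree\r";
open(my $fh, '<:raw', \$data) or die "$!";
{
    local $/ = guess_eol($fh);      # here: "\r"
    while (my $line = <$fh>) {
        chomp $line;                # chomp removes the current $/
        print "[$line]\n";
    }
}
close $fh;
```

This keeps line-by-line reading and avoids slurping, but it only helps when the file uses one kind of ending throughout; a mixed file would still need per-line cleanup.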

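I'm not working with Unix at the moment, so this is from memory. Perl code to read files line by line is the same on Unix and Windows as long as each platform reads its own native files. The <CR> removal comes from the :crlf layer, which is on by default on Windows but not on Unix: reading a Windows file on Unix through the default layers leaves the <CR> at the end of each line unless you open the file with :crlf or strip it yourself. When Windows reads a Unix file, it doesn't care that the <CR> isn't there. When writing a line on Unix, Perl writes a <LF> for "\n". When writing a line on Windows, Perl writes a <CR><LF> for "\n".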
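One way to take the platform defaults out of the picture is to name the layers explicitly. The sketch below writes through :crlf, looks at the bytes with :raw, and reads back through :crlf; the temporary file from File::Temp is just scaffolding for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write through :crlf: each "\n" goes to disk as <CR><LF>.
open(my $out, '>:crlf', $path) or die "$!";
print {$out} "aaa\n";
close $out;

# :raw shows the bytes that are actually on disk.
open(my $raw, '<:raw', $path) or die "$!";
my $on_disk = do { local $/; <$raw> };
close $raw;

# :crlf on input turns <CR><LF> back into "\n".
open(my $in, '<:crlf', $path) or die "$!";
my $line = <$in>;
close $in;

printf "on disk: %d bytes, as read: %d chars\n",
    length($on_disk), length($line);
```

With the layers spelled out in open(), the same script should behave the same way whether it runs on Unix or on Windows.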

Mixed line ending files can happen. When I was working on Unix, my environment allowed me to click on a remote Unix file and edit it with my local Windows editor. Only the lines that I modified wound up with <CR><LF> endings. My editor preserved the existing <LF>-terminated lines. Perl and GNU C didn't have an issue with this and I didn't really worry about it. LPR was fussy. I had some simple Perl thing that read a line, chomped it, then printed the line with "\n" (which on output is platform specific). Now that I think about it, it could be that chomp() was unnecessary: the read of the <CR><LF> line would already have zapped the <CR>, provided the :crlf layer was active, and there would be no need to remove the <LF> only to add it back in.
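A reconstruction of that kind of cleanup filter, written so it does not depend on which layers happen to be active, might look like the following. This is not the original script: copy_with_eol is a hypothetical name and the in-memory handles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, replacing whatever
# ending each line has (<LF>, or <LF> preceded by <CR>s) with $eol.
sub copy_with_eol {
    my ($in, $out, $eol) = @_;
    while (my $line = <$in>) {
        $line =~ s/\r*\n\z//;       # drop <LF> plus any <CR> before it
        print {$out} $line, $eol;
    }
    return;
}

# A mixed file: only the "edited" line picked up a <CR><LF> ending.
my $mixed = "kept\nedited\r\nkept too\n";
my $fixed = '';

open(my $in,  '<:raw', \$mixed) or die "$!";
open(my $out, '>:raw', \$fixed) or die "$!";
copy_with_eol($in, $out, "\n");
close $in;
close $out;

print $fixed eq "kept\nedited\nkept too\n" ? "normalized\n" : "unexpected\n";
```

Using :raw on both sides means the substitution, not the platform, decides what ends up in the output.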

Unix and Windows have <LF> in common and that works well. Ancient Mac with <CR> is a "weird duck".

<code>
use strict;
use warnings;

### setup input file  <CR>=0d  <LF>=0a
open (my $fh,'>',"testfilein.txt") or die "$!";
print $fh "bbb \naaa \r\n"; #62626220 0d0a 61616120 0d0d0a
close $fh;                  #note spaces are for human reading

open ($fh,'>',"testfileout.txt") or die "$!";
binmode $fh;

#read from input with std <>, write binary to output file
open (my $fh2, '<', "testfilein.txt") or die "$!";
while (my $line = <$fh2>)
{
    print length($line),'_', $line, '|';
    print $fh $line;        #62626220 0a 61616120 0d0a
}
close $fh2;
close $fh;

print "*** run two...\n";
# use same read file
# this time use :raw layer for reading
# A line ends in <LF> like above, but the <CR> before it
# (if any) is not removed.
open ($fh,'>',"testfileout2.txt") or die "$!";
binmode $fh;

open ($fh2, '<:raw', "testfilein.txt") or die "$!";
while (my $line = <$fh2>)
{
    print length($line),'_', $line, '|';
    print $fh $line;        #62626220 0d0a 61616120 0d0d0a
}
__END__
5_bbb         "bbb "+LF     4+1=5
|6_aaa        "aaa "+CRLF   4+2=6
|*** run two...
6_bbb         "bbb "+CRLF   4+2=6
|7_aaa        "aaa "+CRCRLF 4+3=7
</code>

In reply to Re^6: Dealing with files with differing line endings by Marshall
in thread Dealing with files with differing line endings by dd-b
