Native newline encoding

by salva (Canon)
on May 22, 2012 at 15:49 UTC

salva has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a way to easily detect the native newline encoding of the OS running my Perl script. Besides using a table to infer it from $^O, is there any other way to detect it?

A CPAN module will also do, but searching on CPAN for newline, crlf, carriage return, etc. doesn't show anything interesting.

Re: Native newline encoding
by kennethk (Abbot) on May 22, 2012 at 15:58 UTC

    You can use PerlIO::get_layers(STDIN), as discussed in Querying the layers of filehandles in PerlIO. Assuming, of course, that no one played with the layers.
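A minimal sketch of that check, assuming that a `crlf` entry in the layer list is a fair sign that text mode on this handle translates to CRLF. `has_crlf_layer` is a hypothetical helper name; `PerlIO::get_layers` itself is built in. It only reports the layers currently on the handle, so an earlier `binmode` or a changed default can alter the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle currently has a :crlf layer.
sub has_crlf_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'crlf' } PerlIO::get_layers($fh);
}

print has_crlf_layer(\*STDIN)
    ? "STDIN has a :crlf layer\n"
    : "no :crlf layer on STDIN\n";
```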

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Native newline encoding
by BrowserUk (Patriarch) on May 22, 2012 at 16:49 UTC

    You can detect what Perl thinks it should output for "\n" on the current platform:

    open O, '>', \$fred;
    { local( $/, $\ ); print O "\n"; }
    close O;
    print unpack 'H*', $fred;
    0a

    Maybe not so useful depending upon your purpose?

    Isn't the whole point of "\n", that by using it, you allow the runtime and OS to figure it out?
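The same experiment as a small function, in case anyone wants to run it under strict. This is only a sketch: `newline_hex_in_memory` is a made-up name, and it inspects what Perl stores for "\n" in an in-memory file, not what ends up on disk.

```perl
use strict;
use warnings;

# Returns the hex dump of what Perl stores for "\n" in an in-memory file.
sub newline_hex_in_memory {
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory file: $!";
    {
        local ( $,, $\ );    # keep print from adding separators of its own
        print {$fh} "\n";
    }
    close $fh or die "Cannot close in-memory file: $!";
    return unpack 'H*', $buffer;
}

print newline_hex_in_memory(), "\n";
```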


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I am extending Net::SFTP::Server to implement version 4 of the SFTP protocol. That version supports opening files in TEXT mode (similar to FTP), and there are two ways to do it: the first is to convert the native newlines to CRLF before sending them over the network, and the second is to tell the client what the native newline sequence is and let it handle the burden of the conversion.

      At this point, it seems to me that the simple solution is the first one: letting Perl read the file in text mode and then applying s/\n/\r\n/. This may be slightly incorrect in some edge cases (for instance, files on Windows with \n line endings) that nobody would care about, so I don't either!
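A rough sketch of that first approach, assuming the text has already been read through a text-mode handle so that every line ending is "\n". `to_wire_text` is a hypothetical name, not part of Net::SFTP::Server, and the /g flag is my assumption that a whole buffer is converted at once.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every "\n" (as read in text mode) to CRLF.
sub to_wire_text {
    my ($text) = @_;
    ( my $wire = $text ) =~ s/\n/\015\012/g;
    return $wire;
}

print unpack( 'H*', to_wire_text("a\nb\n") ), "\n";
```

The edge cases mentioned above still apply: a stray "\r" already in the data is passed through untouched, so "\r\n" read on a Unix box would go out as "\r\r\n".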

        That version supports opening files in TEXT mode (similar to FTP), and there are two ways to do it: the first is to convert the native newlines to CRLF before sending them over the network, and the second is to tell the client what the native newline sequence is and let it handle the burden of the conversion.

        Hm. My reading of the appropriate RFC is slightly different, in that the server can choose whether to send CRLF or a single char line ending of their choice:

        4.3 Determining Server Newline Convention

        In order to correctly process text files in a cross platform compatible way, the newline convention must be converted from that of the server to that of the client, or, during an upload, from that of the client to that of the server.

        Versions 3 and prior of this protocol made no provisions for processing text files. Many clients implemented some sort of conversion algorithm, but without either a 'canonical' on the wire format or knowledge of the servers newline convention, correct conversion was not always possible.

        Starting with Version 4, the SSH_FXF_TEXT file open flag (Section 6.3) makes it possible to request that the server translate a file to a 'canonical' on the wire format. This format uses \r\n as the line separator.

        Servers for systems using multiple newline characters (for example, Mac OS X or VMS) or systems using counted records, MUST translate to the canonical form.

        However, to ease the burden of implementation on servers that use a single, simple separator sequence, the following extension allows the canonical format to be changed.

            string "newline"
            string new-canonical-separator (usually "\r" or "\n" or "\r\n")

        All clients MUST support this extension.

        When processing text files, clients SHOULD NOT translate any character or sequence that is not an exact match of the servers newline separator.

        In particular, if the newline sequence being used is the canonical "\r\n" sequence, a lone \r or a lone \n SHOULD be written through without change.

        And it is down to the clients to convert whatever the server sends to their required local form.
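For what it's worth, here is how I read the client side of that rule, as a sketch only. `from_server_text` is a hypothetical helper, and it assumes the whole text is in one string; a real client would also have to cope with a separator split across two packets.

```perl
use strict;
use warnings;

# Hypothetical client-side helper: translate only exact matches of the
# separator announced by the server; everything else passes through.
sub from_server_text {
    my ( $data, $server_sep, $local_sep ) = @_;
    $data =~ s/\Q$server_sep\E/$local_sep/g;
    return $data;
}

# With the canonical "\r\n" separator, a lone \r is written through unchanged.
print unpack( 'H*', from_server_text( "a\015\012b\015c", "\015\012", "\012" ) ), "\n";
```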

        At this point, it seems to me that the simple solution is the first one: letting Perl read the file in text mode and then applying s/\n/\r\n/. This may be slightly incorrect in some edge cases (for instance, files on Windows with \n line endings) that nobody would care about, so I don't either!

        I whole-heartedly agree, though I would approach that solution in a slightly different manner.

        When TEXT mode is requested:

        1. Open the file in text mode;
        2. Read the file line-by-line using the system default INPUT_RECORD_SEPARATOR ($/);
        3. chomp each line read;
        4. Write to the socket line-by-line, having set the OUTPUT_RECORD_SEPARATOR ($\) to CRLF.

        This way, whatever the local line separator is, it gets taken care of by Perl (or the CRT if you're using XS). And the data is transmitted with the required 'canonical newlines'.
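The four steps above could be sketched like this. It is an illustration under assumptions, not code from Net::SFTP::Server: `send_text` is a hypothetical name, `$in` is assumed to be an ordinary text-mode handle, and `$out` stands in for the socket.

```perl
use strict;
use warnings;

# Hypothetical helper: copy text from $in to $out with CRLF line endings.
sub send_text {
    my ( $in, $out ) = @_;
    binmode $out;              # the wire handle must not translate again
    local $\ = "\015\012";     # appended to every print (step 4)
    while ( my $line = <$in> ) {    # step 2: default $/ on a text-mode handle
        chomp $line;                # step 3
        print {$out} $line;
    }
    return;
}

# Usage with in-memory handles standing in for the file and the socket.
my $source = "one\ntwo\n";
my $wire   = '';
open my $in,  '<', \$source or die "Cannot open source: $!";
open my $out, '>', \$wire   or die "Cannot open sink: $!";
send_text( $in, $out );
close $out or die "Cannot close sink: $!";
print unpack( 'H*', $wire ), "\n";
```

One wrinkle: a last line with no trailing newline still gets a CRLF appended.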

        Clients then do the same in reverse: read from the socket line-by-line, having set their INPUT_RECORD_SEPARATOR ($/) to CRLF; chomp; and write line-by-line using the default newline for their local platform.
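And the client side in the same spirit, again only a sketch. `receive_text` is a hypothetical name, `$in` stands in for the socket, and `$out` is assumed to be a text-mode handle so that printing "\n" produces whatever the local platform wants.

```perl
use strict;
use warnings;

# Hypothetical helper: read CRLF-terminated lines and write them locally.
sub receive_text {
    my ( $in, $out ) = @_;
    binmode $in;               # read the wire bytes untranslated
    local $/ = "\015\012";     # lines on the wire end in CRLF
    local $\;                  # print adds nothing by itself
    while ( my $line = <$in> ) {
        chomp $line;           # removes the CRLF, because chomp follows $/
        print {$out} $line, "\n";
    }
    return;
}

# Usage with in-memory handles standing in for the socket and the file.
my $wire  = "one\015\012two\015\012";
my $local = '';
open my $in,  '<', \$wire  or die "Cannot open source: $!";
open my $out, '>', \$local or die "Cannot open sink: $!";
receive_text( $in, $out );
close $out or die "Cannot close sink: $!";
print unpack( 'H*', $local ), "\n";
```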

        This way, the conversions are taken care of at both ends by perl or the CRT. At least, for ASCII/ANSI/ISO-whatever-that-number-is files that have the 'correct' newlines on the originating platforms.

        Things (will) get far more messy once the RFCs start dealing with Unicrap.


      Under Windoze 7:

      >perl -wMstrict -le "my $fred; open O, '>', \$fred;; { local( $/, $\ ); print O \"\n\";; };; close O;; print unpack 'H*', $fred;; "
      0a

      I assume the result is the same on a *nix system. Anyone care to try the Mac?

        …result is the same on a *nix system…

        You mean like a Mac? :P

        perl -le 'open F,">",\$f; {local($/,$\); print F "\n"}; print unpack "H*", $f'
        0a

        Sorry, I missed your reply due to all the noise created by my erstwhile friend.

        Under Windoze 7:

        My demo was also run under Windows (Vista), so no surprise there :)

        I assume the result is the same on a *nix system.

        Indeed. And that was exactly the point of the demonstration. salva's a *nix man and knows I'm a Windows user, so the significance would not be lost on him.

        Anyone care to try the Mac?

        Since modern Macs are essentially *nix, it'll be the same there also. You'd have to go back to classic Mac OS to see a difference, I think.


      I don't think that really addresses the issue.

      One way to get what he wants would be to write "\n" to a file and then open that file, use binmode(), and then slurp it in and see what you get. It should be, for example, 0d0a on Windows and 0a on *nix.

      I have an inkling that it won't work that way with an in-memory file...

      I don't really do Windows but I can check it on a machine with Strawberry Perl later.

      -sauoq
      "My two cents aren't worth a dime.";
        It should be, for example, 0d0a on Windows ... I don't really do Windows

        That's a bit obvious :)

        It isn't perl(*) that writes the extra character; it is the C runtime (when writing to a data file opened as text). Those extra characters are also stripped by the CRT when reading -- assuming text mode.

        If Perl added them itself, then the CRT would also do it and you'd end up with a real mess.

        perl, and Perl programmers, shouldn't need to concern themselves with the details because, unless they are reading text files in binmode (which they shouldn't be), the addition and removal of the 'extra characters' should be entirely transparent.

        (*) ignoring PerlIO which does; but only because it bypasses the CRT and then emulates it -- the point of which mystifies me, but there it is.
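To see that stripping in isolation, here is a small sketch that pushes PerlIO's `:crlf` layer by hand onto an in-memory handle. `read_as_text` is a hypothetical helper; the explicit layer is my stand-in for what text mode does by default on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: read raw bytes back through an explicit :crlf layer.
sub read_as_text {
    my ($bytes) = @_;
    open my $fh, '<:crlf', \$bytes or die "Cannot open in-memory file: $!";
    my $text = do { local $/; <$fh> };
    close $fh or die "Cannot close in-memory file: $!";
    return $text;
}

# The bytes on "disk" contain CRLF pairs; the program only ever sees "\n".
print unpack( 'H*', read_as_text("a\015\012b\015\012") ), "\n";
```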


Re: Native newline encoding
by sauoq (Abbot) on May 22, 2012 at 19:59 UTC
    Besides using a table to infer it from $^O, is there any other way to detect it?

    It's not neat, but you can actually write a newline to a file and then read it in binmode:

    use warnings;
    use strict;
    open FH, '>', 'out.txt' or die $!;
    print FH "\n";
    close FH;
    open FH, '<', 'out.txt' or die $!;
    binmode(FH);
    my $stuff = do { local $/; <FH> };
    print unpack('H*', $stuff), "\n";

    Tested on Windows, where it prints 0d0a and on Linux where it prints 0a.

    You could clean it up, use File::Temp... whatever. But, unfortunately, you can't get by with an in memory file or IO::Scalar.
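One possible cleaned-up version, as a sketch. `native_newline_bytes` is a made-up name, and the close-and-reopen step is my assumption: it makes sure the file is written through a plain open(), so the platform's default text layers apply whatever File::Temp did to its own handle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "\n" in text mode, read it back as raw bytes.
sub native_newline_bytes {
    my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
    close $tmp_fh or die "Cannot close $path: $!";

    # Reopen with a plain open() so the default text-mode layers apply.
    open my $out, '>', $path or die "Cannot write $path: $!";
    {
        local ( $,, $\ );    # keep print from adding separators of its own
        print {$out} "\n";
    }
    close $out or die "Cannot close $path: $!";

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in or die "Cannot close $path: $!";

    return $bytes;
}

print unpack( 'H*', native_newline_bytes() ), "\n";
```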

      Unfortunately my module may run in environments where creating files and writing to them may be forbidden. Using an in-memory file doesn't seem to work.

      And what happens when the OP tries to use that information to process the line endings in this text file?

      C:\test>od -t x1 nonsense.txt
      0000000 ff fe 55 00 6e 00 20 00 62 00 65 00 61 00 75 00
      0000020 20 00 6a 00 6f 00 75 00 72 00 20 00 61 00 75 00
      0000040 20 00 6d 00 69 00 6c 00 69 00 65 00 75 00 20 00
      0000060 64 00 65 00 20 00 6c 00 61 00 20 00 6e 00 75 00
      0000100 69 00 74 00 2c 00 20 00 0d 00 0a 00 64 00 65 00
      0000120 75 00 78 00 20 00 6d 00 6f 00 72 00 74 00 73 00
      0000140 20 00 6c 00 65 00 73 00 20 00 67 00 61 00 72 00
      0000160 e7 00 6f 00 6e 00 73 00 20 00 73 00 65 00 20 00
      0000200 6c 00 65 00 76 00 61 00 20 00 70 00 6f 00 75 00
      0000220 72 00 20 00 6c 00 75 00 74 00 74 00 65 00 72 00
      0000240 20 00 63 00 6f 00 6e 00 74 00 72 00 65 00 2c 00
      0000260 20 00 0d 00 0a 00 64 00 6f 00 73 00 20 00 e0 00
      0000300 20 00 64 00 6f 00 73 00 20 00 69 00 6c 00 73 00
      0000320 20 00 73 00 65 00 20 00 73 00 6f 00 6e 00 74 00
      0000340 20 00 61 00 66 00 66 00 72 00 6f 00 6e 00 74 00
      0000360 e9 00 73 00 2c 00 20 00 0d 00 0a 00 61 00 20 00
      0000400 61 00 70 00 70 00 65 00 6c 00 e9 00 20 00 6c 00
      0000420 65 00 75 00 72 00 20 00 e9 00 70 00 e9 00 65 00
      0000440 73 00 20 00 65 00 74 00 20 00 61 00 62 00 61 00
      0000460 74 00 74 00 75 00 20 00 75 00 6e 00 20 00 64 00
      0000500 65 00 20 00 6c 00 27 00 61 00 75 00 74 00 72 00
      0000520 65 00 2c 00 20 00 0d 00 0a 00 6c 00 27 00 75 00
      0000540 6e 00 20 00 e9 00 74 00 61 00 69 00 74 00 20 00
      0000560 61 00 76 00 65 00 75 00 67 00 6c 00 65 00 20 00
      0000600 65 00 74 00 20 00 6c 00 27 00 61 00 75 00 74 00
      0000620 72 00 65 00 20 00 70 00 61 00 73 00 2c 00 20 00
      0000640 0d 00 0a 00 76 00 6f 00 69 00 72 00 20 00 73 00
      0000660 69 00 20 00 69 00 6c 00 73 00 20 00 6f 00 6e 00
      0000700 74 00 20 00 63 00 68 00 6f 00 69 00 73 00 69 00
      0000720 20 00 75 00 6e 00 20 00 6d 00 61 00 6e 00 6e 00
      0000740 65 00 71 00 75 00 69 00 6e 00 20 00 70 00 6f 00
      0000760 75 00 72 00 20 00 75 00 6e 00 20 00 61 00 72 00
      0001000 62 00 69 00 74 00 72 00 65 00 2e 00
      0001014

      Oh dear!
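The dump above is UTF-16LE with a BOM, so each line ending is the four bytes 0d 00 0a 00 and a byte-level search for 0d 0a finds nothing. A sketch of one way around that, assuming the file is decoded before any newline handling; `utf16_lines` is a hypothetical helper, and the sample string is made up to resemble the start of that file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode BOM-prefixed UTF-16 first, then split on CRLF.
sub utf16_lines {
    my ($bytes) = @_;
    my $text = decode( 'UTF-16', $bytes );    # the BOM selects the byte order
    return split /\r\n/, $text;
}

# Made-up sample: BOM followed by UTF-16LE text with CRLF line endings.
my $sample = "\xFF\xFE" . encode( 'UTF-16LE', "Un beau jour\r\ndeux morts" );

print index( $sample, "\r\n" ) < 0
    ? "no adjacent CR LF bytes in the raw data\n"
    : "raw CR LF pair found\n";

my @lines = utf16_lines($sample);
print scalar(@lines), " lines after decoding\n";
```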


        And what happens when the OP tries to use that information to process the line endings in this text file?

        Why are you assuming he would? Or even that he would use it to process any line endings? He didn't say that's what his intention was. He just asked about how to find that information. Microsoft and you might want to try to hide that info but I don't see the point. And I do see reasons why one might want to get it.

Re: Native newline encoding
by sauoq (Abbot) on May 23, 2012 at 01:57 UTC
Re: Native newline encoding
by Anonymous Monk on May 22, 2012 at 21:30 UTC
      Interestingly, Encode::Newlines does it as follows:
      use constant Native => ( ($^O =~ /^(?:MSWin|cygwin|dos|os2)/) ? CRLF : ($^O =~ /^MacOS/) ? CR : LF );

      I guess that this, with some special handling for VMS (where the line ending is set per file) and EBCDIC systems (where line endings are the lesser problem), should cover 99% of the cases.
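The same test as a standalone function, for anyone who would rather not pull in the module. This is a sketch only: `newline_for_os` is a hypothetical name, the patterns are copied from the line quoted above, and the VMS and EBCDIC special cases are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the $^O patterns quoted above.
sub newline_for_os {
    my ($os) = @_;
    return "\015\012" if $os =~ /^(?:MSWin|cygwin|dos|os2)/;
    return "\015"     if $os =~ /^MacOS/;
    return "\012";
}

printf "%s => %s\n", $^O, unpack( 'H*', newline_for_os($^O) );
```

Needing no file access, it also fits environments where creating files is forbidden.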

      s/[\r\n]+$//;

      I much prefer s/\s*$//; because one should never write new code that causes trailing whitespace to be significant.

      be strict in what you output: binmode $fh, '...:crlf';

      That seems like something that is quite unlikely to be what one should do. That might make sense when trying to use a Unix system to write a text file that will be used by some MS Windows program(s).

      For the most common case, you should replace that 'binmode' code with this code:

      - tye        

        That seems like something that is quite unlikely to be what one should do. That might make sense when trying to use a Unix system to write a text file that will be used by some MS Windows program(s).

        You mean like this exact situation? user wants notepad.exe to open .ini file and for it to work?
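For that situation, a minimal sketch of writing CRLF line endings whatever the platform, by pushing the `:crlf` layer before printing. `print_crlf_lines` is a hypothetical helper and the .ini content is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: print lines through a :crlf layer so that every
# "\n" leaves the handle as CRLF.
sub print_crlf_lines {
    my ( $fh, @lines ) = @_;
    binmode( $fh, ':crlf' ) or die "Cannot push :crlf layer: $!";
    local ( $,, $\ );    # keep print from adding separators of its own
    print {$fh} "$_\n" for @lines;
    return;
}

# Usage with an in-memory handle standing in for the .ini file.
my $ini = '';
open my $out, '>', \$ini or die "Cannot open in-memory file: $!";
print_crlf_lines( $out, '[section]', 'key=value' );
close $out or die "Cannot close in-memory file: $!";
print unpack( 'H*', $ini ), "\n";
```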

        For the most common case, you should replace that 'binmode' code with this code:

        What code?
