PerlMonks  

Quick and portable way to determine line-ending string?

by bikeNomad (Priest)
on Aug 09, 2001 at 02:21 UTC (#103256, perlquestion)
bikeNomad has asked for the wisdom of the Perl Monks concerning the following question:

You all know, I'm sure, about how platform-specific input and output translation can happen to the "\n" character (on some systems). That is, when reading from a real disk file, there is a sequence of one or more character codes that gets translated into a "\n" character when read (assuming you don't call binmode). The reverse happens on a write to a physical file.
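To make the translation concrete, here is a minimal sketch that imitates what a CRLF platform does in text mode. It assumes a Perl with PerlIO layers (5.8 or later, which is newer than this thread) and uses the `:crlf` layer on in-memory filehandles. The helper names `encode_crlf` and `decode_crlf` are made up for illustration. It only shows the translation itself; it does not discover which sequence the current platform uses.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through a :crlf layer into a scalar
# and return the raw bytes a CRLF platform would put on disk.
sub encode_crlf {
    my ($text) = @_;
    my $raw = '';
    open my $out, '>:crlf', \$raw or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close $out;
    return $raw;
}

# Hypothetical helper: read raw bytes back through :crlf, the way a
# text-mode read on such a platform would.
sub decode_crlf {
    my ($raw) = @_;
    open my $in, '<:crlf', \$raw or die "cannot open in-memory handle: $!";
    local $/;    # slurp the whole buffer
    my $text = <$in>;
    close $in;
    return $text;
}

# Show the byte values that "\n" turns into on the way out.
my $raw = encode_crlf("one\ntwo\n");
print join(' ', map { sprintf '0x%02x', ord } split //, $raw), "\n";
```

The round trip through `decode_crlf` should give back the original "\n"-terminated text, which is the "reverse happens on a write" behaviour described above.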

But what do you do when you have to duplicate this functionality? Say you have a buffer full of bytes that came from somewhere other than a disk file (so you can't use binmode) and you want to split it into lines of text. How do you know what character sequence is used for line endings on a particular operating system?

One obvious way to do this is to use a hash keyed by the value of $^O. But this is quite ugly, requires me to know what all the line endings are, and breaks when someone comes up with a new port of Perl to a different operating system.

Another way might be to write a "\n" to a file in text mode and read it back in in binary mode, but this requires the existence of a writable file (which is not guaranteed) and is slow.

Has anyone come up with a better way to do this than the two methods mentioned above?

Re: Quick and portable way to determine line-ending string?
by nardo (Friar) on Aug 09, 2001 at 02:55 UTC
    Most (all?) platforms will use one or more of 0x0a and 0x0d characters as the newline, so something like:
    sub seperator {
        my $cr = chr(0x0d);
        my $lf = chr(0x0a);
        if ($_[0] =~ /([$cr$lf]+)/o) {
            return $1;
        }
        # unable to find separator, handle error here.
    }
    This assumes that the file will not contain an 0x0a or 0x0d unless it is used as part of the newline, which should be true of a text file.

    Update: If the first occurrence of the newline is multiple newlines (for example "\n\n\nThree lines before me\n") then all of them will match and it will not return the correct line separator. Best to just check for $cr$lf, $lf$cr, $cr, and $lf individually.
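The check suggested in the update could be sketched roughly as follows. The sub name `first_line_ending` is hypothetical, and the sketch assumes the buffer uses one ending consistently; a file with mixed endings will still confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line ending found in $text,
# or undef if the text contains no CR or LF at all.
sub first_line_ending {
    my ($text) = @_;
    # Two-character sequences come first so they win over a lone CR or LF.
    # Unlike a [...]+ character class, this never swallows a run of
    # several consecutive line endings.
    return $text =~ /(\x0d\x0a|\x0a\x0d|\x0d|\x0a)/ ? $1 : undef;
}

my $ending = first_line_ending("\n\n\nThree lines before me\n");
printf "found %d byte(s)\n", length $ending;
```

Because each alternative matches exactly one line ending, leading blank lines no longer inflate the result.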
      Sadly, one can't assume that. For instance, I have seen a number of cases where a supposedly-text file from a Unix system has been edited on a MS-DOSish system and hence contains extra "\x0d" characters.

      Then, of course, there's the question of what to do on EBCDIC systems, where the line endings are likely to be something entirely different.

        If the file contains 0x0d characters then they are characters which are supposed to be part of the line separator and will be caught by the code I wrote. While many unix tools will see the 0x0d as just another character, if you do in fact have a file which has 0x0d 0x0a pairs, you probably want 0x0d 0x0a to be your line separator. If you had mixed 0x0d 0x0a and just 0x0a line separators then the code won't work, but so long as it is consistent it should be fine.

        On an EBCDIC system, the line endings are probably "\r" and "\n", of course. And there is no point in using "\x0a" and "\x0d" in the previous code. The only use for "\x0a" and "\x0d" is when you might run under MacOS and are using something like a network protocol that requires "\r\n". MacOS made the mistake of changing the definition of "\r" and "\n" rather than translating them. All other systems that use non-Unix line endings _translate_ to/from "\n".

        If it weren't for MacOS, "\r" and "\n" would always be the right choice. The move toward "\x0a" and "\x0d" has been motivated by trying to be portable with MacOS and has caused great confusion. Since very few Perl programmers actually work on the even weirder systems like those that use EBCDIC, the folly of this has not been widely noted (CGI.pm is one of the few places that I've seen start to notice this).

                - tye (but my friends call me "Tye")
        And Unicode files that use the new linebreak/parabreak characters! Say it's a UTF-8 encoded file... no 0x0A in sight!

        Like Perl itself, you need to be lenient about reading linebreaks. But you need to know the proper form for writing them.

Re: Quick and portable way to determine line-ending string?
by John M. Dlugosz (Monsignor) on Aug 09, 2001 at 02:58 UTC
    Open a file and write "\n" to it. Re-open the file and use binmode, then read it back in. The result is the desired string.

    Something like this...

    open $foo, ">$tempfilename" or die $!;   # open for writing, in text mode
    print $foo "\n";
    close $foo;    # probably redundant, but why not.
    open $foo, $tempfilename or die $!;
    binmode $foo;
    read $foo, $result, 999;
    close $foo;
    If you don't have a writable file (why not a valid temp directory??!) use a filehandle tied to a text buffer. Say, IO::Scalar.

    Hmm, I tried that and it didn't work, as I half-suspected. The binmode thing is done in the C Standard library functions, and Perl might be relying on that and have no real knowledge of what it means on a given platform.

    Are you sure you can't come up with a writable file, or a fake file that operates on the FD level rather than Perl's tie level?

    —John

      I mentioned doing what you suggest in my original post.

      I had thought about IO::Scalar, but it no-op's binmode(). So you can't use that.

      I believe the translation only happens on real physical disk files (I don't know about pipes, but there's no guarantee that a given platform will have pipes available).

      I'd prefer to stay away from having to write to files, because many of A::Z's users are using it in web servers; writing to a file requires knowing where a temp file can be made, and might slow things down.

        I'm pretty sure it will take effect any time the underlying system's file IO is used, whether it's a real file or whatever.

        As for slowing things down, does the line ending ever change? Figure it out once, and remember it. You could even make that part of installation.

        Here is another idea. You can have a known file containing the line ending candidate and just read it. See which one was transformed into a simple "\n".

        So, open (without binmode) lineend_mac and read it; ditto for lineend_pc and see which was a "\n". Neither really contained just a "\n".

        —John

Re: Quick and portable way to determine line-ending string?
by mr_mischief (Prior) on Aug 09, 2001 at 03:27 UTC
    You hit on a good way to do it in your original post. Pre-populating a hash is effective and fairly simple, even if it is a bit messy. By putting it in a module and putting that on CPAN, the whole Perl community can help with submissions on what their strange systems use for line endings. It might be a little bit of overkill to use a module computationally, but this situation is more data intensive than computationally intensive. Data gathering is always done best by the experts in the particular areas (in this case, differing platforms), with the results being reported to a central repository or person. CPAN or some other joint project system is ideal for this kind of task.

    Chris

    Update: Changed some awkward wording and fixed a tpyo.
      Hmm, maybe that can be part of the Config module? See $Config{trnl} which may already be what we want. Hmm, but mine shows "\\012" (literal backslash) on a Windows system, so maybe it's not actually used.
        I get the same on Windows (Win98 SE, Perl 5.6.0 ActiveState build 623). I get q{\n} on Linux with 5.6.0 though. However, if it's not dependable, then it certainly can't be used in this sort of case. Perhaps this is something p5p could go over. If I think of it, I'll send an email in the morning. I can't think how to word it right now. ;-)

        Chris
Re: Quick and portable way to determine line-ending string?
by mdillon (Priest) on Aug 09, 2001 at 03:34 UTC

    can't you split on the value of $/? (whose value is presumably related to $^O in the Perl source code and hence will always be in synch):

    my @lines = split m#$/#, $content;

    or how about just splitting on any line ending?:

    my @lines = split m#\x0d\x0a?|\x0a#, $content;

    to get around the multiple 0x0d problem, you could add \x0d+\x0a to the alternation as the first alternative (though it will slow things down on a Unix file with a lot of blank lines). come to think of it, \x0d{2}\x0a might be a better idea.
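a minimal sketch of the lenient split with the extra alternative added, assuming the content may mix endings. the sub name `split_lines` is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split $content on any common line ending.
# The \x0d+\x0a alternative is tried first so that stray extra CRs
# in front of an LF are consumed as part of a single line ending.
sub split_lines {
    my ($content) = @_;
    return split /\x0d+\x0a|\x0d\x0a?|\x0a/, $content;
}

my @lines = split_lines("a\x0d\x0d\x0ab\x0ac\x0dd\x0d\x0ae");
print scalar(@lines), " lines\n";
```

note that split drops trailing empty fields by default, so blank lines at the end of the buffer are not returned.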

    for EBCDIC, i think the first solution i mentioned should work.

    for some reason, i feel like i'm missing something fundamental about your question, so if i'm just spouting crazy-talk, please ignore me.

      $/ is set to "\n" by default. Which doesn't answer the question of "what is the representation of "\n" in external text files?".

      This is the kind of thing that'll probably work but I was trying to avoid because it's hard to track and get right:

      my $nativeSeparator = "\n";
      if ($^O =~ /MSWin32|dos|os2|cygwin/) {
          # not sure what to do about cygwin here.
          $nativeSeparator = "\x0d\x0a"
      } elsif ($^O eq 'MacOS') {
          $nativeSeparator = "\x0d"
      } elsif ($^O eq 'VMS') {
          # it depends on file type... what to do?
      } elsif (ord('A') == 193) {
          # what to do for EBCDIC? "\n" may be OK...
      }

        I think EBCDIC is going to be a pain. For one thing, many mainframe systems assume fixed-length records with no newline separator. I thought that different mainframes would use different characters to determine a newline. \x15 and \x25 are two that are used. Also, according to the documentation for Convert::EBCDIC, there is a standard EBCDIC and a version used for OS390 (which may account for the different line endings).

        One problem there is that the EBCDIC Newline doesn't really translate to the ASCII CR or LF. Further, since the 'newline' varies on ASCII systems, I can only imagine that it's going to vary on EBCDIC systems. Admittedly, it's been a while since my mainframe days (no, I wasn't a Y2K boy), but I doubt you'll find a truly universal solution without the user choosing how their newline gets translated.

        Here's an interesting chart of the EBCDIC characters. What the heck is a "Required newline" (\x06)? I sure as heck don't remember that.

        Good luck.

        Cheers,
        Ovid


Re: Quick and portable way to determine line-ending string?
by tachyon (Chancellor) on Aug 09, 2001 at 04:46 UTC

    This appears to work. We use the DATA filehandle so need no permission as Perl opens this for us.

    seek DATA,-6,1;     # back up into __DATA__ string
    binmode DATA;
    $end = <DATA>;
    $end =~ s/.*__//;   # delete everything except the line ending
    for(split//,$end){printf "0x%x\n",ord $_}
    __DATA__

    # prints (on Win32)
    0xd
    0xa

    Make sure there is a \n after the __DATA__

    Does this port?

    This also works as you would expect:

    my $tmp = 'c:/tmp.tmp';
    open TMP, "+>$tmp" or die $!;
    print TMP "\n";
    seek TMP, 0, 0;
    binmode TMP;
    for(split//,<TMP>){printf "0x%x\n",ord $_}
    close TMP;
    unlink $tmp;
    __END__

    # prints (on Win32)
    0xd
    0xa

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      That will tell you the line ending found in that file, which is where it was last edited before being installed. perl will tolerate all kinds of stuff, but that's not necessarily the native line ending of the system it's running on now. That is, you could plop the same file onto a Mac or a PC and still see the same value.
