Re: Quick and portable way to determine line-ending string?
by mr_mischief (Monsignor) on Aug 09, 2001 at 03:27 UTC
|
You hit on a good way to do it in your original post. Pre-populating a hash is effective and fairly simple, even if it is a bit messy. By putting it in a module and putting that on CPAN, the whole Perl community can help with submissions on what their strange systems use for line endings. It might be a little bit of overkill to use a module computationally, but this situation is more data intensive than computationally intensive. Data gathering is always done best by the experts in the particular areas (in this case, differing platforms), with the results being reported to a central repository or person. CPAN or some other joint project system is ideal for this kind of task.
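A minimal sketch of the pre-populated hash idea, assuming the table is keyed on `$^O`. The name `%LINE_ENDING`, the helper `line_ending_for`, and the LF fallback are hypothetical; the entries are only the ones mentioned in this thread, and a real module would collect far more from user submissions.

```perl
use strict;
use warnings;

# Hypothetical lookup table: $^O value => on-disk line ending.
# Only platforms discussed in this thread are listed.
my %LINE_ENDING = (
    MSWin32 => "\x0d\x0a",
    dos     => "\x0d\x0a",
    os2     => "\x0d\x0a",
    MacOS   => "\x0d",
    linux   => "\x0a",
);

# Return the ending for an OS name, falling back to LF for unknown names
# (the fallback is an assumption, not something the thread settles).
sub line_ending_for {
    my ($os) = @_;
    return exists $LINE_ENDING{$os} ? $LINE_ENDING{$os} : "\x0a";
}

my $eol = line_ending_for($^O);
printf "%s: %s\n", $^O, join(' ', map { sprintf '0x%02x', ord } split //, $eol);
```

Keeping the data in one hash means a new platform is a one-line submission rather than a code change.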
Chris
Update: Changed some awkward wording and fixed a tpyo.
|
Hmm, maybe that can be part of the Config module? See $Config{trnl} which may already be what we want. Hmm, but mine shows "\\012" (literal backslash) on a Windows system, so maybe it's not actually used.
|
I get the same on Windows (Win98 SE, Perl 5.6.0 ActiveState build 623). I get q{\n} on Linux with 5.6.0 though. However, if it's not dependable, then it certainly can't be used in this sort of case. Perhaps this is something p5p could go over. If I think of it, I'll send an email in the morning.
I can't think how to word it right now. ;-)
Chris
Re: Quick and portable way to determine line-ending string?
by nardo (Friar) on Aug 09, 2001 at 02:55 UTC
|
Most (all?) platforms will use one or more of 0x0a and 0x0d characters as the newline, so something like:
sub separator
{
    my $cr = chr(0x0d);
    my $lf = chr(0x0a);
    # capture the first run of CR and/or LF characters in the text
    if ($_[0] =~ /([$cr$lf]+)/)
    {
        return $1;
    }
    # unable to find separator, handle error here.
    return;
}
This assumes that the file will not contain an 0x0a or 0x0d unless it is used as part of the newline, which should be true of a text file.
Update: If the first occurrence of the newline is multiple newlines (for example "\n\n\nThree lines before me\n") then all of them will match and it will not return the correct line separator. Best to just check for $cr$lf, $lf$cr, $cr, and $lf individually.
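The individual checks suggested in the update could look roughly like this. It is a sketch, not the poster's code: the sub name `first_separator` is hypothetical, and it still assumes the text uses one separator consistently.

```perl
use strict;
use warnings;

# Return the first line separator found in $text, or undef if there is none.
# Two-byte endings are tried before single bytes, so CR LF is not reported
# as a lone CR, and a run of blank lines yields just one separator.
sub first_separator {
    my ($text) = @_;
    return $text =~ /(\x0d\x0a|\x0a\x0d|\x0d|\x0a)/ ? $1 : undef;
}

my $sep = first_separator("\x0a\x0a\x0aThree lines before me\x0a");
printf "separator is %d byte(s) long\n", length $sep;
```

Because the alternation matches exactly one separator, the leading blank lines in the example no longer get glued together.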
|
Sadly, one can't assume that. For instance, I have seen a number of cases where a supposedly-text file from a Unix system has been edited on a MS-DOSish system and hence contains extra "\x0d" characters.
Then, of course, there's the question of what to do on EBCDIC systems, where the line endings are likely to be something entirely different.
|
On an EBCDIC system, the line endings are probably "\r" and "\n", of course. And there is no point in using "\x0a" and "\x0d" in the previous code. The only use for "\x0a" and "\x0d" is when you might run under MacOS and are using something like a network protocol that requires "\r\n". MacOS made the mistake of changing the definition of "\r" and "\n" rather than translating them. All other systems that use non-Unix line endings _translate_ to/from "\n".
If it weren't for MacOS, "\r" and "\n" would always be the right choice. The move toward "\x0a" and "\x0d" has been motivated by trying to be portable with MacOS and has caused great confusion. Since very few Perl programmers actually work on the even weirder systems like those that use EBCDIC, the folly of this has not been widely noted (CGI.pm is one of the few places that I've seen start to notice this).
-
tye
(but my friends call me "Tye")
|
If the file contains 0x0d characters then they are characters which are supposed to be part of the line separator and will be caught by the code I wrote. While many Unix tools will see the 0x0d as just another character, if you do in fact have a file which has 0x0d 0x0a pairs, you probably want 0x0d 0x0a to be your line separator. If you had mixed 0x0d 0x0a and just 0x0a line separators then the code won't work, but so long as it is consistent it should be fine.
Re: Quick and portable way to determine line-ending string?
by John M. Dlugosz (Monsignor) on Aug 09, 2001 at 02:58 UTC
|
Open a file and write "\n" to it.
Re-open the file and use binmode, then read it back in. The result is the desired string.
Something like this...
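open $foo, '>', $tempfilename or die "can't write $tempfilename: $!";  # text mode, for writing
print $foo "\n";
close $foo; # probably redundant, but why not.
open $foo, '<', $tempfilename or die "can't read $tempfilename: $!";
binmode $foo;              # turn off newline translation for the read
read $foo, $result, 999;   # $result now holds the raw line-ending bytes
close $foo;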
If you don't have a writable file (why not a valid temp directory??!) use a filehandle tied to a text buffer. Say, IO::Scalar.
Hmm, I tried that and it didn't work, as I half-suspected. The binmode thing is done in the C Standard library functions, and Perl might be relying on that and have no real knowledge of what it means on a given platform.
Are you sure you can't come up with a writable file, or a fake file that operates on the FD level rather than Perl's tie level?
—John
|
I mentioned doing what you suggest in my original post.
I had thought about IO::Scalar, but it no-op's binmode(). So you can't use that.
I believe the translation only happens on real physical disk files (I don't know about pipes, but there's no guarantee that a given platform will have pipes available).
I'd prefer to stay away from having to write to files, because many of A::Z's users are using it in web servers; writing to a file requires knowing where a temp file can be made, and might slow things down.
|
I'm pretty sure it will take effect any time the underlying system's file IO is used, whether it's a real file or whatever.
As for slowing things down, does the line ending ever change? Figure it out once, and remember it. You could even make that part of installation.
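A rough sketch of "figure it out once, and remember it", assuming a temp file is acceptable for that one call. The sub name `native_eol` and the variable `$cached_eol` are hypothetical. File::Temp is used only to pick a safe file name; the file is reopened with a plain `open` in case the handle File::Temp returns is not in text mode.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $cached_eol;    # filled in on the first call, reused afterwards

sub native_eol {
    return $cached_eol if defined $cached_eol;

    # Reserve a temp file name, then reopen it ourselves in text mode.
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # Write "\n" through a text-mode handle so any translation is applied.
    open my $out, '>', $path or die "cannot write $path: $!";
    print {$out} "\n";
    close $out or die "cannot close $path: $!";

    # Read the raw bytes back with translation switched off.
    open my $in, '<', $path or die "cannot read $path: $!";
    binmode $in;
    local $/;          # slurp the whole file
    $cached_eol = <$in>;
    close $in;

    return $cached_eol;
}

printf "0x%x\n", ord $_ for split //, native_eol();
```

Only the first call touches the disk; later calls return the remembered string, which addresses the speed concern but not the need for a writable temp directory.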
Here is another idea. You can have a known file containing the line ending candidate and just read it. See which one was transformed into a simple "\n".
So, open (without binmode) lineend_mac and read it; ditto for lineend_pc and see which was a "\n". Neither really contained just a "\n".
—John
Re: Quick and portable way to determine line-ending string?
by mdillon (Priest) on Aug 09, 2001 at 03:34 UTC
|
can't you split on the value of $/? (whose value is presumably related to $^O in the Perl source code and hence will always be in synch):
my @lines = split m#$/#, $content;
or how about just splitting on any line ending?:
my @lines = split m#\x0d\x0a?|\x0a#, $content;
to get around the multiple 0x0d problem, you could add \x0d+\x0a to the alternation as the first alternative (though it will slow things down on a Unix file with a lot of blank lines). come to think of it, \x0d{2}\x0a might be a better idea.
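A small sketch of the second split with `\x0d+\x0a` added as the first alternative, run on a made-up string that mixes endings. The sample `$content` is invented for illustration only.

```perl
use strict;
use warnings;

# Invented sample: CR LF, bare LF, bare CR, and a doubled CR before LF.
my $content = "one\x0d\x0atwo\x0athree\x0dfour\x0d\x0d\x0afive";

# \x0d+\x0a comes first so extra CRs before an LF are swallowed as one ending.
my @lines = split m#\x0d+\x0a|\x0d\x0a?|\x0a#, $content;

print scalar(@lines), " lines: @lines\n";
```

Note that this treats every ending as equivalent, so it splits lines but does not tell you which separator the file actually used.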
for EBCDIC, i think the first solution i mentioned should work.
for some reason, i feel like i'm missing something fundamental about your question, so if i'm just spouting crazy-talk, please ignore me.
|
my $nativeSeparator = "\n";
if ($^O =~ /MSWin32|dos|os2|cygwin/)
{   # not sure what to do about cygwin here.
    $nativeSeparator = "\x0d\x0a";
}
elsif ($^O eq 'MacOS') { $nativeSeparator = "\x0d" }
elsif ($^O eq 'VMS')
{
    # it depends on file type... what to do?
}
elsif (ord('A') == 193)   # 'A' is 193 in EBCDIC, 65 in ASCII
{
    # what to do for EBCDIC? "\n" may be OK...
}
|
I think EBCDIC is going to be a pain. For one thing, many mainframe systems assume fixed-length records with no newline separator. I thought that different mainframes would use different characters to determine a newline. \x15 and \x25 are two that are used. Also, according to the documentation for Convert::EBCDIC, there is a standard EBCDIC and a version used for OS390 (which may account for the different line endings).
One problem there is that the EBCDIC Newline doesn't really translate to the ASCII CR or LF. Further, since the 'newline' varies on ASCII systems, I can only imagine that it's going to vary on EBCDIC systems. Admittedly, it's been a while since my mainframe days (no, I wasn't a Y2K boy), but I doubt you'll find a truly universal solution without the user choosing how their newline gets translated.
Here's an interesting chart of the EBCDIC characters. What the heck is a "Required newline" (\x06)? I sure as heck don't remember that.
Good luck.
Cheers,
Ovid
Re: Quick and portable way to determine line-ending string?
by tachyon (Chancellor) on Aug 09, 2001 at 04:46 UTC
|
seek DATA,-6,1; # back up into __DATA__ string
binmode DATA;
$end = <DATA>;
$end =~ s/.*__//; # delete everything except the line ending
for(split//,$end){printf "0x%x\n",ord $_}
__DATA__
# prints (on Win32)
0xd
0xa
Make sure there is a \n after the __DATA__
Does this port?
This also works as you would expect:
my $tmp = 'c:/tmp.tmp';
open TMP, "+>$tmp" or die $!;
print TMP "\n";
seek TMP, 0, 0;
binmode TMP;
for(split//,<TMP>){printf "0x%x\n",ord $_}
close TMP;
unlink $tmp;
__END__
#prints (on Win32)
0xd
0xa
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
|
That will tell you the line ending found in that file, which is where it was last edited before being installed. perl will tolerate all kinds of stuff, but that's not necessarily the native line ending of the system it's running on now. That is, you could plop the same file onto a Mac or a PC and still see the same value.