Newlines: reading files that were created on other platforms

by bobf (Monsignor)
on Feb 02, 2005 at 00:48 UTC (#427108, perlquestion)
bobf has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a program that is being run on different platforms (Linux, Windows, and Mac). I tried to write it to be as system-independent as I could (using File::Spec for paths, etc), but recently someone reported a bug. It turns out that she was creating one of the input files on a Mac, then transferring it to a Windows machine and running the program (I didn't think of that...). The error occurred when the program tried to read the input file line-by-line. I presume that since the program was run on a Windows machine, the input record separator ($/) was set to the Windows newline (\015\012). The input file was created on a Mac, though, so it had newlines of \015. As a result, the file got slurped and things turned ugly.

Now I'm trying to figure out how to handle this situation. I reread perlport, as well as 3 questions... 2 about newlines, and one on how to be NICE and Line Feeds, but they all seem to address writing specific newline characters to files, not reading them.

Here is what I came up with so far:

  1. Use $^O, but if I understand it correctly that will just tell me about the system the program is running on, which (as exemplified here) is not necessarily the same as the system that created the file.
  2. Use a regex to match the newline character(s) in the file. I think this would require slurping the whole file and then doing something like if( $file =~ m/\015$/ ) (which assumes the file will end with a newline) or if( $file =~ m/\015(?!\012)/ ) (which doesn't), setting $/ according to what matched, and re-reading the file line-by-line.
  3. Preprocess the input file to convert all newline characters to the current system's newline character. I experimented a little, and I think this will work:
    $file =~ s[(\015)?\012(?!\015)][\n]g;
    $file =~ s[(\012)?\015(?!\012)][\n]g;

    I think this is my favorite solution, but it seems like a lot of extra overhead for each input file since the conversion only needs to occur once (assuming the input file is not then moved to another OS).
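
The two substitutions in option 3 can also be collapsed into a single pass. What follows is only a sketch, not the code from the post: the helper name normalize_newlines is made up for illustration, it assumes the data was read without any newline translation (for example after binmode), and it treats CRLF, a lone CR, and a lone LF each as one line break. Unlike the original pair of regexes, it does not treat LF followed by CR as a single break.

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF, lone CR, and lone LF to "\n".
# Assumes $text was read without newline translation (e.g. via binmode).
sub normalize_newlines {
    my ($text) = @_;
    # \015\012? matches CRLF or a lone CR; the alternative matches a lone LF.
    $text =~ s/\015\012?|\012/\n/g;
    return $text;
}

# One string with a Unix, a Windows, and a classic Mac line ending.
my $mixed = "unix\012windows\015\012mac\015last";
my @lines = split /\n/, normalize_newlines($mixed);
print scalar(@lines), " lines: @lines\n";
```

Because CRLF is tried before a lone CR at each position, a Windows ending is never counted as two breaks.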

Are there better ways of handling this?

Thanks!

Comment on Newlines: reading files that were created on other platforms
Select or Download Code
Re: Newlines: reading files that were created on other platforms
by blueberryCoffee (Scribe) on Feb 02, 2005 at 01:23 UTC
    Since \015 and \012 are not used for anything but newlines, why not just look through the file _until_ you see \015 or \012? If \015 is found and followed by \012, you know it is Windows. If \015 is found without \012, you know it's Mac, and \012 by itself is Linux.
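
A minimal sketch of that detection idea, under stated assumptions: the helper name first_newline_style and the %separator table are hypothetical, and the text is assumed to hold raw bytes with no newline translation applied on input. It only reports the first line ending it finds, so it says nothing about files with mixed endings.

```perl
use strict;
use warnings;

# Hypothetical helper: report the first line-ending style found in $text.
# Returns 'crlf', 'cr', 'lf', or undef when the text has no line ending.
# Assumes $text holds raw bytes (no newline translation on input).
sub first_newline_style {
    my ($text) = @_;
    return unless $text =~ /(\015\012|\015|\012)/;
    return $1 eq "\015\012" ? 'crlf'
         : $1 eq "\015"     ? 'cr'
         :                    'lf';
}

# Map each style to the byte sequence that could be assigned to a localized $/.
my %separator = (crlf => "\015\012", cr => "\015", lf => "\012");

my $style = first_newline_style("first line\015second line\015");
print "detected: $style\n" if defined $style;
```

For a large file it would be enough to pass in the first few kilobytes rather than the whole contents.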
      You can't do this, it isn't enough.

      In some situations, mostly ugly ones involving CVS, you can end up with mixed carriage returns in the file. Some Mac person edits 50 lines in the middle and the result is a mixed mess.

      So there are cases where the two-part approach, "detect the first newline and then process assuming it stays the same", just won't be good enough.

      You really do need to do some preprocessing to "localize" all of the newlines.
Re: Newlines: reading files that were created on other platforms
by Anonymous Monk on Feb 02, 2005 at 03:54 UTC
    Unix variants support the file command to determine the file type. I don't know if there is a similar Windows command, but there probably is. If so, can you simply test input files to determine the file type, and adjust the input line separator for that file?
Re: Newlines: reading files that were created on other platforms
by aufflick (Deacon) on Feb 02, 2005 at 04:21 UTC
    I usually do something like the regexes you showed in #3.

    If you are worried about huge file sizes (or you have to handle a stream), you can read the file a byte/char at a time until you find a match for either of your two regexes and then process the line up to that point.
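
A rough sketch of the streaming variant, reading fixed-size chunks rather than single bytes. The helper name each_line and its arguments are hypothetical, and it assumes CRLF, a lone CR, and a lone LF each end one line. The one subtle point is a CR that lands at the very end of a chunk: it may be the first half of a CRLF, so it is held back until more data arrives.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per line (line ending removed)
# for any mix of CRLF, CR, and LF endings, reading $chunk_size bytes at a time.
sub each_line {
    my ($fh, $callback, $chunk_size) = @_;
    $chunk_size ||= 4096;
    binmode $fh;    # raw bytes, no newline translation
    my $buffer = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # A CR at the very end of the buffer may be half of a CRLF split
        # across two chunks, so \015(?!\z) leaves it for the next pass.
        while ($buffer =~ s/\A([^\015\012]*)(?:\015\012|\015(?!\z)|\012)//) {
            $callback->($1);
        }
    }
    # At end of input, a held-back CR is a complete line ending.
    if (length $buffer) {
        $buffer =~ s/\015\z//;
        $callback->($buffer);
    }
    return;
}

# Demonstration on an in-memory file, with a tiny chunk size so that
# the CRLF pair is split across two reads.
my $data = "one\015\012two\015three\012four";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @lines;
each_line($fh, sub { push @lines, $_[0] }, 4);
close $fh;
print join('|', @lines), "\n";
```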

Re: Newlines: reading files that were created on other platforms
by AJRod (Scribe) on Feb 02, 2005 at 04:27 UTC
    I lost my temper on the same problem quite recently and resorted to barbaric (bordering on, if not actually a, "this is not entirely a Perl solution") means. If you intend to display the output to HTML like I intended to, what I did might be of help:

    I simply wrapped the entire "offending" input file text within <pre></pre> tags and displayed it in an iframe in the same page. This preserved the layout of the page while displaying the text in a separate section of the page. It also allowed for the use of a menu of the files, created only once at the time the page is loaded. Clicking on the menu items displayed each file in the iframe without reloading the entire page, unless you need to in case someone updates the directory from which the menu originates. You still need to use a Perl script (aha!) to read the files and spew them as HTML into the iframe.

    Be warned, however, that I haven't tested this on a Mac (which are virtually non-existent in my direct social environment).

    I hope this helps.

Re: Newlines: reading files that were created on other platforms
by adamk (Chaplain) on Feb 02, 2005 at 06:04 UTC
    As usual, CPAN shows the way.

    For about a year, I was in a situation with mixed Windows, Unix, AND Mac carriage returns, and I think I can safely say I've seen just about every screw up there is.

    I evolved a regex over the years, a "universal line separator" that handles all three newline formats, and a couple of common ugly mistakes that happen.

    About a week ago, I rolled it into a CPAN module.

    Go check out File::LocalizeNewlines.

    It's only new, and the recursive mode might not handle binary files cleanly at this point, but all you really need is:

    use File::LocalizeNewlines;
    File::LocalizeNewlines->localize( $filename );
Re: Newlines: reading files that were created on other platforms
by monkey_boy (Curate) on Feb 02, 2005 at 09:32 UTC
    Hi, this works on Linux and Windows (not tried on Mac):

    open(FH,"<:crlf",$file)


    I should really do something about this apathy ... but i just cant be bothered
      Can someone explain this one a little more?
      I did a quick look in Programming Perl and Advanced Perl Programming and did not see anything like this.

      Thanks
        Hi, sorry for the terse answer!
        It's the shorthand for combining these two statements:
        open(FH, "<$file");
        binmode(FH, ":crlf");

        The :crlf layer tells perl to translate CRLF line endings into "\n" as the file is read. It does not translate a lone CR, so files with old Mac line endings still need separate handling.
        There is more info in the Camel book.

        Update
        In the "open" documentation, page 754 in the 3rd edition.

        I should really do something about this apathy ... but i just cant be bothered
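
A small experiment may help show what the :crlf layer does and does not do. This is a sketch, assuming a perl built with PerlIO (5.8 or later); the temporary file and its contents are made up for the demonstration. The PerlIO documentation describes :crlf as converting CR,LF pairs to "\n" on read, so a lone CR should pass through unchanged.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a file with one CRLF ending, one lone CR, and one LF, byte for byte.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "windows\015\012mac\015unix\012";
close $out or die "Cannot close $filename: $!";

# Read it back through the :crlf layer, one "\n"-terminated record at a time.
open my $in, '<:crlf', $filename or die "Cannot open $filename: $!";
my @lines = <$in>;
close $in;

# Show control characters visibly instead of printing them raw.
for my $line (@lines) {
    (my $visible = $line) =~ s/\015/\\r/g;
    $visible =~ s/\012/\\n/g;
    print "[$visible]\n";
}
```

If the "mac" and "unix" records come back glued together with a CR between them, the layer alone is not enough for classic Mac files.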
Re: Newlines: reading files that were created on other platforms
by tbone1 (Monsignor) on Feb 02, 2005 at 13:04 UTC
    Is the person using OS X or the older Mac OS? OS X is BSD Unix underneath, so it should have just the newline "\n".

    Really, though, the problem is probably in the FTPing. I've run into this before, more times than I can count, and the individual needs to be sure to transfer the file in ASCII mode. Any FTP program worthy of the name should know to handle that. In fact, by correcting the user, you might save problems later on another mixed-platform project.

    Just my $.02.

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee

      Nice theory, but unfortunately incomplete. To maintain backward compatibility with the previous decade-and-change worth of documents, GUI apps in OS X (e.g. TextEdit) generally use the Mac line-ending convention (CR), while BSD-derived command-line programs (e.g. perl) use the Unix one (LF). This could be charitably described as a mess.

      I agree with your solution, though—make FTP work right, and the problem should go away. There might be a more specific solution to this specific problem as well, but absent more details on the nature of the program, it's going to be tough to say anything very helpful.



      If God had meant us to fly, he would *never* have given us the railroads.
          --Michael Flanders

Re: Newlines: reading files that were created on other platforms
by msemtd (Scribe) on Feb 02, 2005 at 14:12 UTC
    When reading text of an unknown source that is likely to mix line endings (I find this in the html source of quite a few websites and in WinNT error messages), I tend to preprocess with "tr/\15\12/\n/s;" and then carry on as if nothing was amiss. This does, however, depend _entirely_ on what you want to do with the rest of the data. To demonstrate its usefulness...
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;

    my $text;
    my @e = ("\012", "\015", "\012\015", "\015\012");
    $text .= "this is the line " . ($_ + 1) . $e[$_] for (0..$#e);
    print "Before: " . Dumper $text;
    # Map CR and LF to "\n"; /s squeezes each run into a single "\n",
    # which also collapses blank lines.
    $text =~ tr/\15\12/\n/s;
    print "After: " . Dumper $text;
    Good luck.
Re: Newlines: reading files that were created on other platforms
by periapt (Hermit) on Feb 02, 2005 at 15:00 UTC
    Preprocessing input files is not something to be afraid of. Of course, it depends on the flow of your data but it is sometimes more efficient to split the job into two distinct, simpler parts than to try and code one more complex solution.

    It seems as if you have, approximately, a set of client systems running something like a data entry or data processing system. You are taking the results from these client systems and loading/processing them into some master program. Even if you are just swapping files among systems, a preprocessing step could be very helpful (your mileage may vary). The benefit is that, at a certain point in the flow of data, the data will all look exactly the same regardless of originating/destination platform. That can greatly simplify further processing downstream of that point.

    It's worth looking at more closely.

    PJ
    use strict; use warnings; use diagnostics;

      I'll have to agree with PJ and adamk here. Checking for each type of newline ending would be horrible, but in a pre-processing situation it might actually speed up the processing time. Another suggestion, not that I'm aware of how the input file is obtained, in the creation of the input file, use a specific newline of your choice to make it uniform and use an output method that would determine what to use when the program is running.

      I.E. Program 1 or Section 1 of program takes user input, uses \015 as a standard newline, instead of a system specific newline and creates the input file with that format. Upon use of the input file, the program then spits out system specific newline for "display" of the file.

      "I have said, Ye are gods; and all of you are children of the most High." - Psalms 82:6
Re: Newlines: reading files that were created on other platforms
by bobf (Monsignor) on Feb 02, 2005 at 22:29 UTC

    First of all, thank you for all of the replies.

    To clarify: the input files are just text files created (and edited) by the user. It is possible for a file to be created (or edited) on one platform and then moved to another to be processed, but I expect this to be a rare occurrence. That said, it already happened. :)

    After reading the replies here and thinking about this a little more, I think preprocessing is the best way to go in this case. My initial reluctance towards this approach stems only from the fact that each input file will be reprocessed every time the program is run. Thinking out loud: I would be very surprised if any one input file exceeded 1 MB, so slurping it is not an issue with respect to memory. Maybe I could save time by only writing the file back out if a newline character did not match \n (figuring out how to do that will be the next step - perhaps using binmode to make sure they don't get converted to \n on input (per monkey_boy's suggestion), or something like msemtd's example). I will also look into adamk's File::LocalizeNewlines module.

    periapt and Drgan summed up the implementation and advantage of this approach quite nicely. I'll write a sub to preprocess the files and see how it goes.

    Thanks again for the comments. I appreciate the input.

    bobf
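
One possible shape for the "only write the file back if something changed" idea, offered as a sketch rather than the sub that was eventually written. The name localize_file_newlines and the optional $native argument are hypothetical, and the default assumes CRLF is the local ending on Windows and LF everywhere else. The file is read and written in raw mode so that no I/O layer alters the bytes being compared.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: rewrite $path with the local line ending, but only
# when the converted text differs from what is already on disk.
# Returns 1 if the file was rewritten, 0 if it was left alone.
sub localize_file_newlines {
    my ($path, $native) = @_;
    # Assumption: CRLF on Windows, LF everywhere else.
    $native = ($^O eq 'MSWin32' ? "\015\012" : "\012") unless defined $native;

    # Slurp raw bytes so that no layer translates line endings on the way in.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $original = do { local $/; <$in> };
    close $in;
    $original = '' unless defined $original;

    # CRLF, lone CR, and lone LF each become one native line ending.
    (my $converted = $original) =~ s/\015\012?|\012/$native/g;
    return 0 if $converted eq $original;    # already local, skip the write

    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} $converted;
    close $out or die "Cannot close $path: $!";
    return 1;
}

# Demonstration on a temporary file with classic Mac line endings.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "mac line one\015mac line two\015";
close $fh or die "Cannot close $path: $!";

my $first  = localize_file_newlines($path, "\012");
my $second = localize_file_newlines($path, "\012");
print "first pass rewrote: $first, second pass rewrote: $second\n";
```

Overwriting the file in place is not atomic; writing to a temporary file and renaming it would be safer if the input files matter.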
