PerlMonks  

Re: Handling Mac, Unix, Win/DOS newlines at readtime...

by sauoq (Abbot)
on Sep 16, 2002 at 03:53 UTC (#198143)


in reply to Handling Mac, Unix, Win/DOS newlines at readtime...

I would split on /\r\n?/ instead. That avoids removing blank lines.

Update: In answer to graff's reply, /\r\n?|\n/ will work on all three platforms. I would probably just fix the original files with something based on the first regex I gave though. Better to standardize the files right off the bat. Customizing all sorts of code to deal with all three file types will get old real quick.
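A minimal sketch of that one-time cleanup, using the first regex. The helper name to_unix_newlines is made up for illustration, and the samples are in-memory strings. For real files you would probably want binmode on the handle, since a text-mode read on Windows already turns CRLF into "\n" before your code sees it.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite CRLF and bare CR as LF.
# Existing LFs are not matched, so unix input passes through unchanged.
sub to_unix_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

my $dos  = "one\r\n\r\nthree\r\n";
my $mac  = "one\r\rthree\r";
my $unix = "one\n\nthree\n";

# Each sample should come out identical to the unix version,
# with the blank line in the middle preserved.
for my $sample ($dos, $mac, $unix) {
    print to_unix_newlines($sample) eq $unix ? "ok\n" : "not ok\n";
}
```

Once the files are standardized, the rest of the code only ever has to deal with "\n".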

-sauoq
"My two cents aren't worth a dime.";

Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by graff (Chancellor) on Sep 16, 2002 at 04:12 UTC
    /\r\n?/ will fail to split lines that were created on unix systems. Eliminating blank lines might not be so bad, but if it's an issue, then:
    split(/\r\n|\r|\n/);
    Just doing /[\r\n]{1,2}/ will lose some blank lines on unix or mac input. In the alternation above, it's important to try the longer pattern (\r\n) first.
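To make the blank-line point concrete, here is a small comparison of the two patterns. It is only a sketch: the sample strings are invented, each with one blank line between "a" and "b".

```perl
use strict;
use warnings;

# Each sample has exactly one blank line between "a" and "b".
my %sample = (
    dos  => "a\r\n\r\nb",
    mac  => "a\r\rb",
    unix => "a\n\nb",
);

for my $os (sort keys %sample) {
    # Alternation with the two-character terminator tried first.
    my @ordered = split /\r\n|\r|\n/, $sample{$os};
    # Character-class shortcut: two identical terminators in a row
    # are swallowed as a single match.
    my @class = split /[\r\n]{1,2}/, $sample{$os};
    printf "%-4s ordered=%d class=%d\n", $os, scalar @ordered, scalar @class;
}
```

The alternation returns three fields for every sample. The character class returns three only for the DOS sample and two for the mac and unix ones, because "\r\r" or "\n\n" is consumed as one separator and the blank line disappears.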
      but what if a file was created on a Windows machine, but this code was being run on a Mac?

      I remember reading somewhere in this thread that \r and \n have reversed semantics on the Mac (vs. *nix, Windows).

      So maybe we really want the following: split(/ \r\n | \n\r | \r | \n /x); # (yoicks!)

      My $0.02,

      -- jkahn

        "but what if a file was created on a Windows machine, but this code was being run on a Mac?"

        It wouldn't matter which type of system was running the perl code.

        "I remember reading somewhere in this thread that \r and \n have reversed semantics on the Mac (vs. *nix, Windows)."

        Um, no, that statement hasn't been made on this thread. My own experience has been that MS systems use "\r\n", all .n.x systems use "\n" and (older) Mac systems use "\r". Nobody uses "\n\r".

        And now that Mac OS X is out with a unix foundation, maybe the number of variants will drop from three to just two.

        Yes, Macs have a backwards notion of what \r and \n are in ASCII (was this changed in OS X?). However, if the original poster is running the Perl script on a *nix or Windows box, it shouldn't matter.

        BTW--My favorite way of dealing with the Mac's reversed notion of CR and LF is to use the octal ASCII value instead. \r = \015, \n = \012 (IIRC). You'll probably have issues with Unicode, though.
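A rough sketch of the octal approach, for anyone who wants to see it spelled out. The variable names $CR, $LF and $eol are just illustrative. The idea is that the pattern names the bytes by value, so it should not matter how the platform running the script maps "\r" and "\n".

```perl
use strict;
use warnings;

# Name the bytes by value instead of relying on "\r" and "\n".
my $CR = "\015";   # carriage return, decimal 13
my $LF = "\012";   # line feed, decimal 10

# CRLF is listed first so it is consumed as one terminator.
my $eol = qr/$CR$LF|$CR|$LF/;

# One string mixing all three conventions.
my @lines = split $eol, "first${CR}${LF}second${CR}third${LF}fourth";
print join('|', @lines), "\n";
```

This splits the mixed string into four fields. It does nothing about the Unicode caveat mentioned above; it only covers the plain CR and LF bytes.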

Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by bart (Canon) on Sep 16, 2002 at 23:44 UTC
    "I would split on /\r\n?/ instead. That avoids removing blank lines."
    But not on a Mac. On a Mac, the meaning of "\n" and "\r" got reversed. "\n" is what you use as native end-of-line characters, remember? And on a Mac, that's chr(13).

    Also, as people tend to forget to upload their HTML as text, you often get sequences of two CR characters and one LF. You want to deal with that, too. So here's my solution:

    /\015\015?\012|\015|\012/
    which you might want to replace with "\n" using s///g, instead of splitting on it, so you get one cleaned up string, to feed into HTML::Parser or similar.
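A minimal sketch of the s///g variant. The helper name clean_newlines and the sample markup are invented for the example; the pattern is the one given above.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse CR CR LF, CR LF, bare CR and bare LF
# into the native "\n", returning one cleaned-up string.
sub clean_newlines {
    my ($html) = @_;
    $html =~ s/\015\015?\012|\015|\012/\n/g;
    return $html;
}

# Sample with a CR CR LF (the mis-uploaded case), a CR LF and a bare CR.
my $mangled = "<p>one</p>\015\015\012<p>two</p>\015\012<p>three</p>\015";
my $clean   = clean_newlines($mangled);

# Count the newlines left after cleanup.
my $count = () = $clean =~ /\n/g;
print "$count newlines\n";
```

Each of the three terminators becomes a single "\n", so the CR CR LF sequence does not turn into a spurious blank line. The cleaned string could then be handed to HTML::Parser or similar.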
