line endings in remote documents

by Amoe (Friar)
on Dec 20, 2001 at 20:40 UTC ( #133506=perlquestion )
Amoe has asked for the wisdom of the Perl Monks concerning the following question:


I've been playing with an update feature on one of my scripts. It fetches a raw text file from a website, which is a newline-delimited list of other resources my program can use. I should store this locally. Thing is, my program will already know about some of the URLs in the list. It builds a hash at startup called %resources, with the keys being the locations it knows about and the values being 1, so that I can just look a URL up to see whether I know about it already. So what I want to do, then, is get the file, parse it, filter out the stuff I know about, and then append the new stuff to the end of the local file. Easy enough, thought I. I started off with this:

    my $repository = $switches{u} =~ /^http:\/\// ? $switches{u} : 'http://';
    my $raw = get_object($repository, 'pronbot_update_tgps');
    my @lines = split /\n/, $raw;
    my @new;
    !$resources{$_} && push(@new, $_) foreach (@lines);

Then, I thought, @new would contain all the unknown URLs, and it would then be trivial to join them with newline and write them to the file. I've stumbled across a problem, though. When I've made @lines, all its elements seem to have the string '\cM' at the end of them. I'm running my code on Windows.

I think this must be an OS line-ending problem. I thought that "\n" adapted to that, though, and that split would remove these when I split on that pattern. Apparently not, and I haven't got a clue how to do it apart from looping through and removing the literal pattern, and I figure there must be a better way than this. What if someone who uses it wants to change the repository via -u, to a server which uses different line endings? That may well mess up the code if I just remove the pattern.

Anyone able to enlighten me about this baffling problem?

my one true love

Replies are listed 'Best First'.
Re: line endings in remote documents
by Juerd (Abbot) on Dec 20, 2001 at 20:51 UTC
    To substitute Macintosh (\cM), DOS (\cM\cJ) and Unix (\cJ) linefeeds, use s/\r\n?|\n/(something)/g.
    If you just want to remove them, tr/\cM\cJ//d is a lot faster.

    2;0 juerd@ouranos:~$ perl -e'undef christmas'
    Segmentation fault
    2;139 juerd@ouranos:~$
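
To make the difference between the two operators concrete, here is a minimal sketch. The sample string in $raw is made up for illustration, and the code assumes an ASCII platform. The substitution normalizes each kind of line ending to a single LF, while tr///d deletes the CR and LF characters entirely, so the lines run together.

```perl
use strict;
use warnings;

# Hypothetical sample: three lines, each with a different line ending.
my $raw = "one\cM\cJtwo\cMthree\cJ";

# Normalize every CRLF, lone CR, or lone LF to a single LF.
(my $normalized = $raw) =~ s/\cM\cJ?|\cJ/\cJ/g;

# Delete all CR and LF characters outright (this joins the lines together).
(my $stripped = $raw) =~ tr/\cM\cJ//d;

# Count the LF characters left after normalizing.
my $lf_count = () = $normalized =~ /\cJ/g;

print "normalized has $lf_count line ending(s)\n";
print "stripped: $stripped\n";
```

So tr///d is only the right tool when the string holds a single line, or when you really do want the line breaks gone rather than unified.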

(tye)Re2: line endings in remote documents
by tye (Sage) on Dec 20, 2001 at 23:55 UTC

    The conversion, under Windows, from "\r\n" to "\n" happens (only!) when reading from a file that isn't in binmode. It sounds like you are reading these file contents over a socket, and sockets are always in binmode (since you don't want to assume that the system on the other end of that socket is using the same line endings as you). So if the remote system writes "\r\n" into the socket, then Perl will read "\r\n" from that socket.

    If it weren't for MacOS's bad design decisions, then this would be fairly easy to deal with. If you don't mind blank lines being dropped, then you can split on /[\r\n]+/ to work around it.

    If you don't ever intend to run your code on a Mac (where Perl's "\n" and "\r" are swapped), then you can split on /\r*\n|\r/ to handle a wide variety of cases (unfortunately, finding a line ending of "\r\r\n" isn't that hard to do). If you don't even intend to run your code on a non-ASCII system, then you can split on /\cM*\cJ|\cM/ and be happy even if your code is run on a Mac.

            - tye (but my friends call me "Tye")
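
As a rough sketch of how the last pattern could slot into the original question: the example URLs and the contents of %resources below are invented for illustration, and $raw stands in for whatever get_object returns. It assumes an ASCII platform, as tye notes.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the fetched document and the known-URL hash.
my $raw = "http://a.example/\cM\cJhttp://b.example/\cMhttp://c.example/\cJ";
my %resources = ('http://a.example/' => 1);

# Split on CRLF (with any number of stray CRs), a lone LF, or a lone CR.
my @lines = split /\cM*\cJ|\cM/, $raw;

# Keep only non-empty entries that are not already known.
my @new = grep { length($_) && !$resources{$_} } @lines;

print "$_\n" for @new;
```

Using grep here is just one way to write the filter; the foreach/push loop from the question would work equally well once the split pattern is fixed.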
Re: line endings in remote documents
by ehdonhon (Curate) on Dec 20, 2001 at 20:49 UTC

    Yeah, MS platforms add a carriage return and a newline to the end of every line. You could try chomp()'ing all of the contents of @new. I'm not sure if chomp will get rid of the carriage returns or not.

    On a design note, it appears that your algorithm assumes that once a resource has been provided, it never goes away. By that, I mean that since you are always appending, you can't remove a resource by removing it from your website. That may not be an issue for you, but I thought I'd point it out.

      chomp will remove $/. Which will be set to \n\r by default on windoze. For cross-platformness try something like: split/(?:\n|\r|(?:\r\n))/. That should cover all that I know of (Unix \n, Microsoft \r\n and Macintosh \r).

      UPDATE: Fixed MS line endings *sigh* Je suis tired pantalons

      perl -pe "s/\b;([st])/'\1/mg"

        The other way around :)

        Mac  CR   \015     \x0D     \r
        DOS  CRLF \015\012 \x0D\x0A \r\n
        *Nix LF   \012     \x0A     \n
        (Assuming \r is chr(13) and \n is chr(10), which isn't always true)

        The regex to substitute them all would be s/\cM\cJ|\cM|\cJ/$foo/g (the two-character alternative has to come first, or a CRLF would match as two separate line endings), which can be simplified to s/\cM\cJ?|\cJ/$foo/g. But if you don't need to substitute, removing can be done a lot faster by just using tr/\cM\cJ//d (the /d will have tr/// delete characters not found in the replacement pattern (the replacement pattern is empty in this example)).

        2;0 juerd@ouranos:~$ perl -e'undef christmas'
        Segmentation fault
        2;139 juerd@ouranos:~$

        No, no, no!! $/ will be "\n" by default on Windows, just like it is (nearly?) everywhere else!

                - tye (but my friends call me "Tye")
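
A small sketch of what that means for chomp, assuming $/ has been left at its default and the code runs on an ASCII platform. The sample line is made up. It shows chomp removing the trailing LF and leaving the CR behind, while an explicit substitution removes both.

```perl
use strict;
use warnings;

# A line as it might arrive over a socket from a server that uses CRLF.
my $line = "http://a.example/\cM\cJ";

# chomp removes one trailing copy of $/, which is "\n" unless changed.
chomp(my $chomped = $line);

# Stripping any run of trailing CR/LF characters handles CRLF as well.
(my $cleaned = $line) =~ s/[\cM\cJ]+\z//;

printf "after chomp: %s\n", $chomped =~ /\cM\z/ ? "CR still present" : "clean";
printf "after s///:  %s\n", $cleaned =~ /\cM\z/ ? "CR still present" : "clean";
```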

        Or, if you're sure no \ns or \rs will appear in the middle of lines, you can do it even more simply:

        split /[\n\r]+/, ...
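
One thing to keep in mind with the character-class version is that it collapses consecutive line endings, so blank lines disappear. A quick sketch with an invented three-line input (the middle line is empty), assuming a platform where "\r" is CR and "\n" is LF:

```perl
use strict;
use warnings;

# Hypothetical input: "first", an empty line, then "third", all CRLF-terminated.
my $raw = "first\r\n\r\nthird\r\n";

# Any run of CR/LF characters counts as one separator: the blank line is lost.
my @collapsed = split /[\n\r]+/, $raw;

# One CRLF, CR, or LF per separator: the blank line survives as an empty string.
my @preserved = split /\r\n?|\n/, $raw;

printf "collapsed: %d fields\n", scalar @collapsed;
printf "preserved: %d fields\n", scalar @preserved;
```

For a list of URLs the collapsing behaviour is probably what you want anyway, since an empty entry is not a useful resource.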
