PerlMonks  

(Regex Madness) And you thought whitespace was easy.

by japhy (Canon)
on Aug 15, 2001 at 08:17 UTC (#104947, perlmeditation)

So on OpenProjectsNet IRC today, someone asked how to scrunch chunks of whitespace into a single space, unless the chunk was "\n\n" (in which case it doesn't get altered). So I gave him:
s/(\s+)/$1 eq "\n\n" ? $1 : " "/eg;
That was fine. But I wondered, how should "\t\n\n" or "\n\n\t" be handled? Well, I came up with a truly hideous, yet truly working, regex. One regex. With no /e modifier. One catch: variable-width look-behinds. How did I get around it? Why, sexeger of course. This was another use of that technique to solve an interesting problem. You see, "\n\n\n" should match the "\n\n" as one unit (leaving it intact) but then "\n" as a chunk to be turned into a single space. However, "\n\n\n\n" should be seen as two "\n\n" units.
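To make the edge cases concrete, here is a small sketch of what the /e one-liner does with them. The helper names `squish_naive` and `show` are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner given on IRC.
sub squish_naive {
    my ($text) = @_;
    $text =~ s/(\s+)/$1 eq "\n\n" ? $1 : " "/eg;
    return $text;
}

# Make newlines and tabs visible when printing.
sub show {
    my ($s) = @_;
    $s =~ s/\n/\\n/g;
    $s =~ s/\t/\\t/g;
    return $s;
}

for my $in ("a \t b", "a\n\nb", "a\t\n\nb", "a\n\n\nb", "a\n\n\n\nb") {
    printf "%-12s -> %s\n", show($in), show(squish_naive($in));
}
# a \t b       -> a b
# a\n\nb       -> a\n\nb
# a\t\n\nb     -> a b
# a\n\n\nb     -> a b
# a\n\n\n\nb   -> a b
```

Because `\s+` grabs the whole run, only a run that is exactly "\n\n" survives; three or four newlines, or a tab next to a blank line, all collapse to one space.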

The problem is that when I come across a newline, I need to see if it is preceded by an even number of newlines. Ordinarily, I'd say /\n(?=(?:\n\n)*(?!\n))/ to denote "a newline followed by an even number of newlines". Sadly, I can't use that for look-behind: /(?<=(?<!\n)(?:\n\n)*)\n/ doesn't work because it's variable width. Solution? Reverse it.
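A quick sketch of the two halves of that paragraph, under my own assumptions: `even_followed_newlines` is a hypothetical helper that reports where the look-ahead version matches, and the look-behind version is compiled from a string inside `eval` so the failure can be caught at run time. The exact error text depends on the Perl version, so it is printed rather than assumed.

```perl
use strict;
use warnings;

# Offsets of every newline that is followed by an even number
# (0, 2, 4, ...) of further newlines.
sub even_followed_newlines {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\n(?=(?:\n\n)*(?!\n))/g) {
        push @offsets, $-[0];    # start offset of this match
    }
    return @offsets;
}

print join(",", even_followed_newlines("a\n\n\nb")), "\n";   # 1,3
print join(",", even_followed_newlines("a\n\nb")), "\n";     # 2

# The mirrored look-behind. Single quotes keep \n as regex syntax;
# compiling it at run time lets eval trap the error.
my $lookbehind = '(?<=(?<!\n)(?:\n\n)*)\n';
my $ok = eval { my $re = qr/$lookbehind/; 1 };
print $ok ? "look-behind compiled\n" : "look-behind rejected: $@";
```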

So I offer this regex:

($_ = reverse) =~ s{
  (?:
    [\r\t\f ]+            # non-\n whitespace
    |                     # OR
    (?<!\n)               # not preceded by a \n
    \n                    # match a \n
    (?=                   # that's followed by...
      (?:\n\n)* (?!\n)    # an even number of \n's
    )
  )+                      # one or more times
}{ }xg;                   # turn it into a single space
$_ = reverse;
Whew.
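For anyone who wants to try it on something other than `$_`, here is a minimal sketch that wraps the same pattern in a function. The name `squish_reversed` is hypothetical; the regex is the one above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: reverse, substitute, reverse back.
sub squish_reversed {
    my ($text) = @_;
    my $r = scalar reverse $text;
    $r =~ s{
        (?:
            [\r\t\f ]+                          # non-newline whitespace
          |
            (?<!\n) \n (?= (?:\n\n)* (?!\n) )   # the odd newline of a run
        )+
    }{ }xg;
    return scalar reverse $r;
}

my %expect = (
    "a \t b"     => "a b",
    "a\nb"       => "a b",
    "a\n\nb"     => "a\n\nb",
    "a\n\n\nb"   => "a\n\n b",
    "a\n\n\n\nb" => "a\n\n\n\nb",
    "a\t\n\nb"   => "a \n\nb",
);
for my $in (sort keys %expect) {
    print squish_reversed($in) eq $expect{$in} ? "ok\n" : "not ok\n";
}
```

Every line should print "ok": pairs of newlines stay put, and the leftover newline of an odd run ends up as a space after the pairs.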

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re (tilly) 1: (Regex Madness) And you thought whitespace was easy.
by tilly (Archbishop) on Aug 15, 2001 at 13:00 UTC
    Alternate solution without reverses:
    s/(?=\s)\s*?(?:(\n\n)|(?=\S))/$1||" "/eg;
    UPDATE
    A better alternate solution. This will handle Mac, Unix, and DOS line endings from any of those OSs, and it will also handle blank lines even if they have whitespace. I have broken it out so people can see the tricks.
    s/
      (?=\s)                 # Be sure we will match some whitespace
      \s*?                   # Match as little whitespace as we can
      (?:                    # To either:
        (                    #   Capture blank line...?
          (?:\r\n?|\n\r?)    #   First line ending.
          \s*?               #   Any whitespace on the line.
          (?:\r\n?|\n\r?)    #   Second line ending.
        )                    #   End capture.
      |                      # or
        (?=\S)               #   End of the whitespace.
      )                      #
    /
      $1 || " "              # Either the blank line or a space.
    /egx;
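A small sketch for experimenting with the second pattern. The function name `squish_keep_blank_lines` and the sample inputs are my own; the regex is the one from the reply, only with brace delimiters.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the line-ending-aware substitution.
sub squish_keep_blank_lines {
    my ($text) = @_;
    $text =~ s{
        (?=\s) \s*?                  # some whitespace, as little as possible
        (?:
            (                        # a blank line: two line endings with
              (?:\r\n?|\n\r?)        # optional whitespace in between
              \s*?
              (?:\r\n?|\n\r?)
            )
          |
            (?=\S)                   # or the end of the whitespace run
        )
    }{ $1 || " " }gex;
    return $text;
}

print squish_keep_blank_lines("a  \t b")    eq "a b"          ? "ok\n" : "not ok\n";
print squish_keep_blank_lines("a\n\n\nb")   eq "a\n\n b"      ? "ok\n" : "not ok\n";
print squish_keep_blank_lines("a\r\n\r\nb") eq "a\r\n\r\nb"   ? "ok\n" : "not ok\n";
print squish_keep_blank_lines("a \n \n b")  eq "a\n \n b"     ? "ok\n" : "not ok\n";
```

Note from the last case that the captured blank line is put back verbatim, including any whitespace that was on it.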
Re: (Regex Madness) And you thought whitespace was easy.
by runrig (Abbot) on Aug 15, 2001 at 08:49 UTC
    This is cheating a little, but quicker:
    # If you have any spare characters to use
    # (let's say '~')
    sub my_squish {
        local $_ = shift;
        s/\n{2}/~/g;
        s/\s+/ /g;
        s/~/\n\n/g;
        $_;
    }
    Update: True enough japhy, but I said I was cheating :) And even though it scans the string three times, it's simple and quicker than doing the reversing, lookaheads & lookbehinds necessary to do it in one regex.
      It requires you to scan your string for a substring not yet used. That requires you scan the string at least once, or use a string that is one character LONGER than your string.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
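One way to live with that objection is to pay for the extra scan explicitly. The sketch below is only an assumption about how that might look: `squish_with_marker` is a hypothetical name, and the candidate markers (control characters chr(1) through chr(8), none of which match `\s`) are my choice, not something proposed in the thread.

```perl
use strict;
use warnings;

# Hypothetical variant of the placeholder trick that first looks for a
# character the input does not contain.
sub squish_with_marker {
    my ($text) = @_;

    # First non-whitespace control character absent from the text.
    my ($mark) = grep { index($text, $_) < 0 } map { chr($_) } 1 .. 8;
    die "no spare marker character available\n" unless defined $mark;

    $text =~ s/\n{2}/$mark/g;       # protect each "\n\n"
    $text =~ s/\s+/ /g;             # collapse everything else
    $text =~ s/\Q$mark\E/\n\n/g;    # restore the protected pairs
    return $text;
}

print squish_with_marker("a\n\n\nb \t c") eq "a\n\n b c" ? "ok\n" : "not ok\n";
```

The `index` checks are the scan japhy mentions; the function simply gives up if all eight candidates are already in use.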

        How about this then:
        join "\n\n", map { s/\s+/ /g; $_ } split(/\n{2}/, $_, -1);
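A sketch of the split/join idea as a function, for comparison with the other versions. The name `squish_split` is hypothetical, and the inner substitution here uses /g so that every whitespace run in a piece is collapsed, not just the first.

```perl
use strict;
use warnings;

# Hypothetical wrapper: split on "\n\n", squish each piece, rejoin.
sub squish_split {
    my ($text) = @_;
    return join "\n\n",
        map { my $piece = $_; $piece =~ s/\s+/ /g; $piece }
        split /\n{2}/, $text, -1;    # -1 keeps trailing empty pieces
}

print squish_split("a\n\n\nb  c")  eq "a\n\n b c"      ? "ok\n" : "not ok\n";
print squish_split("a\n\n\n\nb")   eq "a\n\n\n\nb"     ? "ok\n" : "not ok\n";
print squish_split("a\t\n\nb")     eq "a \n\nb"        ? "ok\n" : "not ok\n";
```

No spare character is needed, since the "\n\n" separators never appear in the pieces being squished.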
Re: (Regex Madness) And you thought whitespace was easy.
by John M. Dlugosz (Monsignor) on Aug 15, 2001 at 10:03 UTC
    How about this one: \n\n is a paragraph break which stays such, but other whitespace is collapsed. So..

    When we type text, we might leave more than one blank line, but that still indicates one paragraph break. Or the line might not be empty but have a space on it, etc.

    So, a whitespace sequence that contains at least two \n characters is replaced by \n\n, all other whitespace sequences are replaced by a single space, and leading/trailing whitespace is thrown away.

    —John
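Those three rules can be sketched directly. This is just one reading of the description above, with a made-up function name (`normalize_paragraphs`); it counts the newlines in each whitespace run and decides from there.

```perl
use strict;
use warnings;

# Hypothetical implementation of the paragraph-break rules.
sub normalize_paragraphs {
    my ($text) = @_;

    # Leading and trailing whitespace is thrown away.
    $text =~ s/\A\s+//;
    $text =~ s/\s+\z//;

    # Each remaining run becomes "\n\n" if it holds two or more
    # newlines, and a single space otherwise.
    $text =~ s{(\s+)}{
        my $run      = $1;
        my $newlines = ($run =~ tr/\n//);   # tr with an empty list only counts
        $newlines >= 2 ? "\n\n" : " ";
    }ge;

    return $text;
}

my $out = normalize_paragraphs("  one  two \n \n\n three\n");
print $out eq "one two\n\nthree" ? "ok\n" : "not ok\n";
```

Unlike the versions earlier in the thread, this one never needs look-behind or reversal, because a run is judged as a whole rather than newline by newline.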

(tye)Re: (Regex Madness) And you thought whitespace was easy.
by tye (Sage) on Aug 15, 2001 at 22:38 UTC

    "\n\n\n" eq reverse "\n\n\n", so I don't see how reversing helps here. Could you explain more?

            - tye (but my friends call me "Tye")
      If I don't reverse the string, then "\n\n\n" gets turned into " \n\n". If I reverse the string, do the change, and reverse it again, I get "\n\n " instead.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
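The difference is easy to check side by side. In this sketch the pattern is stored once with `qr` and applied both ways; `squish_forward` and `squish_reversed` are hypothetical names for the two variants.

```perl
use strict;
use warnings;

# The chunk pattern from the root node, compiled once.
my $chunk = qr{
    (?:
        [\r\t\f ]+
      |
        (?<!\n) \n (?= (?:\n\n)* (?!\n) )
    )+
}x;

# Apply the pattern to the string as-is.
sub squish_forward {
    my ($text) = @_;
    $text =~ s/$chunk/ /g;
    return $text;
}

# Apply it to the reversed string, then reverse back.
sub squish_reversed {
    my ($text) = @_;
    my $r = scalar reverse $text;
    $r =~ s/$chunk/ /g;
    return scalar reverse $r;
}

print squish_forward("a\n\n\nb")  eq "a \n\nb" ? "ok\n" : "not ok\n";
print squish_reversed("a\n\n\nb") eq "a\n\n b" ? "ok\n" : "not ok\n";
```

In both cases the pattern turns the first newline it sees in an odd run into a space; reversing only changes which end of the run counts as "first".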
