http://www.perlmonks.org?node_id=312141

SavannahLion has asked for the wisdom of the Perl Monks concerning the following question:

I'm hoping someone can clarify some behavior for me.
I have the following two code blocks I'm fiddling with. First I tried the following block.
    my $phrase = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase =~ s/[\S*\W*]//g;
    print $phrase;

For some reason, the regex destroys the entire line. Therefore, print just prints a blank line. So after some fiddling I came up with the following block.
    my $phrase = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase =~ s/[^\s*\w*]//g;
    print $phrase;

Which does exactly what I was aiming for in the first place. It produces the following line:
    This is a test using quotes of two different types
All quotes, periods, and everything else have been stripped.

In my llama book it states that [^\s] is the same as \S and [^\w] is the same as \W. Now, from what I understand so far, the first block of code should have worked, but it didn't.
Why is that?

Is it fair to stick a link to my site here?

Thanks for your patience.

Replies are listed 'Best First'.
Re: ^\s not equal \S?
by davido (Cardinal) on Dec 04, 2003 at 08:46 UTC
    First, quantifiers inside character classes are not seen as quantifiers, but rather, as literal characters to become a part of the character class. Put your * quantifier outside of the character class if that's what you intend.
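To illustrate the point about quantifiers, here is a minimal sketch. The sample string is made up for illustration and is not from the original question. Inside the brackets `*` is an ordinary member of the class; outside them it is a quantifier.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only because it contains literal asterisks.
my $text = "a * b ** c";

# Inside the brackets, '*' is a literal character in the class:
# this deletes every whitespace character and every asterisk.
(my $in_class = $text) =~ s/[\s*]//g;

# Outside the brackets, '*' is a quantifier: this deletes runs of
# whitespace only and leaves the asterisks in place.
(my $outside = $text) =~ s/\s*//g;

print "in class: [$in_class]\n";   # in class: [abc]
print "outside:  [$outside]\n";    # outside:  [a*b**c]
```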

    Next, [\S\W] means anything that's either a non-space character or a non-word character (word characters usually being A-Za-z0-9_). Well, just about everything is either a non-space or a non-word. In fact, since there is no overlap between \s and \w, you've just wiped out the entire line (especially with the /g modifier). Every character I can think of fits either the "not space" or the "not word" category, and thus every character is wiped out.

    The second expression is a negated character class. You still need to get rid of those * quantifiers inside of the square brackets. The negated character class matches any character that is neither a whitespace character nor a word character. That's different. The only characters that are neither are things like commas, quotes, and other punctuation.

    So where your first regex matches everything, and substitutes it with nothing (thus wiping out the string), the second regex matches just characters that are neither space nor word, and substitutes those characters with nothing, leaving you with spaces and words.
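As a sketch of the difference, the two classes can be run side by side on the phrase from the question, with the stray asterisks dropped from the brackets. The variable names `$first` and `$second` are just for illustration.

```perl
use strict;
use warnings;

my $phrase = "This is a test, \"using quotes of 'two different' types.\"";

# Every character is a non-space or a non-word character (or both),
# so this class matches each character and the string is emptied.
(my $first = $phrase) =~ s/[\S\W]//g;

# The negated class matches only characters that are neither
# whitespace nor word characters, i.e. the punctuation.
(my $second = $phrase) =~ s/[^\s\w]//g;

print "first:  [$first]\n";    # first:  []
print "second: [$second]\n";   # second: [This is a test using quotes of two different types]
```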


    Dave

Re: ^\s not equal \S?
by Abigail-II (Bishop) on Dec 04, 2003 at 09:54 UTC
    $ perl -Dr -ce '/[\S*\W*]/'
    Compiling REx `[\S*\W*]'
    size 13 Got 108 bytes for offset annotations.
    first at 1
       1: ANYOF[\0-\377!utf8::IsSpacePerl !utf8::IsWord](13)
      13: END(0)
    stclass `ANYOF[\0-\377!utf8::IsSpacePerl !utf8::IsWord]' minlen 1
    Offsets: [13]
            1[8] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 9[0]
    Omitting $` $& $' support.

    EXECUTING...

    -e syntax OK
    Freeing REx: `"[\\S*\\W*]"'

    Now, pay attention to the 'ANYOF' part. It includes all the ASCII and LATIN-1 characters (it also includes lots of Unicode characters, but that's not important right now).

    Abigail

      Now that's a neat trick. After getting up off the floor following fainting at the very sight of it, I'll have to read perlrun again to understand how you did it. It's high time I dig into perldebug too, I see.

      Thanks for the motivation / lesson. :)


      Dave

        I find '-Dr' far more useful than 'YAPE::Regex::Explain'. The latter just parrots back what it was given, but then in English. '-Dr' shows how perl compiles it. As shown elsewhere in this thread, 'YAPE::Regex::Explain' doesn't notice the overlap between \S and \W, nor even that '*' is mentioned twice. '-Dr' shows what's really going on, although the output is sometimes hard to grok.

        And '-Dr' really shines at runtime, showing how Perl actually matches a regexp.

        Abigail
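A note for anyone trying this at home: the `-Dr` switch only works on a perl built with `-DDEBUGGING`. On an ordinary build, the `re` pragma gives comparable output. The following is a minimal sketch, not what Abigail ran; the sample string is made up, and the exact debug output differs between perl versions.

```perl
use strict;
use warnings;
use re 'debug';   # dump compiled regexes and trace matching on STDERR

# The compiled program for the class is printed when the regex is
# compiled; the step-by-step match trace is printed when it runs.
my $matched = ",a" =~ /[^\s\w]/ ? 1 : 0;

print "matched: $matched\n";
```

The debug output goes to STDERR, so the normal output of the script can still be redirected separately.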

Re: ^\s not equal \S?
by allolex (Curate) on Dec 04, 2003 at 09:18 UTC

    Here's some data to illustrate what is happening. I think davido's explanation really hits the nail on the head, and Anonymous Monk's suggestion of YAPE::Regex::Explain will help you isolate what is happening with your experimentation. BTW, including an asterisk in some of these character classes is redundant; in others, it makes all the asterisks disappear from your input, which is probably not what you intended.

    ladoix% cat 312141.pl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $phrase1 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase1 =~ s/[\S\W]*//g;
    print "P1: [$phrase1]\n";

    my $phrase2 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase2 =~ s/[\S\w]*//g;
    print "P2: [$phrase2]\n";

    my $phrase3 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase3 =~ s/[\s\W]*//g;
    print "P3: [$phrase3]\n";

    my $phrase4 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase4 =~ s/[\s\w]*//g;
    print "P4: [$phrase4]\n";

    my $phrase5 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase5 =~ s/[^\s\w]*//g;
    print "P5: [$phrase5]\n";

    my $phrase6 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase6 =~ s/[^\S\w]*//g;
    print "P6: [$phrase6]\n";

    my $phrase7 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase7 =~ s/[^\s\W]*//g;
    print "P7: [$phrase7]\n";

    my $phrase8 = "This is a test, \"using quotes of 'two different' types.\"";
    $phrase8 =~ s/[^\S\W]*//g;
    print "P8: [$phrase8]\n";

    ladoix% perl 312141.pl
    P1: []
    P2: [         ]
    P3: [Thisisatestusingquotesoftwodifferenttypes]
    P4: [,"''."]
    P5: [This is a test using quotes of two different types]
    P6: [Thisisatest,"usingquotesof'twodifferent'types."]
    P7: [   , "   ' ' ."]
    P8: [This is a test, "using quotes of 'two different' types."]

    --
    Allolex

Re: ^\s not equal \S?
by Anonymous Monk on Dec 04, 2003 at 08:50 UTC
    use YAPE::Regex::Explain;
    print YAPE::Regex::Explain->new(qr/[\S*\W*]/)->explain;
    __END__
    The regular expression:

    (?-imsx:[\S*\W*])

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      [\S*\W*]               any character of: non-whitespace (all but
                             \n, \r, \t, \f, and " "), '*', non-word
                             characters (all but a-z, A-Z, 0-9, _), '*'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    20031204 Edit by Corion: Changed PRE tags to CODE tags

Re: ^\s not equal \S?
by Anonymous Monk on Dec 04, 2003 at 15:04 UTC
    Welcome to Boolean Algebra!
    [^\s\w] = NOT (space OR word)
    [\S\W]  = (NOT space) OR (NOT word)
    (NOT space) OR (NOT word) = NOT (space AND word)
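These identities can be checked by brute force. The sketch below is restricted to the ASCII range as a simplifying assumption, and the flag names are illustrative only. It tests each character against both classes and against the plain \s and \w escapes.

```perl
use strict;
use warnings;

my ($all_match_first, $second_agrees) = (1, 1);

for my $code (0 .. 127) {
    my $ch    = chr $code;
    my $space = $ch =~ /\s/ ? 1 : 0;
    my $word  = $ch =~ /\w/ ? 1 : 0;

    # [\S\W] = (NOT space) OR (NOT word) = NOT (space AND word).
    # No character is both a space and a word character, so this
    # class should match every character.
    $all_match_first = 0 unless $ch =~ /[\S\W]/;

    # [^\s\w] = NOT (space OR word).
    my $negated = $ch =~ /[^\s\w]/ ? 1 : 0;
    my $expect  = ($space || $word) ? 0 : 1;
    $second_agrees = 0 unless $negated == $expect;
}

print "[\\S\\W] matches every ASCII character: $all_match_first\n";
print "[^\\s\\w] equals NOT (space OR word):  $second_agrees\n";
```

Both flags should come out as 1: the first class is the whole character set, while the second is only what is left after removing spaces and word characters.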
      Well, I finally did it. I ++'d an anonymous post. Good answer. I was going to say something like:

      [^\s\w] means neither spaces nor words, or, everything but spaces and words, where
      [\S\W] means non-spaces AND non-words, and since a space is a non-word, the meaning is similar to "non-spaces and spaces".

      However, ^\s and \S should be the same, so...

      $phrase =~ s/[^\s*\w*]//g;
      would do the same thing as...
      $phrase =~ s/[\S]|[\W]|[^*]//g;

      Update: No, it wouldn't. :: smacks self in head ::

        Wait a minute.... I just tried your example and it still wipes out the string. It's like a nice big OR isn't it? So isn't s/[\S]|[\W]//g really the same as s/[\S\W]//g?

        Edit: A negative demark for pointing this out? Bleh, now that isn't exactly fair. Oh well, I still haven't quite worked out what benefits voting would have.
        Anyhow, I've fiddled with the regexes and the response from delirium still doesn't quite fit with what everything else tells me. Plugging the two different regexes into the script yields completely different results. And given the examples by anon, that naturally makes sense.


Re: ^\s not equal \S?
by SavannahLion (Pilgrim) on Dec 04, 2003 at 18:08 UTC
    Oooohh, I get it now.
