How do I stop this from removing spaces?

by neXussT (Initiate)
on Jun 25, 2010 at 00:56 UTC (#846428)
neXussT has asked for the wisdom of the Perl Monks concerning the following question:

I'm using the following code to remove special characters from a string. It works well, but it also removes spaces, which I don't want. Does anybody know why it does this and how I can get around it?
    $id='TEST TEST';
    print "$id\n";
    @filename_filter=('*','|','<','>','?','/');
    $id =~ s/[@filename_filter]//g;
    print "$id\n";
Thanks ;)

Replies are listed 'Best First'.
Re: How do I stop this from removing spaces?
by toolic (Bishop) on Jun 25, 2010 at 01:27 UTC
    If you set the $LIST_SEPARATOR ($") variable to the empty string (it is a single space by default), your substitution will not remove a space. The array variable is being interpolated as if it were in double quotes inside s/// (see Quote and Quote like Operators):
        $" = '';
        $id='TEST TEST';
        print "$id\n";
        @filename_filter=('*','|','<','>','?','/');
        $id =~ s/[@filename_filter]//g;
        print "$id\n";
        __END__
        TEST TEST
        TEST TEST
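    Changing $" globally affects every later array interpolation in the program. A minimal sketch of limiting the change with local, assuming a hypothetical helper named strip_chars (not part of the original code); it also escapes each character so that ], ^ and - stay literal:

```perl
use strict;
use warnings;

# Hypothetical helper: strip the listed characters from a string.
sub strip_chars {
    my ($id, @chars) = @_;
    return $id unless @chars;          # an empty class [] would not compile

    # Limit the change to this sub; $" reverts to its old value on exit.
    local $" = '';

    # Escape each character so ], ^ and - stay literal inside the class.
    my @escaped = map { quotemeta } @chars;
    $id =~ s/[@escaped]//g;
    return $id;
}

print strip_chars('TEST TE*ST', '*', '|', '<', '>', '?', '/'), "\n";  # prints "TEST TEST"
print "@{[ 1, 2, 3 ]}\n";   # $" is a space again here: prints "1 2 3"
```

    Outside the sub, array interpolation still uses the default single space.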
Re: How do I stop this from removing spaces?
by ikegami (Pope) on Jun 25, 2010 at 03:18 UTC
    For good measure, this adds \Q so it works if you happen to include ^, ] or -.
        my @filename_filter = ('*','|','<','>','?','/');
        my $filename_filter = join '', @filename_filter;
        $id =~ s/[\Q$filename_filter\E]//g;

    That said, it's much, much safer to specify which characters are safe to include than to specify which characters are not safe to include.
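    A minimal sketch of that allow-list approach. The helper name keep_safe and the allowed set (letters, digits, space, dot, underscore, hyphen) are assumptions for illustration, not something from the original post:

```perl
use strict;
use warnings;

# Hypothetical helper: keep only characters on an allow-list.
# The allowed set is an assumption; adjust it to your own naming rules.
sub keep_safe {
    my ($id) = @_;
    # The hyphen is last in the class, so it is literal and not a range.
    $id =~ s/[^A-Za-z0-9 ._-]//g;
    return $id;
}

print keep_safe('TEST <TEST>|?.txt'), "\n";   # prints "TEST TEST.txt"
```

    Because the space is on the allow-list, it survives, and anything not explicitly listed is removed.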

Re: How do I stop this from removing spaces?
by AnomalousMonk (Chancellor) on Jun 25, 2010 at 01:27 UTC

    Scalars and arrays interpolate into regexes by the same rules as into double-quoted strings. Arrays interpolate with a separator string defined by the $" special variable, a single space by default. So the regex looks like
        s/[* | < > ? /]//g
    I.e., the character class contains a space.

    Join the array into a scalar with the empty string as the separator, then use the scalar in the regex.
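    A short sketch of the difference, using the array from the question (the variable names $interpolated, $joined and $cleaned are made up for this example):

```perl
use strict;
use warnings;

my @filename_filter = ('*', '|', '<', '>', '?', '/');

my $interpolated = "@filename_filter";          # joined with $", a space by default
my $joined       = join '', @filename_filter;   # joined with the empty string

print "[$interpolated]\n";   # prints "[* | < > ? /]"
print "[$joined]\n";         # prints "[*|<>?/]"

my $id = 'TEST TEST';
(my $cleaned = $id) =~ s/[$joined]//g;
print "$cleaned\n";          # prints "TEST TEST"
```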

Re: How do I stop this from removing spaces?
by ww (Archbishop) on Jun 25, 2010 at 01:53 UTC
    TIMTOWTDI/Tangential observations (See the on-target discussions of interpolation by toolic, AnomalousMonk and Anonymonk, above):

    Your asterisk, vertical bar and question mark are special chars in a regex when they appear outside a bracketed character class; specifically, the "*" and "?" are quantifiers. Inside [...] they are literal, which is why your original substitution compiles, but if you use each array element as a pattern of its own, the asterisk becomes a quantifier without an antecedent; i.e., without anything to quantify.

    So, another way around this would be to escape those characters that have special meaning in a regex; viz:

        #!/usr/bin/perl
        use strict;
        use warnings;
        # 846423

        my $id='TEST TEST foo|bar*test';
        my @testdata = split(/ /, $id);
        print "Pre-regex: $id\n";
        my @filename_filter=('\*', '\|', '<', '>', '\?', '/');
        # $id =~ s/[@filename_filter]//g;
        for my $filename_filter(@filename_filter) {
            $id =~ s/$filename_filter//g;
        }
        print "post-regex: $id\n\n";

        for my $testdata(@testdata) {
            for (@filename_filter) {
                my $regex = qr($_);
                $testdata =~ s/$regex//;
            }
            print "split&regex: $testdata\n";
        }

        Pre-regex: TEST TEST foo|bar*test
        post-regex: TEST TEST foobartest

        split&regex: TEST
        split&regex: TEST
        split&regex: foobartest

    This appears to satisfy your spec without changing $", the $LIST_SEPARATOR. It IS the long way around, and not merely because my code goes out of its way to be explicit, but perhaps addresses a closely related issue.
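    A shorter variant of the same idea, offered as a sketch rather than a replacement: let quotemeta do the escaping and combine the characters into a single alternation, so one substitution replaces the loop. The names $alternation and $regex are made up for this example:

```perl
use strict;
use warnings;

my @filename_filter = ('*', '|', '<', '>', '?', '/');

# Escape each character, then join with | to form one alternation.
my $alternation = join '|', map { quotemeta } @filename_filter;
my $regex       = qr/$alternation/;

my $id = 'TEST TEST foo|bar*test';
$id =~ s/$regex//g;
print "$id\n";   # prints "TEST TEST foobartest"
```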

    Update: corrected citation to refer to all the preceding, correct discussions of interpolation

Re: How do I stop this from removing spaces?
by chuckbutler (Prior) on Jun 25, 2010 at 01:58 UTC

    There is an implicit join of the list being done when the regular expression is being compiled. This uses the $" (dollar-double-quote) variable as the separator, which defaults to a space. So, the character class that is used in the substitution contains a space, and therefore spaces are removed. Better to do the join yourself:

        use strict;
        use warnings;
        use re 'debug'; # shows whats-what

        my $id='TEST TEST';
        print "$id\n";
        my @filename_filter=('*','|','<','>','?','/');
        my $filename_filter_joined = join '',@filename_filter; #empty separator character
        $id =~ s/[$filename_filter_joined]//g;
        print "$id\n";
        __END__
        ~~Output~~
        TEST TEST
        Compiling REx "[*|<>?/]"
        Final program:
           1: ANYOF[*/<>?|][] (12)
          12: END (0)
        stclass ANYOF[*/<>?|][] minlen 1
        Matching REx "[*|<>?/]" against "TEST TEST"
        Matching stclass ANYOF[*/<>?|][] against "TEST TEST" (9 chars)
        Contradicts stclass... [regexec_flags]
        Match failed
        TEST TEST
        Freeing REx: "[*|<>?/]"

    Good luck. -c

Re: How do I stop this from removing spaces?
by Anonymous Monk on Jun 25, 2010 at 01:28 UTC
    Interpolation, observe
        $ perl -Mre=debug
        TEST TEST
        Compiling REx "[* | < > ? /]"
        Final program:
           1: ANYOF[ */<>?|][] (12)
          12: END (0)
        stclass ANYOF[ */<>?|][] minlen 1
        Matching REx "[* | < > ? /]" against "TEST TEST"
        Matching stclass ANYOF[ */<>?|][] against "TEST TEST" (9 chars)
           4 <TEST> < TEST>          |  1:ANYOF[ */<>?|][](12)
           5 <TEST > <TEST>          | 12:END(0)
        Match successful!
        Matching REx "[* | < > ? /]" against "TEST"
        Matching stclass ANYOF[ */<>?|][] against "TEST" (4 chars)
        Contradicts stclass... [regexec_flags]
        Match failed
        TESTTEST
        Freeing REx: "[* | < > ? /]"

        $ perl -le"print qq!@ARGV!" 1 2 "3 4 5"
        1 2 3 4 5
Re: How do I stop this from removing spaces?
by colwellj (Monk) on Jun 25, 2010 at 01:26 UTC
    I'm not sure exactly why but it seems as if dropping your list in as an array is adding a space somewhere.
    I tried putting your special characters into a string and used that instead and it works fine.
    So that is a workaround for you. Possibly somemonk more knowledgeable might be able to tell why it's not working this way.
Re: How do I stop this from removing spaces?
by Xiong (Hermit) on Jun 28, 2010 at 12:40 UTC

    Slightly off your original question (which has been thoroughly answered anyway):

    1. qr

    Why did you build up your regex from a list, anyway? It's more usual to write something like:

    my $regex = qr/abcde/;

    If you're going to be re-using the same regex, this is a good way to do it. Like all generic quoting mechanisms, you can choose your delimiters.
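    For the character class in this thread, a sketch of that reuse might look like the following. The variable names and the sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Compile the character class once...
my $unwanted = qr{[*|<>?/]};

# ...then reuse it across several strings (sample names are made up).
my @ids = ('TEST TEST', 'foo|bar*test', 'a/b?c');
s/$unwanted//g for @ids;     # each element is modified in place

print "$_\n" for @ids;       # prints "TEST TEST", "foobartest", "abc"
```

    Braces are used as the qr delimiters here so the / inside the class needs no escaping.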

    2. Removing Special Characters

    You are trying to remove special characters, which looks very like you are sanitizing input, perhaps to pass a taint check. By now, you have a way to do that -- but it may still not be the best thing to do.

    It's very easy to let a character slip by which, sooner or later, winds up being parsed by something that gets fouled up, opening a hole to an attacker. It's considered safer not to reject unsafe characters but to test to see that the string in question is as you expect; and only as you expect:

        sub untaint_name { # replace non-word chars with nothing
            my $name = shift;
            $name =~ s/\W//g;
            return $name;
        };

    This does not guarantee correctness or perfect security but it's considered more robust; certainly a bit easier to read and understand.

    If you know you only want lowercase input, lc() it. You might make your database tables and fields all uppercase. Do this after the above regex replacement.

    If you wanted to accept only a numeric input, you might use \D to eliminate all non-numeric chars. By 'numeric', here we mean the digits 0-9; the non-negative integers with or without leading zeros. To deal with various fixed, floating point, or negative formats, you would have to accept (not reject) [-+.E] as well (the hyphen goes first so that it is literal rather than forming a range). You might just want, after your input passes the regex, to further sanitize by $number = 1+ $string -1; (Plain $string + 0 also forces numeric context and is the more common idiom.) If you know your input should lie between certain bounds, test for that.
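    A sketch of the numeric check plus bounds test. It swaps the hand-written regex for looks_like_number from the core Scalar::Util module, which is a substitution on my part; the helper name clean_number and the bounds are assumptions for illustration. It numifies with 0 + $string, which Perl evaluates at run time:

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: accept a value only if it parses as a number
# and lies within the given bounds; otherwise return undef.
sub clean_number {
    my ($string, $min, $max) = @_;
    return unless defined $string && looks_like_number($string);
    my $number = 0 + $string;                          # force numeric context
    return unless $number >= $min && $number <= $max;  # bounds test
    return $number;
}

for my $input ('42', '-3.5', '1e3', 'abc') {
    my $n = clean_number($input, -10, 100);
    print defined $n ? "$input accepted as $n\n" : "$input rejected\n";
}
```

    With these bounds, '42' and '-3.5' are accepted, while '1e3' (out of range) and 'abc' (not a number) are rejected.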

    The key point is to demand what you want, not reject what you don't want.

    - the lyf so short, the craft so long to lerne -
      I found putting the unwanted characters in a s/$string//g; is easier than using an @array. I first thought of using a white list as you mentioned, but I must have had a brain fart, because I could not figure out for the life of me an easy way to do it. I forgot \W would work. Thanks for the tip.
Re: How do I stop this from removing spaces?
by neXussT (Initiate) on Jun 25, 2010 at 21:37 UTC
    Awesome. Thanks guys. I really appreciate you taking the time to help me understand what's going on.
