regex to remove all non a-z and spaces

coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

Re: regex to remove all non a-z and spaces by graff (Chancellor) on May 16, 2005 at 04:07 UTC
Yet another way, using the "tr" operator instead of "s///" (assuming you meant to retain all digits, not just 8 and 9). `tr/0-9A-Za-z \t\n\r//cd;` It's a little more tedious than s///, because you can't use handy shortcuts like "\s" as a cover term for all whitespace, or "\d" for all digits, but there's a good chance that if speed is an issue, it would go faster than s///. The "c" at the end means "apply replacements to the complement of characters specified on the left side", and "d" means "delete characters for which there is no replacement character on the right side". Since there are no replacement characters at all, then everything that is not a letter, digit or whitespace will be deleted. (You did say you wanted to retain only letters, digits and spaces, so maybe you really don't want "\t\n\r" in the expression.)
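To make the "c" and "d" modifiers concrete, here is a minimal sketch that runs the `tr` expression above next to an equivalent `s///`. The sample string and the variable names are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original post.
my $text = "Hello, World! 2005\tok?";

# c = complement the search list; d = delete matched characters that
# have no replacement. The right-hand side is empty, so every character
# outside the list is deleted.
(my $clean = $text) =~ tr/0-9A-Za-z \t\n\r//cd;

# The same filter written as a substitution, for comparison.
(my $via_s = $text) =~ s/[^0-9A-Za-z \t\n\r]//g;

print "$clean\n";
print $clean eq $via_s ? "same result\n" : "different result\n";
```

Copying into a new variable first, as in `(my $clean = $text) =~ ...`, keeps the original string intact, since both `tr` and `s///` modify their target in place.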
Re: regex to remove all non a-z and spaces by davido (Cardinal) on May 16, 2005 at 05:41 UTC
Don't forget that a-zA-Z is not going to work for all alphabets in all languages. This may not be a problem for your application, but if you use locales, it is a potential issue and POSIX is your friend: `s/[^[:alpha:]89\s]//g;` Dave
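A rough sketch of the difference, under one assumption that should be stated loudly: it uses a Unicode character string rather than `use locale`, so it illustrates the same idea (letters outside a-zA-Z) without depending on which locales are installed. The sample string is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample input. \x{144} is LATIN SMALL LETTER N WITH ACUTE,
# written as an escape so the source file stays plain ASCII.
my $text = "Gda\x{144}sk 89!";

# ASCII ranges only: the accented letter is treated as "not a letter".
(my $ascii_only = $text) =~ s/[^a-zA-Z89\s]//g;

# POSIX class: on a Unicode string, [:alpha:] also matches the accented letter.
(my $posix = $text) =~ s/[^[:alpha:]89\s]//g;

binmode STDOUT, ':encoding(UTF-8)';
print "$ascii_only\n$posix\n";
```

With byte strings and `use locale`, what counts as `[:alpha:]` depends on the active locale, so results there may differ from this sketch.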
Re: regex to remove all non a-z and spaces by mrborisguy (Hermit) on May 16, 2005 at 02:42 UTC
How about: `s/[^a-zA-Z0-9 ]//g` Have you ever worked with regexes before? Try this: perlre
Re: regex to remove all non a-z and spaces by davidrw (Prior) on May 16, 2005 at 02:46 UTC
`$string =~ s/[^a-z8-9 ]+//gi` If 8-9 was a typo in OP, then change it to 0-9.
Re^2: regex to remove all non a-z and spaces by mrborisguy (Hermit) on May 16, 2005 at 02:52 UTC
A question on implementation... would the + make the regex any faster, since it matches more before it substitutes? Not that it matters at all in a case like this... just wondering, to file away in my "pieces of worthless trivia" section of the brain.
Re^3: regex to remove all non a-z and spaces by davidrw (Prior) on May 16, 2005 at 02:57 UTC
I was thinking the same thing as I posted that, and I don't know. I was actually thinking of posting it as a question... I guess I'll do that now. Update: I forked this to a new thread: regex internals: quantifiers vs global match
Re^4: regex to remove all non a-z and spaces by northwind (Hermit) on May 16, 2005 at 06:08 UTC
Re^2: regex to remove all non a-z and spaces by coldfingertips (Pilgrim) on May 16, 2005 at 02:52 UTC
I'm not good with regexes to any degree but doesn't ^ just mean to match at the beginning of the string? How is this removing everything but a-z and numbers?
Re^3: regex to remove all non a-z and spaces by mrborisguy (Hermit) on May 16, 2005 at 02:53 UTC
Usually, ^ does mean at the beginning, but the [ and ] make a character class, and a ^ at the beginning of a character class means "not any of these". Update: Oddly enough, it doesn't explicitly say that in perlre. However, it does say: "You can negate the [::] character classes by prefixing the class name with a '^'. This is a Perl extension."
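The two meanings of ^ can be seen side by side in a small sketch. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "abc-123";   # hypothetical sample input

# Outside brackets, ^ anchors the match to the start of the string.
my $starts_with_a = $text =~ /^a/ ? 1 : 0;

# As the first character inside brackets, ^ negates the class:
# [^a-z] matches any single character that is NOT a lowercase letter.
my @non_letters = $text =~ /[^a-z]/g;

print "starts with a: $starts_with_a\n";
print "non-letters: @non_letters\n";
```

So `s/[^a-z0-9 ]//g` reads as "replace every character that is not a letter, digit or space with nothing", which is why it strips everything else.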
Re^4: regex to remove all non a-z and spaces by coldfingertips (Pilgrim) on May 16, 2005 at 02:57 UTC
Re^5: regex to remove all non a-z and spaces by mrborisguy (Hermit) on May 16, 2005 at 03:01 UTC
Re^4: regex to remove all non a-z and spaces by hv (Prior) on May 16, 2005 at 11:02 UTC
Re: regex to remove all non a-z and spaces by TedPride (Priest) on May 16, 2005 at 03:03 UTC
EDIT: Testing with a 1000-character randomly generated string for 100000 iterations:
`$str =~ s/[^a-zA-Z8-9 ]//g;` : 30 seconds
`$str =~ s/[^a-z8-9 ]//ig;` : 29 seconds
`$str =~ s/[^a-zA-Z8-9 ]+//g;` : 11 seconds
I wouldn't have expected the i flag solution to be more efficient, but there you go. There are significant savings from +. The most efficient regex therefore is: `$str =~ s/[^a-z8-9 ]+//ig;`
EDIT: You're right, tr/// is much faster.
Re^2: regex to remove all non a-z and spaces by bmann (Priest) on May 16, 2005 at 04:09 UTC
Did you try `tr/a-zA-Z89 //cd;`? I would expect it to be much faster than `s///`.
Update - quick benchmark:

    use strict;
    use warnings;
    use Benchmark qw/cmpthese/;

    # Build a 1000-character random string of printable ASCII.
    my $str;
    $str .= chr(rand( 96 ) + 32) for 1 .. 1000;
    my $d;

    # Each sub copies $str, strips the unwanted characters, and returns the copy.
    sub trans  { ($d = $str) =~ tr/a-zA-Z89 //cd; $d }
    sub justg  { ($d = $str) =~ s/[^a-zA-Z8-9 ]//g; $d }
    sub ig     { ($d = $str) =~ s/[^a-z8-9 ]//ig; $d }
    sub igplus { ($d = $str) =~ s/[^a-z8-9 ]+//ig; $d }
    sub gplus  { ($d = $str) =~ s/[^a-zA-Z8-9 ]+//g; $d }

    #print join "\n", trans, justg, ig, igplus, gplus;

    cmpthese ( 100_000, {
        trans  => \&trans,
        justg  => \&justg,
        ig     => \&ig,
        gplus  => \&gplus,
        igplus => \&igplus,
    });
    __END__

Output:

              Rate  justg     ig igplus  gplus  trans
    justg   4442/s     --    -2%   -19%   -22%   -93%
    ig      4529/s     2%     --   -18%   -20%   -92%
    igplus  5499/s    24%    21%     --    -3%   -91%
    gplus   5674/s    28%    25%     3%     --   -91%
    trans  60168/s  1255%  1229%   994%   960%     --

I ran it with strings of 10 char, 100, then 1000. The longer the string, the bigger the difference between tr/// and s///.
Re^2: regex to remove all non a-z and spaces by coldfingertips (Pilgrim) on May 16, 2005 at 04:19 UTC
This regex isn't exactly working for me. `$cleaned_search =~ s/[^a-z0-9 ]+//gi;` I keep getting files named "this+is a test.html", "the+farmer saves the day" and "i+hate this stuff.html". I thought it was the + so I removed it and the regex is still adding a + sign in my string when it wasn't there before. I also can't add any weird characters because it errors out when it tries to create the file. For example, if I used \ it said it failed to open on a closed filehandle or something. So it seems this regex isn't working at all.
Re^3: regex to remove all non a-z and spaces by muntfish (Chaplain) on May 16, 2005 at 08:56 UTC
There's no way that statement on its own would add a + sign. I think you need to give us a bit more context in order to find the cause of the problem: a few more lines of your code; what your input looks like; what your output looks like; and what you expect your output to look like... Then we'll be able to help you out, hopefully.