PerlMonks  

Re: regex to remove all non a-z and spaces

by TedPride (Priest)
on May 16, 2005 at 03:03 UTC (#457348)


in reply to regex to remove all non a-z and spaces

EDIT: Testing with a 1000-character randomly generated string for 100000 iterations:

$str =~ s/[^a-zA-Z8-9 ]//g; : 30 seconds
$str =~ s/[^a-z8-9 ]//ig; : 29 seconds
$str =~ s/[^a-zA-Z8-9 ]+//g; : 11 seconds

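I wouldn't have expected the i flag solution to be more efficient, but there you go. The significant savings come from +, which removes each run of unwanted characters in one substitution rather than one character at a time. The most efficient regex is therefore: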

$str =~ s/[^a-z8-9 ]+//ig;
EDIT: You're right, tr/// is much faster.
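
To illustrate why + saves work, here is a minimal sketch that counts how many matches each pattern needs on the same string. The sample input and variable names are made up for illustration and are not from the benchmark above.

```perl
use strict;
use warnings;

# Hypothetical sample input; any string with runs of unwanted characters works.
my $input = "Hello, World!!! 89 --- ok";

# Without +, every unwanted character is matched and replaced separately.
(my $one_at_a_time = $input) =~ s/[^a-z8-9 ]//ig;
my $count_single = () = $input =~ /[^a-z8-9 ]/ig;

# With +, each run of unwanted characters is matched and replaced once.
(my $by_runs = $input) =~ s/[^a-z8-9 ]+//ig;
my $count_runs = () = $input =~ /[^a-z8-9 ]+/ig;

print "result:  [$by_runs]\n";
print "matches without +: $count_single, with +: $count_runs\n";
```

Both substitutions produce the same string; only the number of matches the regex engine has to process differs.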

Replies are listed 'Best First'.
Re^2: regex to remove all non a-z and spaces
by bmann (Priest) on May 16, 2005 at 04:09 UTC
    Did you try tr/a-zA-Z89 //cd;? I would expect it to be much faster than s///.
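
For readers unfamiliar with the flags, a small sketch of what /c and /d do in tr///. The sample string is hypothetical, and the character list keeps the 8 and 9 used throughout this thread, so other digits are deleted as well.

```perl
use strict;
use warnings;

my $text = "Perl 5.8 rocks! (89)";   # hypothetical sample input

# /c complements the search list (everything NOT listed),
# /d deletes the matched characters, since no replacement list is given.
(my $kept = $text) =~ tr/a-zA-Z89 //cd;
my $removed = length($text) - length($kept);

print "kept: [$kept], removed: $removed\n";
```

Because tr/// works on a fixed character table rather than running the regex engine, it is plausible that it wins on speed, which is what the benchmark in this reply measures.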

    Update - quick benchmark:

    use strict;
    use warnings;
    use Benchmark qw/cmpthese/;

    my $str;
    $str .= chr(rand( 96 ) + 32) for 1 .. 1000;
    my $d;

    sub trans { ($d = $str) =~ tr/a-zA-Z89 //cd; $d }
    sub justg { ($d = $str) =~ s/[^a-zA-Z8-9 ]//g; $d }
    sub ig { ($d = $str) =~ s/[^a-z8-9 ]//ig; $d }
    sub igplus{ ($d = $str) =~ s/[^a-z8-9 ]+//ig; $d }
    sub gplus { ($d = $str) =~ s/[^a-zA-Z8-9 ]+//g; $d}

    #print join "\n", trans, justg, ig, igplus, gplus;

    cmpthese ( 100_000, {
        trans => \&trans,
        justg => \&justg,
        ig => \&ig,
        gplus => \&gplus,
        igplus => \&igplus,
    });
    __END__
    Output:
              Rate  justg     ig igplus  gplus  trans
    justg   4442/s     --    -2%   -19%   -22%   -93%
    ig      4529/s     2%     --   -18%   -20%   -92%
    igplus  5499/s    24%    21%     --    -3%   -91%
    gplus   5674/s    28%    25%     3%     --   -91%
    trans  60168/s  1255%  1229%   994%   960%     --
    I ran it with strings of 10, 100, and then 1000 characters. The longer the string, the bigger the difference between tr/// and s///.
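
A rough sketch of how the length comparison could be repeated, assuming the same two contenders (tr/// and the s///ig version with +). The random_string helper is hypothetical, and timings will vary by machine and Perl version, so no output is shown.

```perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

# Hypothetical helper: build a random printable-ASCII string of the given length.
sub random_string {
    my ($len) = @_;
    return join '', map { chr(int(rand(95)) + 32) } 1 .. $len;
}

for my $len (10, 100, 1000) {
    my $str = random_string($len);
    print "--- length $len ---\n";
    # A negative count tells Benchmark to run each sub for at least
    # that many CPU seconds instead of a fixed number of iterations.
    cmpthese(-1, {
        trans  => sub { (my $d = $str) =~ tr/a-zA-Z89 //cd; $d },
        igplus => sub { (my $d = $str) =~ s/[^a-z8-9 ]+//ig; $d },
    });
}
```

Copying into a fresh lexical inside each sub keeps the source string intact between iterations, much as the original benchmark does with ($d = $str).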
Re^2: regex to remove all non a-z and spaces
by coldfingertips (Pilgrim) on May 16, 2005 at 04:19 UTC
    This regex isn't exactly working for me.
    $cleaned_search =~ s/[^a-z0-9 ]+//gi;
    I keep getting files named "this+is a test.html", "the+farmer saves the day" and "i+hate this stuff.html".

    I thought it was the + so I removed it, but the regex is still adding a + sign to my string when it wasn't there before. I also can't add any weird characters because it errors out when it tries to create the file. For example, if I used \ it said it failed to open on a closed filehandle or something.

    So it seems this regex isn't working at all.

      There's no way that statement on its own would add a + sign.

      I think you need to give us a bit more context in order to find the cause of the problem: a few more lines of your code; what your input looks like; what your output looks like; and what you expect your output to look like...

      Then we'll be able to help you out, hopefully.
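
As a quick check of the claim above, here is a minimal sketch that runs the same substitution on a made-up input and prints the value before and after. The input string and the name $raw_search are assumptions; only $cleaned_search and the regex come from the post being answered.

```perl
use strict;
use warnings;

# Hypothetical raw input; the real search string is not shown in the thread.
my $raw_search = 'this+is a\\ test!';

(my $cleaned_search = $raw_search) =~ s/[^a-z0-9 ]+//gi;

# Print both values with delimiters so stray characters are easy to spot.
print "before: [$raw_search]\n";
print "after:  [$cleaned_search]\n";

# '+' is not in the allowed set, so it cannot survive the substitution.
print "contains +: ", ($cleaned_search =~ /\+/ ? "yes" : "no"), "\n";
```

If a + still ends up in the file name, it is presumably introduced somewhere other than this statement, which is why printing the value right before the file is opened is a useful next step.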


      s^^unp(;75N=&9I<V@`ack(u,^;s|\(.+\`|"$`$'\"$&\"\)"|ee;/m.+h/&&print$&
