PerlMonks  

Re^2: stripped punctuation

by thealienz1 (Pilgrim)
on Oct 06, 2005 at 20:20 UTC (#498036)


in reply to Re: stripped punctuation
in thread stripped punctuation

After looking at your regexp I took to simplifying my needs with:

$word =~ s/^[^\w\d]+(.*?)[^\w\d]+$/$1/;

My intention is to remove everything that is not a letter or digit up to the first letter, then keep everything up to the last letter or digit. When I look at it, it makes sense, but in my testing it does not work.

Update

It works on the simple example I gave for 'Wilmer!'. I was running a word count with a script as the input, and the odd results I was seeing came from the syntax in the script. I apologize.

Re^3: stripped punctuation
by fishbot_v2 (Chaplain) on Oct 06, 2005 at 20:50 UTC

    Except you want to strip punctuation from the beginning or end. The above regex only works if there is punctuation at both beginning and end.
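
    To make that concrete, here is a small sketch that runs the regex from the parent post against three inputs. The sub name strip_both_ends and the sample strings are my own, chosen only for illustration.

    ```perl
    use strict;
    use warnings;

    # The regex from the parent post. It requires at least one non-word
    # character at BOTH ends, so any other input is left untouched.
    sub strip_both_ends {
        my ($word) = @_;
        $word =~ s/^[^\w\d]+(.*?)[^\w\d]+$/$1/;
        return $word;
    }

    # Hypothetical samples: punctuation at both ends, end only, start only.
    for my $sample ('"Wilmer!"', 'Wilmer!', '"Wilmer') {
        printf "%-10s => %s\n", $sample, strip_both_ends($sample);
    }
    ```

    Only the first sample is changed (to Wilmer); the other two come back as they went in, because the substitution does not match at all.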

    If removing any trailing/leading punctuation is in fact your goal, what about something like:

    use strict;
    use warnings;

    my $word = 'Wilmer",';
    $word =~ s/^ \W*?           # ignore any leading punc
               ( \w .*? )       # swallow everything lazily
               (?: \W+ )? $     # ignore any trailing punc
              /$1/x;
    print $word;

    Update: Mind you, at that point, a much simpler regex will likely serve you better in terms of speed and readability:

    $word =~ s/(?:^\W+)|(?:\W+$)//g;

    Final update - benchmark:

                     Rate capture non_capture
    capture       16561/s      --        -28%
    non_capture   22861/s     38%          --

    The second suggestion is about 30% faster, on average.
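
    If you want to reproduce a comparison like this yourself, something along these lines should do it. This is only a sketch using the core Benchmark module, not necessarily the script that produced the table; the sample input and the sub names capture and non_capture are assumptions, and the rates will differ on your machine.

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $input = 'Wilmer",';    # assumed sample word

    # First suggestion: capture the core of the word and substitute it back.
    sub capture {
        my $word = shift;
        $word =~ s/^\W*?(\w.*?)(?:\W+)?$/$1/;
        return $word;
    }

    # Second suggestion: delete leading or trailing runs of non-word characters.
    sub non_capture {
        my $word = shift;
        $word =~ s/(?:^\W+)|(?:\W+$)//g;
        return $word;
    }

    # A negative count tells cmpthese to run each sub for at least
    # that many CPU seconds, then print a comparison table.
    cmpthese(-1, {
        capture     => sub { capture($input) },
        non_capture => sub { non_capture($input) },
    });
    ```

    Both subs work on a copy of the input, so every iteration starts from the same unstripped string.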

    Additionally, \w doesn't mean what you think it means.

      I did basically your second regexp there in two steps. I will try yours, though. I am curious about the difference in speed between them. Of course, I am also wondering what you mean by \w not meaning what I think it means.

        from perlre:

        A "\w" matches a single alphanumeric character (an alphabetic character, or a decimal digit) or "_"...

        Thus your earlier use of [^\w\d] had the set of digits in it twice, which suggested to me that you thought that \w means [A-Za-z].

        [^\w\d] works, but is redundant and equivalent to \W
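
        A quick way to convince yourself of the equivalence is to test both classes against every ASCII character. This is just a sketch; the variable name @differences is mine.

        ```perl
        use strict;
        use warnings;

        # Collect every ASCII code point where [^\w\d] and \W disagree.
        my @differences = grep {
            my $char = chr($_);
            ($char =~ /[^\w\d]/) xor ($char =~ /\W/);
        } 0 .. 127;

        print @differences
            ? "classes differ at: @differences\n"
            : "[^\\w\\d] and \\W match the same ASCII characters\n";
        ```

        Since every \d character is also a \w character, the list of differences comes back empty.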

        Update: You asked about the speed difference between two passes and one pass:

        s/(?:^\W+)|(?:\W+$)//g;
        # versus
        s/\W+$//g;
        s/^\W+//g;

        # my unscientific benchmark
                       Rate single_pass two_pass
        single_pass 15829/s          --     -11%
        two_pass    17737/s         12%       --

        Doing it in two passes seems to be about 10-15% faster.

        \w means "any alphanumeric character or underscore," so [^\w\d] in your regex is a bit redundant. For ASCII text, \w can be replaced by [a-zA-Z0-9_], so you're effectively writing [^a-zA-Z0-9_0-9] in the regexen above.

        Also note that since \w includes the underscore you're matching more than what you say you want.
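
        If underscores should count as punctuation too, one option is to spell the class out instead of using \W. A minimal sketch, assuming ASCII-only input; the sub names strip_nonword and strip_nonalnum are made up for the example.

        ```perl
        use strict;
        use warnings;

        # Strips with \W, so underscores survive because \w includes "_".
        sub strip_nonword {
            my ($word) = @_;
            $word =~ s/^\W+//;
            $word =~ s/\W+$//;
            return $word;
        }

        # Strips anything that is not an ASCII letter or digit,
        # which treats "_" as punctuation as well.
        sub strip_nonalnum {
            my ($word) = @_;
            $word =~ s/^[^A-Za-z0-9]+//;
            $word =~ s/[^A-Za-z0-9]+$//;
            return $word;
        }

        print strip_nonword('_Wilmer_!'),  "\n";    # _Wilmer_
        print strip_nonalnum('_Wilmer_!'), "\n";    # Wilmer
        ```

        Which one you want depends on whether identifiers like _foo should keep their underscores in your word count.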
