PerlMonks  

Grouped characters inside character class.

by the_0ne (Pilgrim)
on Jun 02, 2006 at 01:10 UTC ( #553195=perlquestion )
the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I have a regex question for you. I have this string of text...
Posted by mad max beyond eggdome on September 04, 2003
Using this regex...
source =~ /^posted by((\w|\s)+)\son\s/i;
I can get "mad max beyond eggdome", which is what I need to pull out. I need to grab everything between "posted by" and the word "on". My problem with that regex is that there could be characters other than \w or \s between "posted by" and (space)on(space), and the word "on" itself could also appear in that text.

So, I tried this...
source =~ /^posted by([^\son\s]+)\son\s/i;
However, a character class is just that, a character. It's not a group. I can't figure out how to group the characters with a NOT.
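To make the point about character classes concrete, here is a minimal sketch (the test strings 'mad' and 'beyond' are picked for illustration only). It shows that the bracketed expression is treated as a set of single characters, so the order and repetition of items inside it do not matter:

```perl
use strict;
use warnings;

my $source = 'Posted by mad max beyond eggdome on September 04, 2003';

# Inside [...] every item is a single-character alternative, so
# [^\son\s] means "any one character that is not whitespace, 'o' or 'n'".
# Listing \s twice adds nothing: [^\son\s] and [^on\s] are the same set.
my $mad_ok    = 'mad'    =~ /^[^\son\s]+$/ ? 1 : 0;   # no o, n or space
my $beyond_ok = 'beyond' =~ /^[^\son\s]+$/ ? 1 : 0;   # contains o and n

# The attempted pattern fails on $source: the first character after
# "posted by" is a space, which the class excludes.
my $matched = $source =~ /^posted by([^\son\s]+)\son\s/i ? 1 : 0;

print "mad: $mad_ok, beyond: $beyond_ok, full pattern: $matched\n";
```

This should print "mad: 1, beyond: 0, full pattern: 0".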

I tried this also...
source =~ /^posted by([^(\son\s)]+)\son\s/i;
Which I thought would group the (space)on(space) in the character class, but that did not work either. How do you group the characters to say "not group of characters"? I want to say...

Start at the beginning, look for the string "posted by" and then gather all characters that are not (space)on(space), until I find the string (space)on(space). I don't think I'm going in the correct direction.
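One way to express "any character, as long as the group (space)on(space) does not start here" is a negative lookahead placed in front of a single-character match. The sketch below is one possible reading of the request, not the only approach; the variable name $name is illustrative.

```perl
use strict;
use warnings;

my $source = 'Posted by mad max beyond eggdome on September 04, 2003';

# (?!\son\s) is a zero-width negative lookahead: it succeeds only at a
# position where the next characters are NOT whitespace, "on", whitespace.
# (?:(?!\son\s).)+ therefore consumes one character at a time and cannot
# step past the first " on ".
my $name;
if ( $source =~ /^posted by\s((?:(?!\son\s).)+)\son\s/i ) {
    $name = $1;
    print qq("$name"\n);
}
```

For this input the capture is "mad max beyond eggdome". Like a non-greedy .*?, it stops at the first (space)on(space), so it behaves the same way on names that themselves contain " on ".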

Thanks for any help, it is greatly appreciated.

Re: Grouped characters inside character class.
by Enlil (Parson) on Jun 02, 2006 at 01:39 UTC
    This works:
    use strict;
    use warnings;

    my $source = 'Posted by mad max beyond eggdome on September 04, 2003';
    if ( $source =~ /^Posted by (.*?) on /i ) {
        print qq("$1") . "\n";
    }
    which matches:
    C:\>perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr/^Posted by (.*?) on /)->explain()"
    The regular expression:

    (?-imsx:^Posted by (.*?) on )

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      ^                        the beginning of the string
    ----------------------------------------------------------------------
      Posted by                'Posted by '
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        .*?                      any character except \n (0 or more times
                                 (matching the least amount possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
       on                      ' on '
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    -enlil

      Thanks enlil for the response. That does work. My only problem is that I've been warned on perlmonks several times against using .*. I guess in this case it would be fine though, because I do want everything grabbed up until the (space)on(space). Maybe I was just trying to be too fancy. :)
      The only issue with this regex (and the poster's original idea as well) is it will not properly capture the username if it contains ' on '. For example:

      my $source = 'Posted by getting on your nerves on September 04, 2003';

      It's probably a good idea to anchor on more than just the ' on ' part like:

      my $source = 'Posted by getting on your nerves on September 04, 2003';
      if ($source =~ /Posted by (.+?) on \w+ \d{2}, \d{4}$/) {
          ...
      }

      Regards

      m.att
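A short sketch comparing the two patterns on the problem input may help. The variable names $short and $full are illustrative, and the date shape (month word, two-digit day, four-digit year) is assumed from the single sample line in the question.

```perl
use strict;
use warnings;

my $source = 'Posted by getting on your nerves on September 04, 2003';

# Non-greedy match stops at the first " on ", truncating the name.
my ($short) = $source =~ /^Posted by (.*?) on /i;

# Anchoring on the date shape forces the match to use the final " on ".
my ($full) = $source =~ /^Posted by (.+?) on \w+ \d{2}, \d{4}$/i;

print qq(lazy only:   "$short"\n);
print qq(date anchor: "$full"\n);
```

The first capture is "getting" and the second is "getting on your nerves". The anchored version will reject lines whose date is formatted differently, which may or may not be what you want.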

        The only issue with this regex (and the poster's original idea as well) is it will not properly capture the username if it contains ' on '

        Noted. So we capture up until the last ' on '.

        use strict;
        use warnings;

        my $source = 'Posted by getting on my nerves on September 04, 2003';
        if ( $source =~ /^Posted by (.*?) on (?!.* on )/i ) {
            print qq("$1") . "\n";
        }

        blokhead is right.. and I will go lick my wounds now.
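For comparison, a greedy quantifier reaches the last ' on ' without any lookahead, because .* first takes as much as it can and then backtracks. This is a sketch of that alternative, not something taken from the posts above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $source = 'Posted by getting on my nerves on September 04, 2003';

# Non-greedy quantifier plus a lookahead that forbids any later " on ".
my ($with_lookahead) = $source =~ /^Posted by (.*?) on (?!.* on )/i;

# Greedy quantifier: .* grabs as much as possible, then backtracks to the
# last " on " in the string.
my ($greedy) = $source =~ /^Posted by (.*) on /i;

print qq(lookahead: "$with_lookahead"\n);
print qq(greedy:    "$greedy"\n);
```

Both captures are "getting on my nerves" for this input.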

Re: Grouped characters inside character class.
by Zaxo (Archbishop) on Jun 02, 2006 at 01:47 UTC

    The word boundary device, \b, is useful for weeding out coincidental inclusion in "beyond". With space expected, you probably don't need it. You can try to match minimal or maximal length with quantifiers. I agree you're headed the wrong direction by trying to exclude characters. Does this do what you want?

    if ($source =~ /posted by (.*?) on /i) {
        # do something with $1
    }

    Your code lacks a sigil on $source, and doesn't really do anything because it's evaluated in void context.

    After Compline,
    Zaxo
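The remark about \b and "beyond" can be illustrated with a small sketch. The three patterns below are variations written for this example, not code from the thread.

```perl
use strict;
use warnings;

my $source = 'Posted by mad max beyond eggdome on September 04, 2003';

# A bare "on" matches inside "beyond", cutting the name short.
my ($loose) = $source =~ /posted by (.*?)on/i;

# \bon\b only matches "on" as a whole word; \s* keeps the space before it
# out of the capture.
my ($bounded) = $source =~ /posted by (.*?)\s*\bon\b/i;

# With literal spaces around "on", the boundary check is not needed.
my ($spaced) = $source =~ /posted by (.*?) on /i;

print qq(loose:   "$loose"\n);
print qq(bounded: "$bounded"\n);
print qq(spaced:  "$spaced"\n);
```

The first capture is "mad max bey"; the other two are "mad max beyond eggdome".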

      Your code lacks a sigil on $source, and doesn't really do anything because it's evaluated in void context.

      Sorry, I actually am coding this in Ruby. However, when I have a problem in other languages, I usually jump right to a perl -e ' code_block;', especially for regexes. And whenever I have any kind of problem that the perlmonks can help me out with, I come right here first...

      Update:

      Yes, it does do what I want. I just get leery when using .*. I've seen monks chastised a few times on perlmonks for that. But in this case, it should be fine. I was just wondering if there was another way.

      Thanks
