PerlMonks  

Regexp matching words, not doing what I expect

by cosmicperl (Chaplain)
on May 13, 2009 at 17:54 UTC ( [id://763851]=perlquestion )

cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I've been reviewing an old script I wrote that extracts groups of words. The current code I have for getting words in pairs is:-
    my $text = "hello to all the perl monks";
    while ($text =~ /\b([A-Za-z'\-]+ [A-Za-z'\-]+)\b/g) {
        print "$1\n";
    } # while
I've just found a bug as this isn't doing what I expected. I thought the output would be:-
    hello to
    to all
    all the
    the perl
    perl monks
But the output is actually:-
    hello to
    all the
    perl monks
I was about to just run the regexp twice, removing the first word, but this seemed like a nasty fix. I'm guessing there is a better way to get the result I want?

Thanks in advance

Lyle

Replies are listed 'Best First'.
Re: Regexp matching words, not doing what I expect
by moritz (Cardinal) on May 13, 2009 at 18:11 UTC
    With the /g modifier, each match starts where the previous one ended, so you can't match overlapping pieces of text this way.

    A workaround is to only match the first word normally, and match the second one in a look-ahead:

    my $text = "hello to all the perl monks";
    while ($text =~ /\b([A-Za-z'\-]+) (?=([A-Za-z'\-]+))\b/g) {
        print "$1 $2\n";
    }

    (Gives the desired output).

    The key is that look-ahead groups (?=...) match, but don't consume any characters, so the position of the next match is not affected by what that group matched. See perlre for details, or "Mastering Regular Expressions" by J. Friedl.
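To make the "match but don't consume" point concrete, here is a minimal sketch that records pos() after every /g match for both the original pattern and the look-ahead version. The helper name match_positions and the variables $word, $consuming and $lookahead are made up for this illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return pos() after each successful /g match.
sub match_positions {
    my ($re, $text) = @_;
    my @positions;
    while ($text =~ /$re/g) {
        push @positions, pos($text);
    }
    return @positions;
}

my $word      = qr/[A-Za-z'\-]+/;
my $consuming = qr/\b($word) ($word)\b/;     # both words are consumed
my $lookahead = qr/\b($word) (?=($word))/;   # second word is only peeked at

my $text = "hello to all the perl monks";

# The consuming pattern jumps two words at a time ...
print join(",", match_positions($consuming, $text)), "\n";   # 8,16,27
# ... while the look-ahead pattern advances one word at a time.
print join(",", match_positions($lookahead, $text)), "\n";   # 6,9,13,17,22
```

With the look-ahead, the next match starts right after the first word and its trailing space, which is why every adjacent pair is seen.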

      Thanks Moritz, look-aheads are new to me. This knowledge is going to prove very useful thanks :)
Re: Regexp matching words, not doing what I expect
by graff (Chancellor) on May 14, 2009 at 03:11 UTC
    You should consider ditching the regex approach and using a module: Text::Ngrams looks like the right one for you (not to be confused with Text::Ngram -- note the "singular" -- which only works on character ngrams, not word ngrams).

    If you have some strange compulsion not to use a module, you should still consider ditching the regex -- split the string into an array of words and use a for loop to output your word pairs:

    my @words = split " ", $text;
    print "$words[$_-1] $words[$_]\n" for ( 1 .. $#words );
    (updated to fix bone-headed error in for loop)
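The split-and-loop idea extends naturally to groups of any size. The following is only a sketch: word_ngrams is a hypothetical name, and note that split " " breaks on whitespace only, so unlike the [A-Za-z'\-] character class in the original regex it does not filter out punctuation or digits.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of $n adjacent words in $text.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = split ' ', $text;    # split on runs of whitespace
    # One slice per starting index; the range is empty if there are
    # fewer than $n words.
    return map { join ' ', @words[$_ .. $_ + $n - 1] } 0 .. @words - $n;
}

print "$_\n" for word_ngrams("hello to all the perl monks", 2);
```

Calling word_ngrams($text, 3) on the same string would give the overlapping triples instead of pairs.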
Re: Regexp matching words, not doing what I expect
by JavaFan (Canon) on May 13, 2009 at 19:36 UTC
    use 5.010;    # enables say
    $_ = "hello to all the perl monks";
    /\b(([A-Za-z'\-]+) (?-1))\b(?{ say $1 })(*FAIL)/;
    __END__
    hello to
    to all
    all the
    the perl
    perl monks
      Or without the extended regex features of 5.10 (basically moritz's approach, but with only one capture):
      >perl -wMstrict -le "my $text = q{hello to 'all' this perl-monk's friends}; my $wchar = qr{ [A-Za-z'-] }xms; my $word = qr{ (?<! $wchar) $wchar+ }xms; while ($text =~ m{ (?= ($word \s+ $word)) }xmsg) { print $1 } "
      hello to
      to 'all'
      'all' this
      this perl-monk's
      perl-monk's friends
        The 5.10 features can easily be avoided: the (?-1) can be replaced by just copying the [A-Za-z'-] part, and the (*FAIL) can be replaced with (?!).
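A sketch of what that substitution might look like, with the character class copied in place of (?-1) and (?!) in place of (*FAIL). The sub name word_pairs and the package array @pairs are made up for this example; the pairs are pushed onto an array instead of printed so the result can be reused, and a package variable is used on the assumption that older perls handle it more reliably than a lexical inside a (?{ ... }) block.

```perl
use strict;
use warnings;

our @pairs;    # hypothetical collector filled from inside the regex

sub word_pairs {
    my ($text) = @_;
    local @pairs = ();
    # (?!) can never match, so after each candidate pair is recorded the
    # engine is forced to backtrack and try the next starting position.
    $text =~ /\b([A-Za-z'\-]+ [A-Za-z'\-]+)\b(?{ push @pairs, $1 })(?!)/;
    my @result = @pairs;    # copy before the local value goes away
    return @result;
}

print "$_\n" for word_pairs("hello to all the perl monks");
```

The overall match always fails; the useful work happens as a side effect of the embedded code block.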
Re: Regexp matching words, not doing what I expect
by generator (Pilgrim) on May 13, 2009 at 18:05 UTC
    Remove the space after the plus sign and run it again.

    UPDATE: Ignore my reply. I was obviously caffeine-deprived and drifted from the OP's intent.

    <><

    generator
