http://www.perlmonks.org?node_id=763851

cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I've been reviewing an old script I wrote that extracts groups of words. The current code I have for getting words in pairs is:-
my $text = "hello to all the perl monks";
while ($text =~ /\b([A-Za-z'\-]+ [A-Za-z'\-]+)\b/g) {
    print "$1\n";
}#while
I've just found a bug as this isn't doing what I expected. I thought the output would be:-
hello to
to all
all the
the perl
perl monks
But the output is actually:-
hello to
all the
perl monks
I was about to just run the regexp twice, removing the first word, but this seemed like a nasty fix. I'm guessing there is a better way to get the result I want?

Thanks in advance

Lyle

Replies are listed 'Best First'.
Re: Regexp matching words, not doing what I expect
by moritz (Cardinal) on May 13, 2009 at 18:11 UTC
    The /g modifier prevents you from matching overlapping pieces of text: each match resumes where the previous one ended, and your pattern has already consumed both words of the pair.

    A workaround is to only match the first word normally, and match the second one in a look-ahead:

    my $text = "hello to all the perl monks";
    while ($text =~ /\b([A-Za-z'\-]+) (?=([A-Za-z'\-]+))\b/g) {
        print "$1 $2\n";
    }

    (Gives the desired output).

    The key is that look-ahead groups (?=...) match but don't consume any characters, so the position of the next match is not affected by what that group matched. See perlre for details, or "Mastering Regular Expressions" by J. Friedl.
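
To make the "match but don't consume" point concrete, here is a minimal sketch that runs the same pattern and records pos() after each match. The sub names word_pairs and match_positions are made up for illustration and are not part of the original reply.

```perl
use strict;
use warnings;

# Collect overlapping word pairs. The second word is captured inside a
# look-ahead, so it is not consumed and can start the next match.
sub word_pairs {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /\b([A-Za-z'\-]+) (?=([A-Za-z'\-]+))\b/g) {
        push @pairs, "$1 $2";
    }
    return @pairs;
}

# Record where each /g match ends. pos() lands just after the space,
# i.e. at the start of the second word, not after it.
sub match_positions {
    my ($text) = @_;
    my @positions;
    while ($text =~ /\b([A-Za-z'\-]+) (?=([A-Za-z'\-]+))\b/g) {
        push @positions, pos($text);
    }
    return @positions;
}

print "$_\n" for word_pairs("hello to all the perl monks");
```

For the sample text, each match ends at the offset where the next word begins, which is why the following iteration can pick that word up as the first half of the next pair.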

      Thanks Moritz, look-aheads are new to me. This knowledge is going to prove very useful thanks :)
Re: Regexp matching words, not doing what I expect
by graff (Chancellor) on May 14, 2009 at 03:11 UTC
    You should consider ditching the regex approach and using a module: Text::Ngrams looks like the right one for you (not to be confused with Text::Ngram -- note the "singular" -- which only works on character ngrams, not word ngrams).

    If you have some strange compulsion not to use a module, you should still consider ditching the regex -- split the string into an array of words and use a for loop to output your word pairs:

    my @words = split " ", $text;
    print "$words[$_-1] $words[$_]\n" for ( 1 .. $#words );
    (updated to fix bone-headed error in for loop)
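
Since the original script extracts groups of words, the split-and-loop idea above can be sketched for groups of any size. The helper name word_ngrams and its $n parameter are hypothetical, introduced only for this illustration. Note that split " " keeps punctuation attached to words, unlike the [A-Za-z'\-] character class in the regex versions.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of $n consecutive
# whitespace-separated words from $text.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = split " ", $text;
    my @ngrams;
    # The range is empty when there are fewer than $n words.
    for my $i (0 .. $#words - $n + 1) {
        push @ngrams, join " ", @words[$i .. $i + $n - 1];
    }
    return @ngrams;
}

print "$_\n" for word_ngrams("hello to all the perl monks", 2);
```

Calling word_ngrams($text, 3) would return overlapping triples in the same way.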
Re: Regexp matching words, not doing what I expect
by JavaFan (Canon) on May 13, 2009 at 19:36 UTC
    use 5.010;    # enables say
    $_ = "hello to all the perl monks";
    /\b(([A-Za-z'\-]+) (?-1))\b(?{ say $1 })(*FAIL)/;
    __END__
    hello to
    to all
    all the
    the perl
    perl monks
      Or without the extended regex features of 5.10 (basically moritz's approach, but with only one capture):
      >perl -wMstrict -le
      "my $text = q{hello to 'all' this perl-monk's friends};
       my $wchar = qr{ [A-Za-z'-] }xms;
       my $word = qr{ (?<! $wchar) $wchar+ }xms;
       while ($text =~ m{ (?= ($word \s+ $word)) }xmsg) {
         print $1
       }
      "
      hello to
      to 'all'
      'all' this
      this perl-monk's
      perl-monk's friends
        The 5.10 features can easily be avoided: the (?-1) can be replaced by just copying the [A-Za-z'-] part, and the (*FAIL) can be replaced with (?!).
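
A sketch of what those two substitutions might look like, with the pairs collected into an array instead of printed so that say is not needed either. The sub name overlapping_pairs and the package variable @pairs are assumptions made for this example. A package variable is used because code blocks in (?{ }) did not close over lexicals reliably in older perls.

```perl
use strict;
use warnings;

# Package variable filled from inside the (?{ }) code block.
our @pairs;

# Hypothetical helper: force the regex engine to try every start position
# by failing with (?!) after each candidate pair has been recorded.
sub overlapping_pairs {
    my ($text) = @_;
    local @pairs = ();
    # (?-1) is written out as a second copy of the word pattern,
    # and (*FAIL) is replaced with the empty negative look-ahead (?!).
    $text =~ /\b([A-Za-z'\-]+ [A-Za-z'\-]+)\b(?{ push @pairs, $1 })(?!)/;
    return @pairs;
}

print "$_\n" for overlapping_pairs("hello to all the perl monks");
```

The match itself always fails; the pairs are a side effect of the engine backtracking through every possible start position.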
Re: Regexp matching words, not doing what I expect
by generator (Pilgrim) on May 13, 2009 at 18:05 UTC
    Remove the space after the plus sign and run it again.

    UPDATE: Ignore my reply, I was obviously caffeine-deprived and drifted from the OP's intent.

    <><

    generator