http://www.perlmonks.org?node_id=812583


in reply to Re^3: phrase match
in thread phrase match

That is useful sometimes, but here it's not needed, because a lookahead is enough.

Run this:

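use warnings;

my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases  = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');

# Escape regex metacharacters in each phrase, then build one alternation.
my $phrases_re = join '|', map { quotemeta } @phrases;

# A phrase must start at the beginning of the string or after a space,
# and the lookahead requires a space or the end of the string after it
# without consuming that space.
$sentence =~ s/(^| )($phrases_re)(?= |$)/$1#$2#/g;
print $sentence, "\n";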

You get the output

kinase #inhibitor# #SET6# activates #p16(INK4A)# in cell-wall.
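
A side note on the quotemeta in the code above, since one of the phrases contains parentheses. The following is only a small sketch of what the escaping buys you; $phrase and $text are names made up for the illustration.

```perl
use strict;
use warnings;

# $phrase and $text are illustrative names, not from the code above.
my $phrase = 'p16(INK4A)';
my $text   = 'activates p16(INK4A) in';

# Unescaped, the parentheses form a capture group, so the pattern
# looks for the literal text "p16INK4A".
my $raw_matches = $text =~ /$phrase/ ? 1 : 0;

# quotemeta backslash-escapes the parentheses so they match literally.
my $escaped         = quotemeta $phrase;
my $escaped_matches = $text =~ /$escaped/ ? 1 : 0;

print "escaped pattern: $escaped\n";
print "raw: $raw_matches, escaped: $escaped_matches\n";
```

The unescaped pattern fails to match here, while the escaped one matches.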

Update: There are ways to do this kind of thing without lookaheads or lookbehinds, just as a curiosity. Replace the substitution statement above with either

$sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;
or
use 5.010;
given ($sentence) {
    s/ /  /g;                                # double every space
    s/(^| )($phrases_re)( |$)/$1#$2#$3/g;
    s/  / /g;                                # collapse the doubled spaces again
}

Update: One more alternative is below.

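my %phrase;
$phrase{$_}++ for @phrases;

# The capturing group in split keeps the runs of spaces as list elements,
# so the join below restores the original spacing.
my @sentence = split /( +)/, $sentence;
for (@sentence) {
    $phrase{$_} and $_ = "#" . $_ . "#";
}
$sentence = join "", @sentence;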

Update: Oh, let's not forget this one either.

$sentence =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;
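
The (?<![^ ]) above is a double negative: "not preceded by a non-space character", which holds at the start of the string or right after a space. As a rough sketch (the variable names $original, $negated and $spelled_out are mine, and the phrase list is shortened), this compares it with a spelled-out version of the same condition:

```perl
use strict;
use warnings;

my @phrases    = ('inhibitor', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;
my $original   = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';

# Double negative: "not preceded by a non-space character".
(my $negated = $original) =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;

# Spelled out: "at the start of the string, or preceded by a space".
(my $spelled_out = $original) =~ s/(?:^|(?<= ))($phrases_re)(?= |$)/#$1#/g;

print "$negated\n";
print "same result\n" if $negated eq $spelled_out;
```

Because neither assertion consumes the space, adjacent phrases such as inhibitor and SET6 are both marked in a single pass.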

Re^5: phrase match
by JadeNB (Chaplain) on Dec 13, 2009 at 18:29 UTC

    Thanks for pointing out the error in my ‘fixed’ code!

    $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

    I wanted to point out a non-error in your correction above, since it took me a minute to understand its purpose. If you just did the global replacement without the for modifier, you'd have the same problem that Crackers2 pointed out with my original: overlapping matches wouldn't be handled, because the leading space of the trailing match would already have been gobbled up as the trailing space of the leading match. If I'm understanding correctly, the for 0, 1 just makes another pass to pick up any matches that were missed this way.
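
To make the difference concrete, here is a minimal sketch that runs the same substitution once and then twice on a shortened sentence. The variable names $one_pass and $two_pass and the reduced phrase list are assumptions for the example only.

```perl
use strict;
use warnings;

my @phrases    = ('inhibitor', 'SET6');
my $phrases_re = join '|', map { quotemeta } @phrases;

# One pass: the space after "inhibitor" is consumed as the trailing
# delimiter, so it cannot also serve as the leading delimiter of "SET6".
my $one_pass = 'kinase inhibitor SET6 activates';
$one_pass =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g;

# Two passes: the second pass starts over from the beginning of the
# string, so the space before "SET6" is available again.
my $two_pass = 'kinase inhibitor SET6 activates';
$two_pass =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

print "$one_pass\n";
print "$two_pass\n";
```

The single pass leaves SET6 unmarked; the second pass picks it up, and phrases that are already marked are not matched again because they are now preceded by # rather than a space.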