Re^3: phrase match

by AnomalousMonk (Bishop)
on Dec 13, 2009 at 12:20 UTC ( #812581=note )

in reply to Re^2: phrase match
in thread phrase match

An effective variable-length look-behind is available in Perl 5.10 via the \K special escape. The following compiles:
    my $rx = qr/(?:^| )\K($phrases_re)(?= |$)/;
but whether it serves the OPer's true needs is another question.
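
As a hedged sketch of how \K could be put to work in a substitution: whatever is matched before \K is still required, but it is left out of the replaced span, so the leading space (or start of string) does not have to be captured and put back. The sample sentence and phrase list are the ones used elsewhere in this thread; the variable $marked is only illustrative.

```perl
use strict;
use warnings;
use 5.010;    # \K requires Perl 5.10 or later

# Sample data as used elsewhere in this thread.
my $sentence   = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases    = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

# Text matched before \K must be present but is excluded from the
# replaced span, so the leading space survives untouched.
my $rx = qr/(?:^| )\K($phrases_re)(?= |$)/;

# Work on a copy ($marked is a hypothetical name) and wrap each phrase in #.
(my $marked = $sentence) =~ s/$rx/#$1#/g;
say $marked;    # kinase #inhibitor# #SET6# activates #p16(INK4A)# in cell-wall.
```

Because the trailing boundary is still a lookahead, adjacent phrases such as "inhibitor SET6" are both picked up in a single pass.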

Re^4: phrase match
by ambrus (Abbot) on Dec 13, 2009 at 12:58 UTC

    That is useful sometimes, but here it's not needed, because a lookahead is enough.

    Run this:

    use warnings;
    my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
    my @phrases = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
    my $phrases_re = join '|', map { quotemeta } @phrases;
    $sentence =~ s/(^| )($phrases_re)(?= |$)/$1#$2#/g;
    print $sentence, "\n";

    You get the output

    kinase #inhibitor# #SET6# activates #p16(INK4A)# in cell-wall.

    Update: There are ways to do this kind of thing without lookaheads or lookbehinds, just as a curiosity. Replace the substitution statement above with either of the following. The first runs the substitution twice:

    $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

    The second doubles every space, substitutes once, then halves the spaces again:

    use 5.010;
    given ($sentence) {
        s/ /  /g;
        s/(^| )($phrases_re)( |$)/$1#$2#$3/g;
        s/  / /g;
    }

    Update: One more alternative is below.

    my %phrase;
    $phrase{$_}++ for @phrases;
    my @sentence = split /( +)/, $sentence;
    for (@sentence) {
        $phrase{$_} and $_ = "#" . $_ . "#";
    }
    $sentence = join "", @sentence;

    Update: Oh, let's not forget this one either.

    $sentence =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;
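
The double negative in (?<![^ ]) can be read as "not preceded by a non-space character". That holds after a space and also at the very start of the string, so it stands in for (^| ) while remaining a fixed-width look-behind. A small sketch to check the three cases, using made-up input ($text and $marked are hypothetical names):

```perl
use strict;
use warnings;

# Hypothetical input: the phrase appears at the start of the string,
# after a space, and glued to the end of another word.
my $text = 'SET6 binds SET6 but not preSET6';

# (?<![^ ]) fails only when a previous character exists and is not a space.
(my $marked = $text) =~ s/(?<![^ ])(SET6)(?= |$)/#$1#/g;
print "$marked\n";    # #SET6# binds #SET6# but not preSET6
```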

      Thanks for pointing out the error in my ‘fixed’ code!

      $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

      I wanted to point out a non-error in your correction above, since it took me a minute to understand what its purpose was. If you just did the global replacement without the for modifier, you'd have the same problem that Crackers2 pointed out with my original: overlapping matches wouldn't be handled, because the leading space of the trailing match would already have been gobbled up as the trailing space of the leading match. If I'm understanding correctly, the for 0, 1 just makes another pass to pick up any matches that were missed this way.
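
A minimal sketch of that difference, assuming a shortened sentence and just two phrases ($once and $twice are illustrative names): one pass misses the second of two adjacent phrases, and the second pass picks it up.

```perl
use strict;
use warnings;

my $phrases_re = join '|', map { quotemeta } ('inhibitor', 'SET6');

# One pass: the space after "inhibitor" is consumed by the first match,
# so "SET6" has no leading space left to match against.
my $once = 'kinase inhibitor SET6 activates';
$once =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g;

# Two passes: the second pass rescans the whole string and finds "SET6".
my $twice = 'kinase inhibitor SET6 activates';
$twice =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

print "$once\n";     # kinase #inhibitor# SET6 activates
print "$twice\n";    # kinase #inhibitor# #SET6# activates
```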
