Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: phrase match

by JadeNB (Chaplain)
on Dec 13, 2009 at 00:06 UTC ( #812559=note: print w/replies, xml ) Need Help??


in reply to phrase match

It seems that your description of the problem almost writes the regex itself, namely,

qr/(?<=^| )($phrases_re)(?= |$)/
This says, literally, “one of the phrases in $phrases_re, preceded by a space or the beginning of the string” (which you didn't specify, but I assume you meant) “and followed by a space or the end of the string.”

UPDATE: Thanks to Crackers2 for pointing out that my original version, qr/(?:^| )($phrases_re)(?: |$)/, didn't work correctly, and that look-around would fix it.
UPDATE 2: Oops, I should have tested—as ambrus observed, this one doesn't work, either, for the silly reason that it doesn't compile. :-) That post and its descendants have some solutions.

Replies are listed 'Best First'.
Re^2: phrase match
by Crackers2 (Parson) on Dec 13, 2009 at 00:49 UTC

    That has two problems:

    1) Because you don't capture the space-or-start/end-of-line, the result will be missing some spaces:

    kinase inhibitor#SET6#activates#p16(INK4A)#in cell-wall.
    This can be fixed by using something like
    $sentence =~ s/(^| )($phrases_re)( |$)/$1\#$2\#$3/g;

    2) Because the spaces are part of the match, it won't be able to match patterns if they're consecutive in the source string. i.e. if you add 'activates' to the list of phrases, it won't notice it because the space preceding it has been eaten by the match for SET6. Solving this probably involves some simple lookahead/lookbehind logic to grab the spaces instead of actually matching them, but I've never been good at those so I don't have the actual regex for it.

      Here's an approach that seems to satisfy the OPer's (somewhat vaguely expressed, and with the inferred qualifications noted by others, and including a sentence ending with a period) requirements (needs Perl 5.10  \K regex enhancement):

      >perl -wMstrict -le "my @phrases = ( 'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell', ); my $delim = qr{ \. | \A \s* | \s+ | \s* \z }xms; my $phrase = join '|', reverse sort map quotemeta, @phrases; my $mark = qq{\x23}; for my $s (@ARGV) { print '--------------'; print $s; $s =~ s{ $delim \K ($phrase) (?= $delim) } {$mark$1$mark}xmsg; print $s; } " "cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6 +." "kinase tor tor SET6" "tor tor SET6 kinase" "tor tor SET6" "kinase tor tor SET6." "tor tor SET6 kinase." "tor tor SET6." "kinase inhibitor" "kinase inhibitor." -------------- cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6. #cell# kinase inhibitor #SET6# activates #p16(INK4A)# in cell-wall #to +r SET6#. -------------- kinase tor tor SET6 kinase #tor# #tor SET6# -------------- tor tor SET6 kinase #tor# #tor SET6# kinase -------------- tor tor SET6 #tor# #tor SET6# -------------- kinase tor tor SET6. kinase #tor# #tor SET6#. -------------- tor tor SET6 kinase. #tor# #tor SET6# kinase. -------------- tor tor SET6. #tor# #tor SET6#. -------------- kinase inhibitor kinase inhibitor -------------- kinase inhibitor. kinase inhibitor.

      (Note:  "\x23" is the  "#" character. Have to do this because of a peculiarity of my command line 'editor'.)

      If Perl version 5.10 is not available, use
          s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
      as the substitution regex (tested).

      Of course, a lot more testing is recommended!

      The  reverse in the
          my $phrase = join '|',reverse sort map quotemeta, @phrases;
      statement causes the ordered alternation to match the longest phrase substring.

      See also Regexp::Assemble and related modules for other (and perhaps better) ways to compile the  $phrase regex.

Re^2: phrase match
by ambrus (Abbot) on Dec 13, 2009 at 09:46 UTC

    This fixed version won't work, it gives the error Variable length lookbehind not implemented in regex.

      A way around that is to use an alternation of look-behinds ...

      qr/(?x) (?: (?<= \s ) | (?<= ^ ) ) ( $phrases_re ) (?= \s | $ )/

      ... although it is debateable whether this is clearer than your suggestions. In general I prefer look-arounds to replacing text with unaltered captures but that's just me.

      Cheers,

      JohnGG

      An effective variable length look-behind is available in Perl 5.10 with the  \K special escape. The following compiles
          my $rx = qr/(?:^| )\K($phrases_re)(?= |$)/;
      but whether it serves the OPer's true needs is another question.

        That is useful sometimes, but here it's not needed, because a lookahead is enough.

        Run this:

        use warnings; $sentence='kinase inhibitor SET6 activates p16(INK4A) in cell-wall.'; my @phrases = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A +)', 'cell'); my $phrases_re = join '|', map { quotemeta } @phrases; $sentence =~ s/(^| )($phrases_re)(?= |$)/$1#$2#/g; print $sentence, "\n";

        You get the output

        kinase #inhibitor# #SET6# activates #p16(INK4A)# in cell-wall.

        Update: There are ways to do this kind of thing without lookaheads or lookbehinds, just as a curiosity. Replace the substitution statement above with either

        $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;
        or
        use 5.010; given ($sentence) { s/ / /g; s/(^| )($phrases_re)( |$)/$1# +$2#$3/g; s/ / /g; }

        Update: One more alternative is below.

        my %phrase; $phrase{$_}++ for @phrases; my @sentence = split /( +)/, $sentence; for (@sentence) { $phrase{$_} and $_ = "#" . $_ . "#"; }; $sentence = join "", @sentence;

        Update: Oh, let's not forget this one either.

        $sentence =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://812559]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2021-05-16 00:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Perl 7 will be out ...





    Results (151 votes). Check out past polls.

    Notices?