PerlMonks  

phrase match

by newbio (Beadle)
on Dec 12, 2009 at 23:25 UTC ( [id://812554]=perlquestion )

newbio has asked for the wisdom of the Perl Monks concerning the following question:

    my @phrases = ( 'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell' );
    my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';

    # Desired output: 'kinase inhibitor #SET6# activates #p16(INK4A)# in cell-wall'

    # I am trying to solve it like this:
    my $phrases_re = join '|', map { quotemeta } @phrases;
    $sentence =~ s/($phrases_re)/\#$1\#/g;

The problem is that the above solution also selects partial matches in the sentence. However, I want to only tag 'complete' phrases/words, i.e. those that are separated by spaces on either side (exception being last word in the sentence). Any suggestions?

Thanks.

UPDATE: put code in tags; thanks JadeNB for noting that.

Replies are listed 'Best First'.
Re: phrase match
by Corion (Patriarch) on Dec 12, 2009 at 23:36 UTC

    You could append a space to the target string and match each of your words with a space appended, or you could use the \b word boundary marker (see perlre) if your "words" fit that description.

Re: phrase match
by JadeNB (Chaplain) on Dec 13, 2009 at 00:06 UTC

    It seems that your description of the problem almost writes the regex itself, namely,

    qr/(?<=^| )($phrases_re)(?= |$)/
    This says, literally, “one of the phrases in $phrases_re, preceded by a space or the beginning of the string” (which you didn't specify, but I assume you meant) “and followed by a space or the end of the string.”

    UPDATE: Thanks to Crackers2 for pointing out that my original version, qr/(?:^| )($phrases_re)(?: |$)/, didn't work correctly, and that look-around would fix it.
    UPDATE 2: Oops, I should have tested—as ambrus observed, this one doesn't work, either, for the silly reason that it doesn't compile. :-) That post and its descendants have some solutions.

      That has two problems:

      1) Because you don't capture the space-or-start/end-of-line, the result will be missing some spaces:

      kinase inhibitor#SET6#activates#p16(INK4A)#in cell-wall.
      This can be fixed by using something like
      $sentence =~ s/(^| )($phrases_re)( |$)/$1\#$2\#$3/g;

      2) Because the spaces are part of the match, it cannot match phrases that are adjacent in the source string. For example, if you add 'activates' to the list of phrases, it won't be tagged, because the space preceding it has already been consumed by the match for SET6. Solving this probably involves some simple lookahead/lookbehind logic that checks for the spaces without actually matching them, but I've never been good at those, so I don't have the actual regex for it.

        Here's an approach that seems to satisfy the OP's requirements. Those were somewhat vaguely expressed, so it also covers the qualifications inferred by others and a sentence ending with a period. It needs the \K regex enhancement from Perl 5.10:

            >perl -wMstrict -le
            "my @phrases = (
               'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell',
               );
             my $delim = qr{ \. | \A \s* | \s+ | \s* \z }xms;
             my $phrase = join '|', reverse sort map quotemeta, @phrases;
             my $mark = qq{\x23};
             for my $s (@ARGV) {
               print '--------------';
               print $s;
               $s =~ s{ $delim \K ($phrase) (?= $delim) }
                      {$mark$1$mark}xmsg;
               print $s;
               }
            "
            "cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6."
            "kinase tor tor SET6"
            "tor tor SET6 kinase"
            "tor tor SET6"
            "kinase tor tor SET6."
            "tor tor SET6 kinase."
            "tor tor SET6."
            "kinase inhibitor"
            "kinase inhibitor."
            --------------
            cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6.
            #cell# kinase inhibitor #SET6# activates #p16(INK4A)# in cell-wall #tor SET6#.
            --------------
            kinase tor tor SET6
            kinase #tor# #tor SET6#
            --------------
            tor tor SET6 kinase
            #tor# #tor SET6# kinase
            --------------
            tor tor SET6
            #tor# #tor SET6#
            --------------
            kinase tor tor SET6.
            kinase #tor# #tor SET6#.
            --------------
            tor tor SET6 kinase.
            #tor# #tor SET6# kinase.
            --------------
            tor tor SET6.
            #tor# #tor SET6#.
            --------------
            kinase inhibitor
            kinase inhibitor
            --------------
            kinase inhibitor.
            kinase inhibitor.

        (Note: "\x23" is the "#" character. I have to write it this way because of a peculiarity of my command-line 'editor'.)

        If Perl version 5.10 is not available, use
            s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
        as the substitution regex (tested).

        Of course, a lot more testing is recommended!

        The reverse in the
            my $phrase = join '|',reverse sort map quotemeta, @phrases;
        statement puts a longer phrase ahead of any shorter phrase that is a prefix of it (e.g. 'tor SET6' before 'tor'), so the ordered alternation tries the longest phrase first.

        See also Regexp::Assemble and related modules for other (and perhaps better) ways to compile the  $phrase regex.
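
A variation on the ordering point above: instead of a reversed lexical sort, the phrases can be ordered explicitly by descending length, which makes the intent a little more visible. This is only a sketch; build_phrase_re is a hypothetical helper name, not something from the thread or from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: join the phrases into one alternation,
# longest phrase first, so that 'tor SET6' is tried before 'tor'.
sub build_phrase_re {
    my @phrases = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b }
        @phrases;
    return qr/(?:$alt)/;
}

my $re = build_phrase_re('tor', 'SET6', 'tor SET6');
my ($first) = 'tor SET6' =~ /($re)/;
print "$first\n";    # prints "tor SET6"
```

Either ordering should behave the same for literal phrases, since two literal alternatives can only both match at the same position when one is a prefix of the other.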

      This fixed version won't work; it gives the error "Variable length lookbehind not implemented in regex".

        A way around that is to use an alternation of look-behinds ...

        qr/(?x) (?: (?<= \s ) | (?<= ^ ) ) ( $phrases_re ) (?= \s | $ )/

        ... although it is debatable whether this is clearer than your suggestions. In general I prefer look-arounds to replacing text with unaltered captures but that's just me.

        Cheers,

        JohnGG
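
The same "whitespace or string edge on both sides" condition can also be written with negative look-arounds, which are fixed-width and so avoid the variable-length look-behind error. The following is a minimal sketch, assuming whitespace is the only delimiter that matters; tag_phrases is a hypothetical helper name, and 'activates' is added to the phrase list only to show two adjacent phrases being tagged.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: wrap each whole phrase in '#'.
sub tag_phrases {
    my ($sentence, @phrases) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @phrases;   # longest first
    # (?<!\S): not preceded by a non-space, i.e. start of string or whitespace
    # (?!\S) : not followed by a non-space, i.e. whitespace or end of string
    $sentence =~ s/(?<!\S)($alt)(?!\S)/#$1#/g;
    return $sentence;
}

print tag_phrases(
    'kinase inhibitor SET6 activates p16(INK4A) in cell-wall',
    'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell', 'activates',
), "\n";
# prints: kinase inhibitor #SET6# #activates# #p16(INK4A)# in cell-wall
```

Because neither look-around consumes the space, the space after SET6 is still available as the left-hand delimiter of 'activates'. A sentence-final period would still block a match on the last word, so that case needs separate handling.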

        An effective variable length look-behind is available in Perl 5.10 with the  \K special escape. The following compiles
            my $rx = qr/(?:^| )\K($phrases_re)(?= |$)/;
        but whether it serves the OPer's true needs is another question.
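
To see what the \K form buys over capturing both delimiters, here is a small side-by-side sketch. The phrase list and sentence are made up for illustration, and the code assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # \K and say need Perl 5.10 or later

my @phrases    = ('SET6', 'MAP', 'H1 inhibitor');
my $phrases_re = join '|', map { quotemeta($_) } @phrases;
my $sentence   = 'kinase inhibitor SET6 MAP H1 inhibitor activates';

# Both delimiters captured: the space after SET6 is consumed,
# so MAP has no leading space left to match against.
(my $consumed = $sentence) =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g;
say $consumed;    # kinase inhibitor #SET6# MAP #H1 inhibitor# activates

# \K drops the leading delimiter from the match and the look-ahead
# leaves the trailing one in place, so adjacent phrases are all tagged.
(my $tagged = $sentence) =~ s/(?:^| )\K($phrases_re)(?= |$)/#$1#/g;
say $tagged;      # kinase inhibitor #SET6# #MAP# #H1 inhibitor# activates
```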
Re: phrase match
by dsheroh (Monsignor) on Dec 13, 2009 at 11:03 UTC
    For a small dataset such as this, I'd look closely at \b to see whether your word boundary conditions match \b's well enough for that to work for you. If so, just change your replacement expression to $sentence =~ s/\b($phrases_re)\b/\#$1\#/g; and you should be set.

    For a larger dataset, or if \b doesn't quite work for you, take a look at Regexp::Assemble, which will both build you a more efficient regex and provide the anchor_word_begin and anchor_word_end settings, which may or may not deal more effectively with your "only match complete items" requirement.
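
Whether \b fits can be checked quickly against the sample data. The sketch below uses a shortened phrase list chosen for illustration; it suggests two places where \b and "separated by spaces" disagree for this kind of input.

```perl
use strict;
use warnings;

my @phrases    = ('SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta($_) } @phrases;

my $sentence = 'SET6 activates p16(INK4A) in cell-wall';
(my $with_b = $sentence) =~ s/\b($phrases_re)\b/#$1#/g;
print "$with_b\n";
# prints: #SET6# activates p16(INK4A) in #cell#-wall
```

'p16(INK4A)' is missed because there is no word boundary between ')' and the following space, and 'cell' is tagged inside 'cell-wall' because '-' is a non-word character. If either case matters, a whitespace-based delimiter is probably the safer choice.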

Re: phrase match
by newbio (Beadle) on Dec 13, 2009 at 19:30 UTC

    Thank you all Monks for your comments.

    Here is my reworked solution. It seems to work on my sample sentences, but I am not sure whether it will work in all situations. If you see any glitch in this solution, please let me know.

    Thanks once again.

    my @phrases = ( 'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell', 'MAP', 'H1 inhibitor' );
    my $sentence = 'kinase inhibitor SET6 MAP H1 inhibitor activates p16(INK4A) in cell-wall';

    # Double every run of whitespace (two spaces in the replacement), in the
    # phrases and in the sentence, so that adjacent phrases each have their
    # own leading and trailing space to match.
    for (my $i=0;$i<=$#phrases;$i++) {
        $phrases[$i]=~s/\s+/  /g;
    }
    my $phrases_re = join '|', map { quotemeta } @phrases;

    $sentence=~s/\s+/  /g;
    $sentence=' '.$sentence.' ';
    $sentence =~ s/\s($phrases_re)\s/ \#$1\# /g;

    # Collapse the doubled whitespace again and trim both ends.
    $sentence=~s/\s+/ /g;
    $sentence =~ s/^\s+|\s+$//g;
    print "$sentence\n";

    #Output: 'kinase inhibitor #SET6# #MAP# #H1 inhibitor# activates #p16(INK4A)# in cell-wall'

      I'm not sure I understand why the poor sentence must be mauled so relentlessly in your final approach, but it's fine with me if it works for you.

      I note that the approach you use does not seem to take account of longest versus shortest matches:  'tor SET6' can never match because  'tor' precedes it in the ordered alternation. Perhaps this is your intent, but be aware that as the code stands, longest-shortest matching behavior depends on the order in which phrases appear in the phrase list. (This is touched on in paragraph 5 of Re^3: phrase match.) See example below.

      I also note there is still no provision for a 'sentence' ending in a period, although again, perhaps this contingency will never arise. Example also below.

          >perl -wMstrict -le
          "my @phrases = (
             'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell',
             'MAP', 'H1 inhibitor', 'foo bar', 'foo', 'bar',
             );
           for (my $i=0;$i<=$#phrases;$i++) { $phrases[$i]=~s/\s+/  /g; }
           my $phrases_re = join '|', map { quotemeta } @phrases;
           for my $sentence (@ARGV) {
             print '------------------';
             print $sentence;
             $sentence=~s/\s+/  /g;
             $sentence=' '.$sentence.' ';
             $sentence =~ s/\s($phrases_re)\s/ \x23$1\x23 /g;
             $sentence=~s/\s+/ /g;
             $sentence =~ s/^\s+|\s+$//g;
             print $sentence;
             }
          "
          "kinase inhibitor SET6 MAP H1 inhibitor activates p16(INK4A) in cell-wall"
          "tor tor SET6 SET6"
          "SET6 tor SET6"
          "tor tor SET6 SET6."
          "foo bar bar"
          "foo foo bar bar"
          ------------------
          kinase inhibitor SET6 MAP H1 inhibitor activates p16(INK4A) in cell-wall
          kinase inhibitor #SET6# #MAP# #H1 inhibitor# activates #p16(INK4A)# in cell-wall
          ------------------
          tor tor SET6 SET6
          #tor# #tor# #SET6# #SET6#
          ------------------
          SET6 tor SET6
          #SET6# #tor# #SET6#
          ------------------
          tor tor SET6 SET6.
          #tor# #tor# #SET6# SET6.
          ------------------
          foo bar bar
          #foo bar# #bar#
          ------------------
          foo foo bar bar
          #foo# #foo bar# #bar#

        All very good points AnomalousMonk!

        >I'm not sure I understand why the poor sentence must be mauled so relentlessly in your final approach, but it's fine with me if it works for you.

        I was just experimenting with a few more things.

        >I note that the approach you use does not seem to take account of longest versus shortest matches: 'tor SET6' can never match because 'tor' precedes it in the ordered alternation. Perhaps this is your intent, but be aware that as the code stands, longest-shortest matching behavior depends on the order in which phrases appear in the phrase list. (This is touched on in paragraph 5 of Re^3: phrase match.) See example below.

        Very good point. Yes, my 'phrase list' will be in decreasing order of phrase string length.

        >I also note there is still no provision for a 'sentence' ending in a period, although again, perhaps this contingency will never arise. Example also below.

        Yes, I will have the period removed in a preprocessing step.

        Thanks a lot again.

Node Type: perlquestion [id://812554]
Approved by Corion