Re^2: phrase match

by Crackers2 (Parson)
on Dec 13, 2009 at 00:49 UTC

in reply to Re: phrase match
in thread phrase match

That has two problems:

1) Because you don't capture the space-or-start/end-of-line, the result will be missing some spaces:

kinase inhibitor#SET6#activates#p16(INK4A)#in cell-wall.
This can be fixed by using something like
$sentence =~ s/(^| )($phrases_re)( |$)/$1\#$2\#$3/g;

2) Because the spaces are part of the match, it can't match phrases that are consecutive in the source string. For example, if you add 'activates' to the list of phrases, it won't be matched, because the space preceding it has already been consumed by the match for SET6. Solving this probably involves some simple lookahead/lookbehind logic to check for the spaces without actually consuming them, but I've never been good at those, so I don't have the actual regex for it.
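
A minimal sketch of that lookaround idea, not a tested solution for the OP's real data. It assumes phrases are delimited only by whitespace or the ends of the string; the sub name mark_phrases and the sample phrase list are made up for illustration. (?<!\S) and (?!\S) assert "no non-space character here" without consuming anything, so the space between two consecutive phrases stays available to both matches:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap each whole phrase in '#'.
sub mark_phrases {
    my ($sentence, @phrases) = @_;
    my $mark = '#';

    # Longest phrases first, so a phrase is tried before any of its prefixes.
    my $phrases_re = join '|',
        map { quotemeta($_) } sort { length($b) <=> length($a) } @phrases;

    # The lookarounds test for a neighbouring space (or a string edge)
    # without consuming it, so adjacent phrases can both match.
    $sentence =~ s/(?<!\S)($phrases_re)(?!\S)/$mark$1$mark/g;
    return $sentence;
}

print mark_phrases(
    'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.',
    'SET6', 'activates', 'p16(INK4A)', 'cell',
), "\n";
# prints: kinase inhibitor #SET6# #activates# #p16(INK4A)# in cell-wall.
```

Under these assumptions a phrase immediately followed by punctuation (e.g. "SET6.") is not marked; the lookahead would have to be widened to allow for that.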

Re^3: phrase match
by AnomalousMonk (Bishop) on Dec 13, 2009 at 01:31 UTC

    Here's an approach that seems to satisfy the OP's requirements. Those were somewhat vaguely expressed, so I've assumed the qualifications inferred by others in this thread and also allowed for a sentence ending with a period. It needs the \K regex enhancement introduced in Perl 5.10:

    >perl -wMstrict -le
    "my @phrases = (
       'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell',
       );
     my $delim  = qr{ \. | \A \s* | \s+ | \s* \z }xms;
     my $phrase = join '|', reverse sort map quotemeta, @phrases;
     my $mark   = qq{\x23};
     for my $s (@ARGV) {
       print '--------------';
       print $s;
       $s =~ s{ $delim \K ($phrase) (?= $delim) }
              {$mark$1$mark}xmsg;
       print $s;
       }
    "
    "cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6."
    "kinase tor tor SET6"  "tor tor SET6 kinase"  "tor tor SET6"
    "kinase tor tor SET6."  "tor tor SET6 kinase."  "tor tor SET6."
    "kinase inhibitor"  "kinase inhibitor."
    --------------
    cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6.
    #cell# kinase inhibitor #SET6# activates #p16(INK4A)# in cell-wall #tor SET6#.
    --------------
    kinase tor tor SET6
    kinase #tor# #tor SET6#
    --------------
    tor tor SET6 kinase
    #tor# #tor SET6# kinase
    --------------
    tor tor SET6
    #tor# #tor SET6#
    --------------
    kinase tor tor SET6.
    kinase #tor# #tor SET6#.
    --------------
    tor tor SET6 kinase.
    #tor# #tor SET6# kinase.
    --------------
    tor tor SET6.
    #tor# #tor SET6#.
    --------------
    kinase inhibitor
    kinase inhibitor
    --------------
    kinase inhibitor.
    kinase inhibitor.

    (Note: "\x23" is the "#" character. I have to write it this way because of a peculiarity of my command-line 'editor'.)

    If Perl version 5.10 is not available, use
        s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
    as the substitution regex (tested).
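
    For anyone who wants to compare the two substitutions side by side, here is a small script-file sketch. The sub names mark_with_k and mark_with_capture and the shortened phrase list are mine, chosen for illustration; $delim, $phrase and $mark are built as in the one-liner. The first sub needs Perl 5.10 or later because of \K:

```perl
use strict;
use warnings;

my @phrases = ('tor', 'tor SET6', 'SET6');   # shortened list, for illustration
my $delim   = qr{ \. | \A \s* | \s+ | \s* \z }xms;
my $phrase  = join '|', reverse sort map quotemeta, @phrases;
my $mark    = '#';

# Perl 5.10+: \K drops the delimiter from the matched text, so only the
# phrase itself is replaced.
sub mark_with_k {
    my ($s) = @_;
    $s =~ s{ $delim \K ($phrase) (?= $delim) }{$mark$1$mark}xmsg;
    return $s;
}

# Pre-5.10: capture the delimiter and put it back in the replacement.
sub mark_with_capture {
    my ($s) = @_;
    $s =~ s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
    return $s;
}

for my $s ('kinase tor tor SET6.', 'tor tor SET6 kinase') {
    print mark_with_k($s), "\n";
    print mark_with_capture($s), "\n";
}
```

    On these two inputs both subs should print the same marked string; that is only a spot check, not proof that the variants agree on every input.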

    Of course, a lot more testing is recommended!

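    The reverse in the
        my $phrase = join '|',reverse sort map quotemeta, @phrases;
    statement sorts each longer phrase ahead of any shorter phrase that is a prefix of it (e.g. 'tor SET6' before 'tor'). Because Perl tries the alternatives of an alternation from left to right, the regex then matches the longest phrase available at a given position.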
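
    A small sketch of why the order matters, using just two of the phrases. The helper names alternation and first_match are hypothetical, written only for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: build the alternation with or without the reverse.
sub alternation {
    my ($use_reverse, @phrases) = @_;
    my @quoted = map { quotemeta($_) } @phrases;
    my @sorted = sort @quoted;
    @sorted = reverse @sorted if $use_reverse;
    return join '|', @sorted;
}

# Hypothetical helper: return what the alternation matches first in $text.
sub first_match {
    my ($alt, $text) = @_;
    return $text =~ /($alt)/ ? $1 : undef;
}

my @phrases = ('tor', 'tor SET6');

# Plain sort puts the prefix first, so the shorter alternative wins.
print first_match(alternation(0, @phrases), 'tor SET6'), "\n";   # tor

# Reverse sort puts the longer phrase first, so it gets tried first.
print first_match(alternation(1, @phrases), 'tor SET6'), "\n";   # tor SET6
```

    Note that this is not a sort by length; it only guarantees that a phrase comes before its own prefixes, which is what the alternation needs here.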

    See also Regexp::Assemble and related modules for other (and perhaps better) ways to compile the  $phrase regex.
