Re^2: phrase match

by Crackers2 (Parson)
on Dec 13, 2009 at 00:49 UTC

in reply to Re: phrase match
in thread phrase match

That has two problems:

1) Because you don't capture the space (or start/end of line) and put it back in the replacement, the result will be missing some spaces:

kinase inhibitor#SET6#activates#p16(INK4A)#in cell-wall.
This can be fixed by using something like
$sentence =~ s/(^| )($phrases_re)( |$)/$1\#$2\#$3/g;

2) Because the spaces are part of the match, it can't match phrases that are consecutive in the source string. For example, if you add 'activates' to the list of phrases, it won't be marked, because the space preceding it has already been eaten by the match for SET6. Solving this probably involves some simple lookahead/lookbehind logic to check for the spaces instead of actually matching them, but I've never been good at those, so I don't have the actual regex for it.
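
One way the lookaround idea in point 2 might look is sketched below. This is an assumption about a possible regex, not code from the thread: the sub names `mark_capturing` and `mark_lookaround` and the short phrase list are made up for illustration. The negative lookarounds assert "no non-space character here", which is true at a space and also at the start or end of the string, and they consume nothing.

```perl
use strict;
use warnings;

# Illustrative phrase list (an assumption, not the OP's actual data).
my @phrases    = ('SET6', 'activates', 'p16(INK4A)');
my $phrases_re = join '|', map quotemeta, @phrases;

# Capturing version from point 1: the surrounding spaces are consumed
# by the match, so two adjacent phrases cannot both be marked.
sub mark_capturing {
    my ($sentence) = @_;
    $sentence =~ s/(^| )($phrases_re)( |$)/$1\#$2\#$3/g;
    return $sentence;
}

# Lookaround version: require "start of string or a space" before the
# phrase and "a space or end of string" after it, without consuming them.
sub mark_lookaround {
    my ($sentence) = @_;
    $sentence =~ s/(?<![^ ])($phrases_re)(?![^ ])/\#$1\#/g;
    return $sentence;
}

my $text = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall';
print mark_capturing($text),  "\n";   # 'activates' is skipped
print mark_lookaround($text), "\n";   # all three phrases are marked
```

With this input the capturing version leaves 'activates' unmarked, while the lookaround version marks all three phrases. Note that this sketch only treats a literal space as a delimiter; a trailing period would need to be added to the lookahead.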

Re^3: phrase match
by AnomalousMonk (Chancellor) on Dec 13, 2009 at 01:31 UTC

    Here's an approach that seems to satisfy the OPer's requirements. Those were somewhat vaguely expressed, so I've included the qualifications inferred by others and allowed for a sentence ending with a period. It needs the \K regex enhancement from Perl 5.10:

        >perl -wMstrict -le
        "my @phrases = (
            'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell',
            );
         my $delim  = qr{ \. | \A \s* | \s+ | \s* \z }xms;
         my $phrase = join '|', reverse sort map quotemeta, @phrases;
         my $mark   = qq{\x23};
         for my $s (@ARGV) {
           print '--------------';
           print $s;
           $s =~ s{ $delim \K ($phrase) (?= $delim) }
                  {$mark$1$mark}xmsg;
           print $s;
           }
        "
        "cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6."
        "kinase tor tor SET6"   "tor tor SET6 kinase"   "tor tor SET6"
        "kinase tor tor SET6."  "tor tor SET6 kinase."  "tor tor SET6."
        "kinase inhibitor"      "kinase inhibitor."
        --------------
        cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6.
        #cell# kinase inhibitor #SET6# activates #p16(INK4A)# in cell-wall #tor SET6#.
        --------------
        kinase tor tor SET6
        kinase #tor# #tor SET6#
        --------------
        tor tor SET6 kinase
        #tor# #tor SET6# kinase
        --------------
        tor tor SET6
        #tor# #tor SET6#
        --------------
        kinase tor tor SET6.
        kinase #tor# #tor SET6#.
        --------------
        tor tor SET6 kinase.
        #tor# #tor SET6# kinase.
        --------------
        tor tor SET6.
        #tor# #tor SET6#.
        --------------
        kinase inhibitor
        kinase inhibitor
        --------------
        kinase inhibitor.
        kinase inhibitor.

    (Note: "\x23" is the "#" character. I have to write it this way because of a peculiarity of my command line 'editor'.)

    If Perl version 5.10 is not available, use
        s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
    as the substitution regex (tested).
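
For anyone who would rather run this from a script file than a command line, the two substitutions could be wrapped up roughly as follows. This is only a sketch: the sub names `mark_with_k` and `mark_with_capture` are hypothetical, and the phrase list and `$delim` are copied from the one-liner above. In a script file the `#` can be written directly, so the `\x23` workaround isn't needed.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

my @phrases = ('kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $delim   = qr{ \. | \A \s* | \s+ | \s* \z }xms;
my $phrase  = join '|', reverse sort map quotemeta, @phrases;
my $mark    = '#';

# 5.10+ form: \K keeps the leading delimiter out of the replaced text.
sub mark_with_k {
    my ($s) = @_;
    $s =~ s{ $delim \K ($phrase) (?= $delim) }{$mark$1$mark}xmsg;
    return $s;
}

# Pre-5.10 form: capture the leading delimiter and put it back.
sub mark_with_capture {
    my ($s) = @_;
    $s =~ s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
    return $s;
}

for my $s ('kinase tor tor SET6.', 'kinase inhibitor') {
    print mark_with_k($s),       "\n";
    print mark_with_capture($s), "\n";
}
```

On the sample sentences from the one-liner, both subs should produce the same marked-up strings.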

    Of course, a lot more testing is recommended!

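    The reverse in the
        my $phrase = join '|', reverse sort map quotemeta, @phrases;
    statement puts any phrase ahead of its own prefixes (e.g. 'tor SET6' before 'tor'), so the ordered alternation tries the longer phrase first and matches the longest phrase substring.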
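
A small sketch of why the ordering matters: Perl's alternation takes the first alternative that matches at a given position, not the longest. The helper `first_match` and the two-phrase list are hypothetical and used only for illustration.

```perl
use strict;
use warnings;

my @phrases = ('tor', 'tor SET6');

# Alternation in list order: 'tor' is tried first and wins.
my $unsorted = join '|', map quotemeta, @phrases;

# Reverse-sorted alternation: 'tor SET6' is tried before its prefix 'tor'.
my $sorted = join '|', reverse sort map quotemeta, @phrases;

# Return the text matched by the first successful alternative.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ /($re)/ ? $1 : undef;
}

print first_match($unsorted, 'tor SET6'), "\n";   # tor
print first_match($sorted,   'tor SET6'), "\n";   # tor SET6
```

A plain string sort places a phrase before any longer phrase that starts with it, so reversing that order is enough here. Sorting by descending length would be another way to get the same effect.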

    See also Regexp::Assemble and related modules for other (and perhaps better) ways to compile the $phrase regex.
