PerlMonks  

Re^3: Problems searching and highlighting proximity words in a text

by Krambambuli (Deacon)
on May 24, 2010 at 09:20 UTC (#841348)


in reply to Re^2: Problems searching and highlighting proximity words in a text
in thread Problems searching and highlighting proximity words in a text

If you run your code with perl -Dr (assuming your perl interpreter is compiled with debugging enabled), you'll see what I can see now too:

the regexp engine works and works and works...

However, I cannot yet see exactly what the solution is. At first sight, the regexp is simply extremely inefficient: it backtracks heavily when it does _not_ find what it is looking for.
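If your perl was not built with debugging enabled, the re pragma should give a comparable trace on a stock perl. The following is only a minimal sketch: the sample text, the values of $par1, $par2 and $distance, and the pattern itself are made up for illustration and are not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample data; the words are 4 apart, so distance 2 cannot match.
my $content = 'alpha one two three four omega';
my ($par1, $par2, $distance) = ('alpha', 'omega', 2);

my $matched;
{
    # Lexically scoped: only regexps used inside this block are traced.
    # The trace is written to STDERR and can be long.
    use re 'debug';
    $matched = ($content =~ /\b(\Q$par1\E)(\W+(?:\w+\W+){0,$distance})(\Q$par2\E)\b/i) ? 1 : 0;
}

print "matched: $matched\n";
```

Redirecting STDERR to a file (2> trace.txt) makes it easier to see how much work the engine does on a failing match.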

Update.

A work-around to avoid the heavy backtracking when the wanted terms are not to be found in the wanted order might look like

    if ($content =~ /$par2.*$par1/i) {
        if ($content =~ /\b($par2)(\W+(?:\w*\W*){1,$distance})?($par1)\b/i) {
            warn "IF 2";
            my ($par1, $par2, $par3) = ($1, $2, $3);
            $content =~ s/$par1\Q$par2\E$par3/<$tag$class> $par1<\/$tag>$par2<$tag$class> $par3<\/$tag>/gi;
        }
    }

That works for me, but I guess there should be some nicer solutions too.

Update2 Looks like using a regexp like
if ($content =~ /\b($par1)(\W+(\w+\W+){0,$distance})($par2)\b/i) {
works OK and also avoids the excessive backtracking for unsuccessful lookups. You will, however, have to add a $4 and use it instead of $3, because the new inner parentheses introduce an extra capture group.
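To make the idea concrete, here is a small self-contained sketch built around that kind of pattern. The sub name highlight_proximity and its argument order are my own invention, the search terms are assumed to be literal words (hence quotemeta), and the inner group is written as non-capturing so that the second term stays in $3.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap $par1 and $par2 in <$tag>...</$tag> when $par2
# follows $par1 with at most $distance words in between.
sub highlight_proximity {
    my ($content, $par1, $par2, $distance, $tag) = @_;

    # Assumption: the terms are plain words, not regexp fragments.
    my ($w1, $w2) = (quotemeta $par1, quotemeta $par2);

    # \w+\W+ must consume a whole word plus its separator on each iteration,
    # which leaves the engine few alternatives to retry on failure.
    $content =~ s{\b($w1)(\W+(?:\w+\W+){0,$distance})($w2)\b}
                 {<$tag>$1</$tag>$2<$tag>$3</$tag>}gi;

    return $content;
}

my $text = 'The quick brown fox jumps over the lazy dog';
print highlight_proximity($text, 'quick', 'fox', 2, 'b'), "\n";
# prints: The <b>quick</b> brown <b>fox</b> jumps over the lazy dog
print highlight_proximity($text, 'quick', 'dog', 2, 'b'), "\n";
# prints the text unchanged, since "dog" is more than 2 words away
```

Doing the test and the replacement in a single s///, instead of a match followed by a second substitution, also avoids re-interpolating the captured text into a new pattern.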


Re^4: Problems searching and highlighting proximity words in a text
by jrc (Initiate) on May 24, 2010 at 11:30 UTC
    Thanks for your solutions; they seem to work in that example and in the others I have tried. The $4 does not seem to be necessary: at least in my case, only three captures are returned. An example code that works with your suggestions:
      The $4 seems not to be necessary,

      Indeed, as long as you use

      (?:\w+\W+){0,$distance}

      instead of the expression I've used,

      (\w+\W+){0,$distance}

      there will be no extra match. I haven't done any benchmarking, but the non-capturing group is probably a bit better/faster anyway.
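      A quick way to check the difference is to count what each pattern returns in list context. This is just a sketch with made-up sample text and distance:

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration only.
my $content  = 'alpha one two omega';
my $distance = 3;

# In list context a successful match returns the captured groups ($1, $2, ...).
my @capturing     = $content =~ /\b(alpha)(\W+(\w+\W+){0,$distance})(omega)\b/i;
my @non_capturing = $content =~ /\b(alpha)(\W+(?:\w+\W+){0,$distance})(omega)\b/i;

printf "capturing: %d groups, non-capturing: %d groups\n",
    scalar @capturing, scalar @non_capturing;
# prints: capturing: 4 groups, non-capturing: 3 groups
```

      With the capturing form, $3 holds the last repetition of the inner group ("two " here) and the second term moves to $4.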
