PerlMonks  

Re^3: Problems searching and highlighting proximity words in a text

by Krambambuli (Deacon)
on May 24, 2010 at 09:20 UTC (#841348)


in reply to Re^2: Problems searching and highlighting proximity words in a text
in thread Problems searching and highlighting proximity words in a text

If you run your code with perl -Dr (assuming your perl interpreter is compiled with debugging enabled), you'll see what I can see now too:

the regexp engine works and works and works...

However, I can't yet see exactly what the solution is. At first sight, the regexp is simply extremely inefficient, because of all the backtracking, when it does _not_ find what it is looking for.

Update.

A work-around to avoid the heavy backtracking when the wanted terms are not to be found in the wanted order might look like

    if ($content =~ /$par2.*$par1/i) {
        if ($content =~ /\b($par2)(\W+(?:\w*\W*){1,$distance})?($par1)\b/i) {
            warn "IF 2";
            my ($par1, $par2, $par3) = ($1, $2, $3);
            $content =~ s/$par1\Q$par2\E$par3/<$tag$class> $par1<\/$tag>$par2<$tag$class> $par3<\/$tag>/gi;
        }
    }
That works for me, but I guess there should be some nicer solutions too.

Update 2. It looks like using a regexp like
if ($content =~ /\b($par1)(\W+(\w+\W+){0,$distance})($par2)\b/i) {
works OK and also avoids the excessive backtracking for unsuccessful lookups. However, the inner parentheses introduce an extra capture group, so you'll have to use $4 instead of $3 for the second term.
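
To make the group numbering concrete, here is a minimal sketch built around that regexp. The helper name highlight_proximity, the sample text, and the use of \Q...\E (which assumes the search terms are literal words rather than patterns) are my own assumptions for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap two terms in <$tag> when
# the second follows the first with at most $distance words in between.
sub highlight_proximity {
    my ($content, $term1, $term2, $distance, $tag) = @_;

    # Capture groups:
    #   $1 = first term
    #   $2 = the whole gap between the terms
    #   $3 = last word of the gap (the extra capture from the inner parens)
    #   $4 = second term
    if ($content =~ /\b(\Q$term1\E)(\W+(\w+\W+){0,$distance})(\Q$term2\E)\b/i) {
        my ($first, $gap, $second) = ($1, $2, $4);

        # Replace the exact text that matched; \Q...\E keeps it literal.
        $content =~ s{\Q$first$gap$second\E}
                     {<$tag>$first</$tag>$gap<$tag>$second</$tag>};
    }
    return $content;
}

print highlight_proximity('The quick brown fox jumps', 'quick', 'fox', 3, 'b'), "\n";
```

If the terms are too far apart, or appear in the other order, the text is returned unchanged.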


Re^4: Problems searching and highlighting proximity words in a text
by jrc (Initiate) on May 24, 2010 at 11:30 UTC
    Thanks for your solutions. They seem to work in that example and in the others I have tried. The $4 does not seem to be necessary; at least in my case, the match returns only three results. An example code that works with your suggestions:
      The $4 seems not to be necessary,

      Indeed, as long as you use

      (?:\w+\W+){0,$distance}

      instead of the expression I've used,

      (\w+\W+){0,$distance}

      there will be no extra capture. I haven't done any benchmarking, but the non-capturing group (?:...) is probably a bit better/faster anyway.
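
A small sketch of the difference, counting what each form returns in list context. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text     = 'alpha one two beta';
my $distance = 3;

# Inner group captures: the match returns four values ($1 .. $4).
my @with_capture = $text =~ /\b(alpha)(\W+(\w+\W+){0,$distance})(beta)\b/i;

# Inner group is non-capturing (?:...): only three values ($1 .. $3).
my @without_capture = $text =~ /\b(alpha)(\W+(?:\w+\W+){0,$distance})(beta)\b/i;

printf "capturing: %d groups, non-capturing: %d groups\n",
    scalar @with_capture, scalar @without_capture;
```

With the non-capturing form the second term stays in $3, so the original three-variable code works unchanged.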
