PerlMonks  

Re: Text::Context or alternatives?

by davido (Cardinal)
on Nov 08, 2011 at 17:12 UTC


in reply to Text::Context or alternatives?

This seems to provide more context (no pun intended) for the code snippet you showed:

for my $word (@{ $self->{keywords} }) {
    my $word_score = 0;
    $word_score += 1 + ($content =~ tr/ / /) if $content =~ /\b\Q$word\E\b/i;
    $matches{$word} = $word_score;
}

That seems to be iterating over the list of keywords, and calculating a score per keyword.

The same result might be achieved more efficiently by turning the algorithm around: iterate over the words in $content rather than over the keywords, and check whether each word matches a keyword stored in a hash. If it does, apply the tr/// count.
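A minimal sketch of that inverted lookup, for illustration only. The sample @keywords and $content and the %keyword_for table are assumptions, not taken from Text::Context. Note the limitation: tokenising $content into single words means multi-word keywords such as "fresh cheese sandwiches" can never match, which the original regex loop does handle.

```perl
use strict;
use warnings;

# Hypothetical inputs; the module itself keeps keywords in $self->{keywords}.
my @keywords = ('cheese', 'bread', 'wine');
my $content  = 'Fresh cheese and warm bread, with more cheese on top';

# Lower-cased lookup table, mirroring the /i flag of the original match.
my %keyword_for = map { lc($_) => $_ } @keywords;

# The blank count does not depend on the keyword, so compute it once.
my $blanks = ($content =~ tr/ / /);

# Every keyword starts at 0, as in the original loop.
my %matches = map { $_ => 0 } @keywords;

# Walk the words of the content and look each one up in the hash.
for my $token ($content =~ /(\w+)/g) {
    my $lc = lc $token;
    next unless exists $keyword_for{$lc};
    $matches{ $keyword_for{$lc} } = 1 + $blanks;
}

printf "%-8s %d\n", $_, $matches{$_} for sort keys %matches;
```

This trades one regex match per keyword for one hash lookup per content word, so it only pays off when the keyword list is long relative to the paragraph.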


Dave

Replies are listed 'Best First'.
Re^2: Text::Context or alternatives?
by moritz (Cardinal) on Nov 09, 2011 at 08:41 UTC

    It looks to me as if $content =~ tr/ / / could be calculated once outside the loop, and be kept in a variable.

    Still, I find the number of blanks to be a rather dubious metric. What about all the other whitespace characters? Does it make sense for two blanks in a row to count double?
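Both points can be sketched as follows. The sample data and the names $blank_count and $word_count are made up for illustration; the \S+ count is just one possible alternative metric, not something the module does.

```perl
use strict;
use warnings;

# Hypothetical sample: a double blank and two tabs between words.
my @keywords = ('cheese', 'bread', 'wine');
my $content  = "Fresh  cheese\tand\twarm bread";

# Loop-invariant, so it is computed once instead of once per matching keyword.
my $blank_count = ($content =~ tr/ / /);

my %matches;
for my $word (@keywords) {
    $matches{$word} = $content =~ /\b\Q$word\E\b/i ? 1 + $blank_count : 0;
}

# Alternative metric: count runs of non-whitespace, so tabs separate
# words and doubled blanks are not counted twice.
my $word_count = () = $content =~ /\S+/g;

print "blanks: $blank_count, words: $word_count\n";
```

Here "1 + blanks" gives 4 while the paragraph actually holds 5 words, which is the discrepancy the whitespace question is pointing at.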

      Yes, I too think it could be calculated outside the loop. But what merit does any statistic of the paragraph have as a score for a match?

        So in summary after doing some more spelunking, I think that Text::Context is buggy and probably unfinished. But nobody's suggested any alternatives. So I'll probably just bodge it until it works for me. I'll post the results to its RT queue.

Re^2: Text::Context or alternatives?
by Dave Howorth (Scribe) on Nov 08, 2011 at 17:21 UTC

    Yes, I didn't provide the context because I suppose monks will have their own ideas about how much is relevant.

    My question is rather: what relevance does the number of words in the paragraph (i.e. 1 + the tr///) have to a meaningful score?

    It's now occurred to me that perhaps that should read

    ($word =~ tr/ / /)
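To make the difference concrete, here is a small sketch of the loop with that one change applied. The keywords and content are invented sample data, and whether this is what the module's author intended is a guess, not something confirmed here.

```perl
use strict;
use warnings;

# Hypothetical sample data, including one multi-word keyword.
my @keywords = ('cheese', 'fresh cheese sandwiches', 'wine');
my $content  = 'We sell fresh cheese sandwiches and bread';

my %matches;
for my $word (@keywords) {
    my $word_score = 0;
    # 1 + blanks in the keyword = number of words in the keyword,
    # so the score no longer depends on the length of the paragraph.
    $word_score += 1 + ($word =~ tr/ / /)
        if $content =~ /\b\Q$word\E\b/i;
    $matches{$word} = $word_score;
}

printf "%-24s %d\n", $_, $matches{$_} for sort keys %matches;
```

With this variant a one-word keyword scores 1 and a three-word keyword scores 3, regardless of how long the paragraph is.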

      Oh, I thought that part was made obvious in the documentation of the source code:

      "Now we want to find a "score" for this paragraph, finding the best set of keywords which "apply" to it. We favour keyword sets which have a large number of matches (obviously a paragraph is better if it matches "a" and "c" than if it just matches "a") and with multi-word keywords. (A paragraph which matches "fresh cheese sandwiches" en bloc is worth picking out, even if it has no other matches.)"

      It seems the intent is to find out how powerful the keyword is within a given paragraph. More matches means a better fit, more relevancy.

      And on second thought, there's really nothing to be gained by turning the algorithm on its side. It's utilizing Perl's strengths already.

      If speed is of concern, profile and find where the bottleneck is. Tom Duff (of Duff's Device) said this:

      "If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles."

      Step one: figure out where the trouble really is (profile). Step two: try to devise a better algorithm for that particular segment of code. Step three (if step two fails): remove cycles. That may be easier said than done, and until profiling shows that this particular loop is the problem, we can't be sure it is worth optimizing.

      The source code for the module itself gives a clue immediately following that loop:

      #XXX : Possible optimization: Give up if there are no matches
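One way to read that comment, sketched as a standalone function. score_keywords is a hypothetical helper, not part of Text::Context, and the real module may structure this quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hashref of scores, or nothing at all
# when no keyword matches the paragraph.
sub score_keywords {
    my ($content, @keywords) = @_;

    my $blanks = ($content =~ tr/ / /);
    my %matches;
    for my $word (@keywords) {
        $matches{$word} = $content =~ /\b\Q$word\E\b/i ? 1 + $blanks : 0;
    }

    # Give up early: skip any later work if every score is zero.
    return unless grep { $_ > 0 } values %matches;

    return \%matches;
}

my $hit  = score_keywords('some cheese here', 'cheese', 'wine');
my $miss = score_keywords('plain text', 'cheese', 'wine');

print defined $miss ? "scored\n" : "no matches, skipped\n";
```

The saving comes from whatever processing would otherwise follow the loop for paragraphs that cannot contribute anything.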


      Dave

        davido wrote:

        "Oh, I thought that part was made obvious in the documentation of the source code:

        "Now we want to find a "score" for this paragraph, finding the best set of keywords which "apply" to it. We favour keyword sets which have a large number of matches (obviously a paragraph is better if it matches "a" and "c" than if it just matches "a") and with multi-word keywords. (A paragraph which matches "fresh cheese sandwiches" en bloc is worth picking out, even if it has no other matches.)"

        It seems the intent is to find out how powerful the keyword is within a given paragraph. More matches means a better fit, more relevancy."

        That's where I have trouble understanding. How does the number of words in the paragraph have anything to do with the quality of the match? It seems to me that the documentation and implied intent don't match the code. If you think it's correct, can you perhaps explain what it does in different words?

        If speed is of concern, profile and find where the bottleneck is.

        Indeed, but it's correctness rather than performance that concerns me, though the performance is what got me started investigating. I posted a summary of my NYTProf results to its RT queue a few days ago.
