http://www.perlmonks.org?node_id=332239

sri has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I am currently working on a search engine, and now I would like to add some learning capabilities so that it can learn from the users' likes and dislikes.

I studied some books and documentation, and on CPAN I found the following modules, but I'm still unsure which way is the right one.
- AI::Categorizer
- Mail::SpamAssassin
- Algorithm::NaiveBayes

I know there are lots of possibilities (Bayes, KNN, decision trees...); the most popular at the moment seems to be Naive Bayes.

What would you use and why?

Some hints for good documentation or Perl examples I may have missed would be nice too.

Edited by castaway, fixed closing </a> tags

Replies are listed 'Best First'.
Re: Machine learning
by mce (Curate) on Feb 27, 2004 at 14:47 UTC
    Hi,
    Have a look at PerlFect.

    This is a nice free Perl search engine.

    It might give you some ideas.


    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    BMC, Belgium
      Thanks for your tip, but that's not really what I wanted.

      I have already reached the state of PerlFect; I even implemented massive storage clustering.

      The key feature of the search engine should be categorization of results and learning from the user's input: changed keywords, requests for more than the first ten results, and so on.
      The whole concept of the user interactivity is far too broad to explain here.

      What I am seeking is simply a good learning algorithm! ;)
        Automatic categorisation is the panacea of Knowledge Management, and it is something that a great many people are working on with a view to making some serious financial gain. The auto-cat software is therefore expensive, but the blurb on the vendors' websites may be of interest.

        I use search engine software from Verity, who implement machine-assisted categorisation in a workbench tool such that the output keyword net can be applied to content as it is indexed. This works well in a corporate environment where content doesn't change that much and you just want to locate it in a defined categorisation structure. Verity also have a 'social network' product that allows people to locate subject matter experts. I haven't worked with this bit yet, but the demo looked cool.

        I have also looked at Autonomy, who popularised Bayesian techniques for clustering results. Their search engine works really well for newsfeeds, where the clustering is generally unknown and fluid. The search results can appear really random until the internals have caught up with a new cluster of information. I am told that the BBC News website uses this technique to create the 'related stories' links on its website.

Re: Machine learning
by ysth (Canon) on Feb 27, 2004 at 16:08 UTC
    Your research already exceeds my knowledge of the subject, but I'd like to offer some feedback on your goal itself, if I have understood it correctly. It seems to me that having the result of a search depend on not just current but past search criteria is a mistake. Not only should I be able to duplicate a search and get the same results, but I should be able to tell others what I searched for and have them get the same results.

    Perhaps you could show a simple example of a pattern of search/result that demonstrates what kind of learning you are trying for?

    Update: Sorry, makes sense now; don't know how I got the idea that you were talking about per-user or per-session learning.

      Well, not even Google guarantees you the same results over and over again. Do you remember the Google Dance?

      I don't think it's a good idea to always get the same results again, because not everyone searches for the same things, even if they use the same query.

      Overall learning is also planned, not just from session to session, so that everybody benefits from the feedback of others.
      This should not just bring better results but also kill spam.

      This is still fiction:
      If someone enters a query, he gets 10 results back, and these results automatically get upvotes in the background. If the user likes them, that's fine. Otherwise there is a link at the bottom called "Try again"; when the user hits it, the ten results get downvoted and he gets another 10 results. Some of these may also have been in the previous results, depending on whether there are enough alternatives.
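
      The feedback loop above could be sketched roughly like this. It is only a toy in Python to keep it self-contained, not a design: the class and method names are made up, the query itself is ignored, and the size of the "Try again" penalty (undo the implicit upvote, then subtract one more) is purely an assumption.

```python
from collections import defaultdict


class FeedbackRanker:
    """Toy ranker: shown results are upvoted, "Try again" downvotes them."""

    def __init__(self, pages, page_size=10):
        self.pages = list(pages)
        self.page_size = page_size
        self.score = defaultdict(int)   # page -> accumulated votes

    def _best(self, exclude=()):
        # Highest score first; sorted() is stable, so ties keep insertion order.
        candidates = [p for p in self.pages if p not in exclude]
        return sorted(candidates, key=lambda p: -self.score[p])[: self.page_size]

    def search(self):
        shown = self._best()
        for page in shown:
            self.score[page] += 1       # implicit upvote for being shown
        return shown

    def try_again(self, rejected):
        for page in rejected:
            self.score[page] -= 2       # assumption: undo the upvote, penalise once
        fresh = self._best(exclude=set(rejected))
        # Not enough alternatives: fill up with the best of the rejected pages.
        if len(fresh) < self.page_size:
            fill = sorted(rejected, key=lambda p: -self.score[p])
            fresh += fill[: self.page_size - len(fresh)]
        for page in fresh:
            self.score[page] += 1       # the new page of results is upvoted too
        return fresh
```

      Calling search() and then try_again() with the returned list gives the second page; when fewer than ten unseen pages exist, some of the rejected ones come back, as described above.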

      There will be no ranking per page but per category; both results and queries get categories.
      Which categories the pages and queries fall into is absolutely dynamic and should change from time to time based on learning.
      They can have more than one category.
      It does not completely rely on user feedback; it also uses traditional link structures and text formatting, but that's far less important than feedback.
      So every new page gets a chance to come up in results quite fast.

      This was just a basic overview. I completely ignored balancing and stuff, but I have much more complicated scenarios in my head. :)

      You see, I am still in the early stages, but things get clearer every day. ;)
Re: Machine learning
by kvale (Monsignor) on Feb 27, 2004 at 16:24 UTC
    Support Vector Machines are popular in the ML community these days for classification and learning. They work by taking a nonlinear problem and linearizing it in a higher dimension using the 'kernel trick' to allow for fast algorithms for both training and implementation.
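
    To make the 'kernel trick' a bit more concrete, here is a small sketch (in Python, just to keep it dependency-free; it is not how libsvm is implemented). It assumes a degree-2 polynomial kernel on 2-D points and shows that the kernel value computed in the original space matches a dot product in an explicit 3-D feature space. The function names are made up for illustration.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, computed in 2-D."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map phi whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)                    # never leaves the 2-D space
explicit = dot(feature_map(x), feature_map(y))  # goes through the 3-D space
print(implicit, explicit)                       # equal up to rounding error
```

    The point is that an SVM only ever needs the kernel values, so it gets the benefit of the higher-dimensional space without constructing it.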

    In the Perl world, Algorithm::SVM is a Perl binding to the libsvm library that handles SVM algorithms.

    -Mark

Re: Machine learning
by prostoalex (Scribe) on Feb 27, 2004 at 17:04 UTC
    You are starting the evaluation of your nails by asking which hammer is right. I'd say pick up a book on artificial intelligence, create a list of algorithms and techniques that interest you and go from there.

    In the Artificial Intelligence section of CPAN there's a lot more to choose from, including decision tree, neural network, and expert system implementations.

Re: Machine learning
by xiopher (Beadle) on Feb 27, 2004 at 16:24 UTC
    This would be a great thing to add to your personal firewall. You could then have your own personal home page that runs off your firewall. It would suggest the news you are interested in and the slashdot stories you want to comment on. It could also notice that you have spent way too much time surfing Perl websites and not enough coding Perl.
Re: Machine learning
by Vautrin (Hermit) on Feb 27, 2004 at 16:43 UTC
    Naive Bayesian algorithms are based on probability theory, and need a corpus of documents to "train" them. So, if all you are looking for is whether or not a web page is a good match, that might work (after a few thousand matches you'll have a pretty good accuracy rate). What exactly are you trying to do?
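
    For illustration, a minimal sketch of the idea: a multinomial Naive Bayes classifier over word counts with add-one smoothing, written in plain Python so it runs without any modules. The class name, the whitespace tokenizer, and the "good"/"bad" labels are all assumptions made for the example, and a real corpus would need far more than four documents.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.doc_count = Counter()               # label -> number of documents
        self.word_count = defaultdict(Counter)   # label -> word -> count
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.doc_count[label] += 1
        self.word_count[label].update(words)
        self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_count.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.doc_count.items():
            counts = self.word_count[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)              # log prior
            for word in text.lower().split():
                logp += math.log((counts[word] + 1) / denom)  # smoothed likelihood
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


nb = NaiveBayes()
nb.train("perl regex tutorial", "good")
nb.train("perl module documentation", "good")
nb.train("cheap pills buy now", "bad")
nb.train("buy cheap watches now", "bad")
print(nb.predict("perl documentation"), nb.predict("buy cheap pills"))
```

    Every page a user accepts or rejects would become one more train() call, which is where the "few thousand matches" come in.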
