
Running SuperSearch off a fast full-text index.

by dmitri (Curate)
on Jun 10, 2007 at 00:17 UTC

Dear Brethren,

Do you find Super Search slow and unwieldy? That's because the database search is slow. I was recently playing with the most excellent KinoSearch library and discovered that certain things lend themselves well to full-text search. For instance, I created an index for searching CPAN's bug database (rt.cpan.org) and a web application to search it. I believe the result speaks for itself.

I am willing to dedicate time to indexing the perlmonks database (just need a place to host it). There are only about 600,000 nodes with a small number of them changing every day; at work, I used KinoSearch to index millions of documents and the searches are very fast and relevant. I believe that this is the type of thing that will a) make searching easier and b) take some load off the database.
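To give a feel for how little code the indexing side takes, here is a minimal sketch using KinoSearch's pre-0.20 InvIndexer interface (0.20 moved to a Schema-based API). The field names, index path, and sample data are invented for illustration; check the docs for the version you install:

```perl
use strict;
use warnings;
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;

# English-language analyzer: tokenizing, lowercasing, stemming.
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/invindex',    # hypothetical location
    create   => 1,
    analyzer => $analyzer,
);
$invindexer->spec_field( name => 'title' );
$invindexer->spec_field( name => 'body' );

# One document per PerlMonks node (stand-in data here).
my @nodes = ( { title => 'Sample node', text => 'Hello, Monastery!' } );
for my $node (@nodes) {
    my $doc = $invindexer->new_doc;
    $doc->set_value( title => $node->{title} );
    $doc->set_value( body  => $node->{text} );
    $invindexer->add_doc($doc);
}
$invindexer->finish;
```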

Let's discuss? I hope I get the opportunity to contribute to the Monastery.

  - Dmitri.

Update: Track the progress here: http://code.google.com/p/monk-search/

Re: Running SuperSearch off a fast full-text index.
by educated_foo (Vicar) on Jun 10, 2007 at 02:04 UTC
    Something better than SuperSearch would be nice, but IMHO full-text search on public websites is best done with Google. Just tack a "site:perlmonks.org" onto your query, and you've got a lot of smart people working really hard to give you relevant, up-to-date nodes.

    Perlmonks could make it much better by, for example, disabling the crawling of links into comment threads -- now you get links to "Re^3: blah," "Re^2: blah", etc. rather than just to the root. The robot version could also omit the side-nodes. Doing these things seems much easier than doing your own indexing and search.
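For reference, one low-tech way to implement the first suggestion would be a robots meta tag emitted only on reply pages. This is a sketch; whether and where it fits into the PerlMonks templating system is an open question:

```html
<!-- Emitted in the <head> of "Re:"/"Re^N:" pages only (hypothetical hook). -->
<!-- noindex keeps the reply out of the search index; follow still lets     -->
<!-- the crawler traverse the thread and find the root node.                -->
<meta name="robots" content="noindex,follow">
```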

      I just tried that -- with TStanley's "Writing a module test" which is a couple days old now and with some much more elderly root nodes of my own.

      No joy.

      So I went to advanced search, asked for the same items with a "from perlmonks.com only" qualifier and exact phrases using root node titles again.

      Again, no joy!

      My little experiment may have been inadequate in many ways, especially the limited sample.

      But it may be that Google's indexing of pm is painfully inadequate. Running the same experiment against some of the sites I maintain turned up no such shortcomings, even when targeting a word used exactly once in a site that's been on line for barely a week.

      And while crawling the reply links has costs -- for pm and for the user who has to wade thru "Re^3: blah," "Re^2: blah," and so on -- the best material, IMO, tends to be in the replies rather than in the OPs. And wading thru the OPs and their answers, seeking the relevant ones, can also be painful for the seeker.

      Just my .02 (two cents, tuppence... does anyone have any non-English equivalent idioms?).
        I agree with the second part... some topics I want to search for may only be mentioned in the Re: nodes.

        Maybe it would be convenient if the search results page had a "link to OP" next to the Re: results?

        ~dewey

      Because we have access to metadata that Google's naive crawler does not, we enjoy certain advantages when building a custom search. Certainly we can offer bells and whistles on the Super Search page that Google's advanced search can't match — they can't do filtering by author, ranking by node reputation, and so on.

      I am confident that our users would find a KinoSearch-based Super Search considerably more usable than the current version, and that this would make them very happy. Programmers like to tweak tweak tweak. :) As a bonus, I also suspect that we can provide simple search results superior to what Google can offer, and certainly better than what we have now. It will be interesting to compare search results before and after we factor node rep into our ranking algorithm.
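To make "factor node rep into our ranking algorithm" concrete, one possibility is to map reputation to a multiplicative score boost, log-damped so a handful of hugely upvoted nodes don't drown out everything else. The formula below is purely hypothetical, just to illustrate the shape of the idea:

```perl
use strict;
use warnings;

# Hypothetical damping scheme: rep <= 0 gets no boost,
# rep 9 doubles the score, rep 99 triples it, and so on.
sub rep_boost {
    my ($rep) = @_;
    $rep = 0 if $rep < 0;
    return 1 + log( 1 + $rep ) / log(10);
}
```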

      Whether or not it is worthwhile to maintain custom indexing and search for a public site depends on the site's size and the demands of its user-base. I expect that with several hundred thousand pages and extremely sophisticated users, we're well past the threshold. My guess is that the time it takes to maintain full-text search, including an advanced search interface, will be fully justified by a collective productivity increase. :)

      SEO improvements to help web search engine spiders should probably be implemented regardless because increasing this site's visibility will aid people seeking answers to Perl questions from outside. However, I understand the powers-that-be have had good reasons for clamping down on spider access, historically.

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
        A lot of these are obsoleted by a good ranking function, which will tend to pull the best hits to the top even without the additional metadata. For example, a search for "rectangular humphrey" turns up this: "I'm starting to get offers from people who want to sponsor features in my CPAN distro, KinoSearch," which is very relevant -- I didn't realize you were the author of KinoSearch, which you are also suggesting as a platform.

        I agree that node ratings, etc., can be useful, but one of Google's big lessons is that quantity can beat quality: intelligent analysis of huge amounts of generic data can beat analysis of specialized data. This is particularly visible in its approach to natural language translation, but is nearly as important in search.

Re: Running SuperSearch off a fast full-text index.
by creamygoodness (Curate) on Jun 10, 2007 at 16:33 UTC

    dmitri,

    I've long wanted to do exactly what you've proposed, but just haven't found the cycles before now. I would be excited to collaborate with you on it.

    As for hosting, for the time being I can run the app at rectangular.com... and maybe we could set up a repository at code.google.com? ;)

    In addition to the indexer and search applications, we'll need a spidering app that pulls down a local copy of each PerlMonks node. tye has granted permission to spider the site, and suggested the PerlMonks XML node view for getting at the content (see What XML generators are currently available on PerlMonks? for info). Here's an XML rendering of your original post as an example.

    In the initial pull, we'd iterate over each node numerically, probably saving individual XML files to the file system, 1000 nodes per directory. Some nodes will present problems — reaped nodes, for instance — but the responses will always contain sufficient information to dispatch sensibly.
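The fetch loop for that initial pull might look roughly like the following. The XML node view URL pattern is the one mentioned above; the corpus layout, user-agent string, and upper bound are assumptions, and real code would need proper dispatch on reaped/forbidden nodes:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Path qw(mkpath);

my $ua = LWP::UserAgent->new( agent => 'monk-search-spider/0.1' );

# Iterate node ids numerically, saving 1000 XML files per directory.
for my $id ( 1 .. 620_000 ) {    # roughly the node count at the moment
    my $dir = sprintf 'corpus/%06d', int( $id / 1000 );
    mkpath($dir) unless -d $dir;

    my $url = "http://perlmonks.org/?node_id=$id;displaytype=xml";
    my $res = $ua->get($url);
    if ( $res->is_success ) {
        open my $fh, '>', "$dir/$id.xml" or die "can't write $id.xml: $!";
        print $fh $res->content;
        close $fh;
    }
    # else: dispatch on the error -- reaped node, access denied, etc.

    sleep 1;    # be polite to the Monastery's servers
}
```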

    Keeping the locally mirrored data up-to-date presents some problems, especially with regard to updated text and node rep fluctuations. These problems will be trivial to solve should the service move onto perlmonks.org directly; some of them are solvable even when running remotely, as the total volume of data is not very large. In any case, freshness issues will not have a major impact on the user experience, and people will have no trouble making sensible comparisons between the old and the new.

    Once we have a corpus, the indexing and search apps will present familiar challenges for us both. It will be fun to tinker with the ranking algorithms, and I expect that the extremely demanding user base will provide us with lots of high-quality feedback. :)

    What say? Sound like a plan?

    Cheers,

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
      Marvin,

      it will be an honor to work with you on this project. What shall we call it?

        - Dmitri.

        How about "MonkSearch", if it's available? Have you ever set up a code.google.com project? I haven't.

        I figure we should set things up like a standard CPAN distro. Most of the code in module files, utility scripts in bin/, yada yada. Sound good? We can call the Perl modules whatever we want at first -- it doesn't matter until there's a public API, which there may never be.
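For concreteness, the layout being described might look something like this (all names hypothetical until the project actually exists):

```
monk-search/
    Makefile.PL
    bin/             # spider, indexer, and search front-end scripts
    lib/
        MonkSearch/  # the actual modules -- names TBD
    t/               # tests
```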

        Are you willing to play the role of lead developer on the project? I believe you're subscribed to the KinoSearch list, so you've seen how collaborations have gone there -- there's often some gory back end stuff that falls to me. For this project, I figure I'll end up spending significant time tweaking the ranking algo in response to user feedback once we have everything in place. In anticipation of that, it would be great if you could take responsibility for most of the high level architecture. I think we'll have a more fruitful collaboration if you own the code and I play a secondary role.

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
Re: Running SuperSearch off a fast full-text index.
by clinton (Priest) on Jun 10, 2007 at 16:43 UTC
    Or, seeing as we already run on MySQL, why not try out the MySQL full text indexes again?

    In a conversation with tye in the CB, he said that PM had already tried using MySQL's full text indexes, and that for searches such as 'perl', the DB just locked up.

    My own experience with MySQL's full text indexes is that they are very fast and very reliable, on sites with millions of documents. It may be that the version of MySQL that PM is using (3.xx, if I remember correctly) is somewhat less reliable, and is responsible for these problems.
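For anyone who hasn't used it, here is roughly what MySQL full text search looks like from Perl via DBI. The table and column names are invented; note that in MySQL of this vintage, FULLTEXT indexes are only available on MyISAM tables:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:database=pmtest', 'user', 'pass',
    { RaiseError => 1 } );

# One-time setup: add a full text index over title and body.
$dbh->do('ALTER TABLE node ADD FULLTEXT ft_node (title, doctext)');

# Query, ranked by relevance score.
my $sth = $dbh->prepare(q{
    SELECT node_id, title,
           MATCH(title, doctext) AGAINST(?) AS score
    FROM node
    WHERE MATCH(title, doctext) AGAINST(?)
    ORDER BY score DESC
    LIMIT 20
});
$sth->execute( 'full text search', 'full text search' );
while ( my $row = $sth->fetchrow_hashref ) {
    printf "%6d  %.3f  %s\n", @{$row}{qw(node_id score title)};
}
```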

    I'm assuming that PM has a test suite that we could use to check how a newer version of MySQL works with the existing code.

    I'm betting that upgrading to a newer version and adding full text indexes won't be an enormous task -- probably easier than implementing KinoSearch (disclaimer: I have no experience of KinoSearch)

    Clint

      I'm assuming that PM has a test suite that we could use to check how a newer version of MySQL works with the existing code.
      Assumptions are dangerous :)


      holli, /regexed monk/
        Ahhh, so as far as writing tests goes, it's a case of Do as I say, not as I do... :)
      I heard about this feature of MySQL, but I have never used it. How does it compare to KinoSearch in terms of search results relevance?
        MySQL full text search is a simple way to make your tables searchable. However, it doesn't compare well with specialized full-text indexes, which always do a better job.

        Aside from its not quite stellar performance, it has a fixed stop word list and depends upon MySQL, which is (IMHO!) one moving part too many for indexing the Monastery.

        The sole purpose of this project is full-text search. Let's not get into over-engineering: we have slow RDBMS searches right now :-)


        2share!2flame...
      I did a couple of searches for benchmarks comparing MySQL full text search and KinoSearch / Lucene.

      While there wasn't much on Kinosearch (see KinoSearch vs Lucene indexer benchmarks), I did find this comparison between MySQL full text (plus some plugins) and Lucene - see High-Performance-FullText-Search.pdf, which indicates that Lucene is the clear winner in their comparisons.

      Also see this PDF for a nice introduction to KinoSearch.

      One of the things I like about MySQL full text is its integration into the main database, so that adding WHERE clauses based on other columns is easy. But reading the benchmarks, this appears to have a significant deleterious effect on performance, so perhaps it isn't such a clever idea after all.

      I see that KinoSearch 0.20 has range searches and filters - I'd be interested in knowing what effect these have on performance.

      Clint

        KinoSearch 0.20's RangeFilters are mostly implemented in C and are optimized for low cost over multiple searches.

        The first time you search with a sort or range constraint on a particular field, there is a hit as a cache has to be loaded. The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI) and can keep the Searcher object around for reuse.

        Once the cache is loaded, RangeFilter is extremely fast. There's an initial burst of disk activity as numerical bounds are found, then the rest is all fetching values from the cache and if (locus < lower_bound) C integer comparison stuff -- no matter how many docs match. There's hardly any overhead added above what's required to match the rest of the query.
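A sketch of the pattern being described: build the Searcher once in a persistent process and reuse it, so the range cache is paid for only on the first request. I'm writing the constructor parameters from memory and the `rep` field is hypothetical, so treat this as a shape, not gospel -- consult the KinoSearch 0.20 docs for the exact interface:

```perl
use strict;
use warnings;
use KinoSearch::Searcher;
use KinoSearch::Search::RangeFilter;

# Built once per mod_perl/FastCGI process; the sort/range caches
# inside the Searcher survive across requests.
my $searcher = KinoSearch::Searcher->new( invindex => '/path/to/invindex' );

# Only match nodes whose (hypothetical) rep field is at least 10.
my $filter = KinoSearch::Search::RangeFilter->new(
    field         => 'rep',
    lower_term    => '0010',
    include_lower => 1,
);

my $hits = $searcher->search( query => 'closures', filter => $filter );
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}\n";
}
```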

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com

Node Type: monkdiscuss [id://620228]
Approved by shigetsu
Front-paged by shigetsu