Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Fulltext DB search: The Need for Speed

by jest (Pilgrim)
on Oct 27, 2003 at 15:32 UTC ( #302408=perlquestion: print w/replies, xml ) Need Help??

jest has asked for the wisdom of the Perl Monks concerning the following question:

I have a database app running on MySQL. Most of it is running very nicely, but my big problem is that I need to execute fast fulltext searches against one of the tables, and the built-in MySQL fulltext capability just isn't fast enough for common words.

The table in question has about 3M rows, and the relevant field averages maybe 20-30 words. It's perfectly fine when searching for not-extremely-common words, but I need to be able to search on anything (I've changed the minimum-word-length to 2 characters, and eliminated the stopword list). Doing searches for phrases like "the thing is" or "he said" can take many seconds or even minutes, and this is not an artificial need, but a common search that the app must be able to handle.

Most of the things I've looked at for fulltext searches are concerned with things like stemming, or fuzzy matching, or result ranking. I'm interested in none of those things, I only want an exact result quickly. I tried for example DBIx::FullTextSearch, but, while superficially appealing, this (1) took 250 times longer to build the indexes than in the MySQL built-in, and (2) didn't really run any faster (it was hard to benchmark, but if there was a difference it was slight).

I also looked at the Vector Space search engine, and the Perlmonks discussion of it, and while this looks really cool, it didn't seem as if that was especially fast either, and besides I don't have the math chops to really get this working.

I haven't tried the recent perl.com article on Adding Search Functionality to Perl Applications, but it seems that that's geared more towards stemming and ranking than in raw speed.

Does anyone have a suggestion? I know most people aren't trying to match "in it", but some people are, and Google can do it quickly, and I hope there's a way of accomplishing this.

Replies are listed 'Best First'.
Re: Fulltext DB search: The Need for Speed
by kvale (Monsignor) on Oct 27, 2003 at 18:47 UTC
    In addition to Abigail's excellent advice, heuristics can be employed to speed things up.

    One heuristic is to create an internal stoplist of common words and look up only non-stoplisted words first. Then from that small set of matching records, search for the stoplisted words. Set intersection is associative -- use that to your advantage.

    Another heuristic is to use caching. If there are a relatively small number of very common and commonly requested words, create a table of matching records for each word and search only that table if the word is in the phrase. This prevents having to pull up most of a table each time such a word appears.

    -Mark

Re: Fulltext DB search: The Need for Speed
by Abigail-II (Bishop) on Oct 27, 2003 at 15:57 UTC
    Well, this isn't really a Perl question, but more a question about of to organize your data. How Google does it is a secret, but I can tell you that Google isn't doing a full search against the entire web.

    But I can speculate. My guess is that Google has a huge index. An index on words. If it fetches a (new) web page, it makes a list of all the words occurring in the page. For each word, it stores a pointer to said page in the index of words it keeps (including pointer(s) where in the document the word(s) are found). So, if you search for "the brown cat with a glass eye", it will toss out the common words 'the' and 'a' (and perhaps 'with' as well). For 'brown', 'cat', 'glass', and 'eye' (and perhaps 'with'), it gets the pointers to the pages the words are found in. For pages containing all words, you need to take the intersection of the different (sub)results.

    Of course, in reality Google will do it much smarter, perhaps not just indexing on single words, but on word pairs or triples, or by using a multilevel index.

    But the important point is, if you want to search on words, and you want to search fast, you got to index on words, and not use full text searches. And even if you want to only returns documents that have "the brown cat with a glass eye" right next to each other, it's a huge win if you limit your full text search to those documents that contain the words 'brown', 'cat', 'glass' and 'eye'.

    Abigail

      Thanks. I should clarify that the MySQL built-in fulltext search I am currently using is an index-based search; I'm not merely doing full table scans to match "WHERE column LIKE '%$match%'" or something like that. And these other solutions I've tried, or am considering trying, are likewise building up an index of individual words.

Re: Fulltext DB search: The Need for Speed
by perrin (Chancellor) on Oct 27, 2003 at 17:42 UTC
    It sounds like you need a multi-pass search. Change your indexing options so that 2 letter words and stopwords are not indexed. Then you can do a really fast search for all documents containing the most significant search words, ignoring whether or not they are in a phrase. This is the "qualifiying" round. Next, do a LIKE '%phrase%' search on the rows that matched the first query. This should be a fairly small set, so it will be fast.

    If you really have to search for phrases like "in it" quickly, you may have to look into different kinds of indexes and probably read some search engine theory. The Algorithms in Perl book from O'Reilly might be a good starting place.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://302408]
Approved by traveler
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2022-01-24 10:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:












    Results (64 votes). Check out past polls.

    Notices?