New simple search
by tye (Cardinal)
on Jul 08, 2002 at 17:34 UTC
As part of the More HTML Escaping roll-out, the simple search (at the top of each page or via the "node" CGI parameter) was switched away from using MySQL's "full text search" feature. This meant that you could once again search for 3-letter and 2-letter words in node titles. That version avoided the "worst case" situations of the servers sorting through far too many matches, but it would not find anything unless all of the words entered matched.
I've just rolled some more improvements into the simple search. The current implementation works like this:
If an exact title match is found (after ignoring nodes that you don't have permission to read unless you have changed your user settings), then no further searching is done.
Otherwise your search string is split on whitespace resulting in a list of "words". We look for nodes that contain the greatest number of your "words" in their titles as simple substrings. Titles that match this maximal number of words are listed, newest first. That is, if you specify 5 words and there are no titles that include 4 or more of your words but there is a title that contains 3 of your words, then you will only be shown titles that contain 3 of your words.
If there are more than 500 such matches, then the oldest 500 are listed (newest first). In future it should change to showing the newest 500 matches but that requires a database change to work around a subtle bug in the MySQL optimizer.
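For anyone who finds code easier to follow than prose, the steps above can be sketched roughly as below. This is only an illustration in Python, not the site's actual code: the `Node` class, its `created` field, and `simple_search` are made-up names, the case-insensitive comparison is an assumption, and the permission filtering and the special handling of 1-character words are left out.

```python
from dataclasses import dataclass

MAX_RESULTS = 500  # the cap mentioned in the post


@dataclass
class Node:
    title: str
    created: int  # hypothetical timestamp; larger means newer


def simple_search(query, nodes, limit=MAX_RESULTS):
    """Return matching nodes, newest first (illustrative sketch only)."""
    wanted = query.lower()  # assumption: matching ignores case

    # Step 1: an exact title match ends the search immediately.
    exact = [n for n in nodes if n.title.lower() == wanted]
    if exact:
        return sorted(exact, key=lambda n: n.created, reverse=True)

    # Step 2: split on whitespace and count how many distinct words
    # appear in each title as plain substrings.
    words = set(wanted.split())
    scored = [(sum(w in n.title.lower() for w in words), n) for n in nodes]

    # Step 3: keep only the titles that match the maximal number of words.
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return []
    matches = [n for score, n in scored if score == best]

    # Step 4: if there are too many, keep the OLDEST `limit` matches
    # (the current behaviour described above), then list newest first.
    oldest = sorted(matches, key=lambda n: n.created)[:limit]
    return list(reversed(oldest))
```

With titles "Perl regex help" and "Perl regex speed help" in the data, a search for `perl regex help tips` would return both (three words each) and skip a title that only contains "regex".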
Future changes to the Search results display code will probably reduce clutter by hiding most of the information about replies if a large list of matches was found.
Note that 1-character words must be surrounded by whitespace in the node title for them to match (so / c finds C Client / Perl Server incompatibility and its replies but little else -- note that the ends of titles count as whitespace).
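The 1-character rule can be pictured with a small sketch like this one. Again, this is illustrative Python rather than the real implementation: `word_matches` is a hypothetical helper, and the case-insensitive comparison is an assumption. Splitting the title on whitespace takes care of the "ends of titles count as whitespace" detail.

```python
def word_matches(word, title):
    """Check one search word against one title (illustrative sketch only)."""
    word = word.lower()    # assumption: matching ignores case
    title = title.lower()
    if len(word) == 1:
        # A 1-character word must stand alone between whitespace
        # (or at either end of the title).
        return word in title.split()
    # Longer words match as simple substrings.
    return word in title


title = "C Client / Perl Server incompatibility"
print(all(word_matches(w, title) for w in "/ c".split()))
print(word_matches("c", "Perl scripts"))
```

So `c` matches the title above and its "Re:" replies, but not "Perl scripts", where the letter only appears inside a longer word.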
Also, there are no "stop words". A search for perl script takes about the same time as a search for something much more specific.
More flexibility will be available via Super Search when it gets rewritten (hopefully RSN).
- tye (but my friends call me "Tye")