Offsite Perlmonks Search Engine

I've put together an experimental Perlmonks search engine that is now up and running. It searches node titles, not node contents, and currently only contains data from the past few months. Even with those limitations, I hope it will help fill the gap we currently have in searching this site.

Depending on how hard it hits my server, I may or may not leave it up permanently.

As always, feedback is more than welcome.

So, what are you waiting for? Go try it out.

-Blake


Replies are listed 'Best First'.
Re: Offsite Perlmonks Search Engine by RMGir (Prior) on Jul 07, 2002 at 13:45 UTC
Very nice... I hope you can keep this running; it'll be very useful. How bad is the load from this?
I noticed it works for 3-letter words (my test was "map"), unlike Super Search. It would also work for 2-letter words, according to the instructions. But are there any "significant" 2-letter words that someone might want to search for? You can probably lighten your database if you exclude fluff like "to", "as", "do", "in", and "it", for instance...
Please note in the instructions that this is a "word search"; "grep match" doesn't find the node "Strange grep and matching behaviour....". The "Terms are split on spaces after non-word chars are stripped" comment probably means that, but it's not completely clear.
-- Mike
Re: Re: Offsite Perlmonks Search Engine by blakem (Monsignor) on Jul 07, 2002 at 14:39 UTC
Earlier this week, someone complained about not being able to search this site for 'AI', so some two-letter words are worth keeping. I do have a very short list of "stopwords" that I can tweak if need be. As for the load on the server... I have no idea... guess I'll find out. ;-)
I could make the "word search" behavior optional. The current matching (done in the SQL) is similar to /\b$term\b/, but it would be easy enough to let the user turn off those boundary assertions.
The "Terms are split on spaces after non-word chars are stripped" line is a roundabout way of saying that I'm ignoring quotes. Searching for `dogs cats and "perl 6"` gets broken down into five terms: `dogs, cats, and, perl, 6`. '6' gets tossed out because it's too short, and 'and' is one of the stopwords, so it is removed as well. That leaves us with `dogs, cats, perl` and a bunch of bad results. The underscore gives us an easy way out, à la `perl_6`.
Thanks for the feedback... I'll probably incorporate the optional "word search" feature in the next rev.
Update: A partial word matching option has now been implemented...
-Blake
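For anyone following along, here is a minimal sketch of the term handling described above. It is written in Python rather than lifted from the actual search engine, whose code isn't shown in this thread. The stopword list, the minimum length of 2, and the lowercasing are all assumptions; the thread only says that the list is "very short", that '6' is too short, and that 'AI' should be searchable. The names `split_terms` and `title_matches` are made up for illustration.

```python
import re

# Hypothetical values: the real stopword list and length cutoff are not given.
STOPWORDS = {"and", "the", "of"}
MIN_LENGTH = 2

def split_terms(query):
    # Strip non-word characters, then split on whitespace.
    # \w covers letters, digits, and the underscore, so quotes vanish
    # while perl_6 survives as a single term.
    cleaned = re.sub(r"[^\w\s]", "", query.lower())
    # Drop terms that are too short or that appear in the stopword list.
    return [t for t in cleaned.split()
            if len(t) >= MIN_LENGTH and t not in STOPWORDS]

def title_matches(term, title):
    # Whole-word match, similar in spirit to /\b$term\b/ (done in Python
    # here, whereas the real engine does its matching in SQL).
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, title, re.IGNORECASE) is not None
```

With these assumed settings, `split_terms('dogs cats and "perl 6"')` returns `['dogs', 'cats', 'perl']`, while `split_terms('perl_6')` keeps `['perl_6']` intact. The word-boundary match also explains why the term `match` does not find a title containing `matching`.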
Re: Re: Re: Offsite Perlmonks Search Engine by RMGir (Prior) on Jul 07, 2002 at 14:43 UTC
The problem is that if you turn off word search, you'll be doing something like "LIKE '%searchItem%'", right? I think that might make your load a lot worse... -- Mike
Re: Re: Re: Offsite Perlmonks Search Engine by Elian (Parson) on Jul 08, 2002 at 01:34 UTC
If you find that searching gets to be a performance bottleneck, one thing you can do is build custom indices for the pages in the database. You can build one index for each word in the text, and another for word pairs in the text. (This text, for example, would have the entries "if you", "you find", "find that", and so on.) Searching for phrases is then just a matter of splitting the phrase into pairs and searching for documents that match all the pairs. (It's generally good enough.) You might not have to do this, though: there are only 180K pages here, so full-text searches may very well not be performance bottlenecks at the moment.
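The word-pair idea above can be sketched roughly as follows. This is an in-memory Python illustration only, not anything the site or the offsite engine is known to use; a real version would live in database tables. The function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def words(text):
    # Lowercased runs of word characters, in order of appearance.
    return re.findall(r"\w+", text.lower())

def build_pair_index(docs):
    # Map each adjacent word pair to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        ws = words(text)
        for pair in zip(ws, ws[1:]):
            index[pair].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    # Split the phrase into pairs and keep documents that match all of them.
    ws = words(phrase)
    pairs = list(zip(ws, ws[1:]))
    if not pairs:
        return set()  # a single word needs the per-word index instead
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

# Hypothetical sample data.
docs = {
    1: "If you find that searching gets slow",
    2: "you find that it works",
    3: "find you that",
}
index = build_pair_index(docs)
```

Here `phrase_candidates(index, "you find that")` returns `{1, 2}`. A document that contains every pair somewhere, but not contiguously, would still be returned, which is the "generally good enough" trade-off mentioned above.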
Re: Offsite Perlmonks Search Engine by mojotoad (Monsignor) on Jul 08, 2002 at 01:18 UTC
Is it possible to include some sort of stemming hash to match conjugated word groupings, or classes of similar words? For example, searching for "parse" will not match "parsing", etc. I know it's extra overhead, but perhaps you could cache hash results from something like Lingua::Stem or Lingua::EN::Infinitive? rob_au sparked an interesting node on stemming earlier this year (Natural Language Index Stemming) that might be helpful as well. Thanks for the effort! Matt
Re: Re: Offsite Perlmonks Search Engine by Elian (Parson) on Jul 08, 2002 at 01:20 UTC
Stemming's a rather processor-intensive thing, and a pain to get right in general, but it could work on perlmonks, which has the advantage of a reasonably small data set and content pretty much exclusively in English. It'd be interesting to give it a shot, though.
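To make the cached-stemming suggestion concrete, here is a toy sketch in Python. The suffix rules below are a crude stand-in invented for illustration; they are not the algorithm used by Lingua::Stem or Lingua::EN::Infinitive, and a real stemmer handles far more cases. The point is only that caching each word's stem means the expensive step runs once per distinct word.

```python
from functools import lru_cache

# Toy suffix list, checked in order; a hypothetical stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s", "e")

@lru_cache(maxsize=None)
def stem(word):
    # Strip the first matching suffix, provided at least 3 letters remain.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems_match(term, title):
    # True if any word in the title shares a stem with the search term.
    return stem(term) in {stem(w) for w in title.split()}
```

With these rules, "parse", "parsing", and "parsed" all reduce to the same stem, so a search for "parse" would match a title containing "Parsing".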
Re: Offsite Perlmonks Search Engine by greywolf (Priest) on Jul 07, 2002 at 16:32 UTC
Good work! Hopefully this doesn't become too much of a hassle for you. It will be great to see this completed. Now I can easily find all those PDF nodes I have been looking for. mr greywolf
(tye)Re: Offsite Perlmonks Search Engine by tye (Sage) on Jul 07, 2002 at 17:36 UTC
Did you try PDF? Part of the More HTML escaping work got rid of the "full text search" for simple (title) searches. Right now it only does a strict "and" search on the words entered, but I've just figured out how to correct that without suffering from the "worst case" problems that helped prompt the move to "full text search". I hope to have the enhancements available in the next couple of weeks. - tye (but my friends call me "Tye")
Re:Re: Offsite Perlmonks Search Engine by greywolf (Priest) on Jul 08, 2002 at 17:49 UTC
Yes, I did search on PDF. But I was using Super Search, and I never thought of using the regular search, which gives me what I need. Sometimes I am such a dolt. mr greywolf
