Re: Offsite Perlmonks Search Engine
by RMGir (Prior) on Jul 07, 2002 at 13:45 UTC
|
Very nice...
I hope you can keep this running, it'll be very useful.
How bad is the load from this? I noticed it works for 3 letter words (my test was "map"), unlike Super Search.
It would also work for 2 letter words, according to the instructions. But are there any "significant" 2 letter words that someone might want to search for? You can probably lighten your database if you exclude fluff like "to", "as", "do", "in", and "it", for instance...
Please note in the instructions that this is a "word search": searching for "grep match" doesn't find the node "Strange grep and matching behaviour", since "match" is not a whole word in that title.
The "Terms are split on spaces after non-word chars are stripped" comment probably means that, but it's not completely clear.
--
Mike
|
Earlier this week, someone complained about not being able to search this site for 'AI' so some two letter words are worth keeping.
I do have a very short list of "stopwords" that I can tweak if need be. As far as load to the server... I have no idea... guess I'll find out. ;-)
I could have the "word search" behavior be optional. The current matching (done in the SQL) is similar to /\b$term\b/ but it would be easy enough to let the user turn off those boundary assertions.
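A minimal sketch of the two matching modes, written in Python rather than the SQL the site actually uses. The function names `word_match` and `partial_match` are hypothetical, and case-insensitive matching is an assumption:

```python
import re

def word_match(term, text):
    # Whole-word match, analogous to /\b$term\b/.
    # Case-insensitivity is an assumption, not something the thread states.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def partial_match(term, text):
    # Same search with the boundary assertions turned off (substring match).
    return re.search(re.escape(term), text, re.IGNORECASE) is not None

title = "Strange grep and matching behaviour"
print(word_match("grep", title))      # True
print(word_match("match", title))     # False: "matching" is a longer word
print(partial_match("match", title))  # True
```

With boundaries on, "match" misses "matching"; with them off, it hits, at the cost of also hitting every other word that merely contains the term.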
The "Terms are split on spaces after non-word chars are stripped" comment is a roundabout way of saying that I'm ignoring quotes. Searching for
dogs cats and "perl 6"
will get broken down into five terms: dogs, cats, and, perl, 6. '6' gets tossed out because it's too short, and 'and' is one of the stopwords, so it is removed as well. That leaves us with dogs, cats, perl and a bunch of bad results. The underscore gives us an easy way out, à la perl_6.
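The term handling described here can be sketched roughly as follows. This is Python for illustration, not the search engine's real code: the stopword list (apart from "and"), the minimum length of 2, and the lowercasing are all assumptions, and `search_terms` is a hypothetical name.

```python
import re

# Hypothetical stopword list; only "and" is confirmed by the thread.
STOPWORDS = {"and", "the", "of"}
# Assumed minimum: '6' is dropped as too short, two-letter terms like 'AI' are kept.
MIN_LENGTH = 2

def search_terms(query):
    # Strip non-word characters (anything but letters, digits, underscore),
    # keeping whitespace so the query can still be split on spaces.
    cleaned = re.sub(r"[^\w\s]", "", query)
    terms = cleaned.lower().split()  # lowercasing is an assumption
    return [t for t in terms if len(t) >= MIN_LENGTH and t not in STOPWORDS]

print(search_terms('dogs cats and "perl 6"'))  # ['dogs', 'cats', 'perl']
print(search_terms("dogs cats and perl_6"))    # ['dogs', 'cats', 'perl_6']
```

Because the underscore counts as a word character, perl_6 survives the stripping step as a single term.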
Thanks for the feedback... I'll probably incorporate the optional "word search" feature in the next rev.
Update: A partial word matching option has now been implemented...
-Blake
|
The problem is that if you turn off word search, you'll be doing something like "LIKE '%searchItem%'", right? I think that might make your load a lot worse...
--
Mike
|
Re: Offsite Perlmonks Search Engine
by mojotoad (Monsignor) on Jul 08, 2002 at 01:18 UTC
|
Is it possible to include some sort of stemming hash to match conjugated forms of a word, or classes of similar words?
For example, searching for "parse" will not match "parsing", etc.
I know it's extra overhead, but perhaps cached hash results from something like Lingua::Stem or Lingua::EN::Infinitive?
rob_au sparked an interesting node on stemming earlier this year (Natural Language Index Stemming) that might be helpful as well.
Thanks for the effort!
Matt
|
Stemming's a rather processor-intensive thing, and a pain to get right in general, but it could work on perlmonks, which has the advantage of a reasonably small data set and content pretty much exclusively in English. It'd be interesting to give it a shot, though.
Re: Offsite Perlmonks Search Engine
by greywolf (Priest) on Jul 07, 2002 at 16:32 UTC
|
Good work! Hopefully this doesn't become too much of a hassle for you. It will be great to see this completed. Now I can easily find all those PDF nodes I have been looking for.
mr greywolf
|
Did you try PDF? Part of the "More HTML escaping" work got rid of the "full text search" for simple (title) searches. Right now it only does a strict "and" search on the words entered, but I've just figured out how to correct that without suffering from the "worst case" problems that helped prompt the move to "full text search". I hope to have the enhancements available in the next couple of weeks.
- tye (but my friends call me "Tye")
|
Yes I did search on PDF. But I was using Super Search and I never thought of using the regular search, which gives me what I need. Sometimes I am such a dolt.
mr greywolf