PerlMonks  

Offsite Perlmonks Search Engine

by blakem (Monsignor)
on Jul 07, 2002 at 11:53 UTC ( #179958=monkdiscuss )

I've put together an experimental perlmonks search engine that is now up and running. It searches node titles, not node contents, and currently only contains data from the past few months. Even with those limitations, I hope it will help fill the gap we currently have in searching this site.

Depending on how hard it hits my server, I may or may not leave it up permanently.

As always, feedback is more than welcome.

So, what are you waiting for? Go try it out.

-Blake

Re: Offsite Perlmonks Search Engine
by RMGir (Prior) on Jul 07, 2002 at 13:45 UTC
    Very nice...

    I hope you can keep this running, it'll be very useful.

    How bad is the load from this? I noticed it works for 3 letter words (my test was "map"), unlike Super Search.

    It would also work for 2 letter words, according to the instructions. But are there any "significant" 2 letter words that someone might want to search for? You can probably lighten your database if you exclude fluff like "to", "as", "do", "in", and "it", for instance...

    Please note in the instructions that this is a "word search"; "grep match" doesn't find "Strange grep and matching behaviour"....

    The "Terms are split on spaces after non-word chars are stripped" comment probably means that, but it's not completely clear.
    --
    Mike

      Earlier this week, someone complained about not being able to search this site for 'AI' so some two letter words are worth keeping. I do have a very short list of "stopwords" that I can tweak if need be. As far as load to the server... I have no idea... guess I'll find out. ;-)

      I could have the "word search" behavior be optional. The current matching (done in the SQL) is similar to /\b$term\b/ but it would be easy enough to let the user turn off those boundary assertions.
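A rough sketch of the difference between the two modes, written in Python for illustration rather than in the SQL the search engine actually uses. The function name `title_matches`, the `whole_words` flag, the case-insensitive comparison, and the requirement that every term must match are all assumptions, not details taken from the real implementation.

```python
import re

def title_matches(title, terms, whole_words=True):
    """Return True if every term occurs in the title.

    With whole_words=True each term is wrapped in \\b assertions,
    similar in spirit to /\\b$term\\b/. With whole_words=False the
    term may match anywhere inside a longer word.
    """
    for term in terms:
        pattern = re.escape(term)  # treat the term as literal text
        if whole_words:
            pattern = r"\b" + pattern + r"\b"
        if not re.search(pattern, title, flags=re.IGNORECASE):
            return False
    return True

title = "Strange grep and matching behaviour"
print(title_matches(title, ["grep", "match"]))                     # whole-word mode
print(title_matches(title, ["grep", "match"], whole_words=False))  # partial-word mode
```

In whole-word mode "match" does not hit "matching", which is the behaviour Mike ran into above; dropping the boundary assertions lets it through.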

      The "Terms are split on spaces after non-word chars are stripped" comment is a roundabout way of saying that I'm ignoring quotes. Searching for dogs cats and "perl 6" gets broken down into five terms: dogs, cats, and, perl, 6. The '6' gets tossed out because it's too short, and 'and' is one of the stop words, so it is removed as well. That leaves us with dogs, cats, perl and a bunch of bad results. The underscore gives us an easy way out, a la perl_6.
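The term handling described above could look roughly like the following Python sketch. It is an illustration only: the name `parse_query`, the lowercasing step, the minimum length of 2, and every stopword other than "and" are assumptions, since the thread only confirms that "and" is a stopword, that one-character terms are dropped, and that two-letter terms are kept.

```python
import re

# Hypothetical stopword list; only "and" is confirmed in the thread.
STOPWORDS = {"and", "the", "of"}
MIN_LENGTH = 2  # assumption: '6' is too short, two-letter words like 'AI' are kept

def parse_query(query):
    """Strip non-word characters, split on whitespace, drop short terms and stopwords."""
    # \w covers letters, digits and the underscore, so perl_6 survives intact,
    # while quotes and other punctuation are removed.
    cleaned = re.sub(r"[^\w\s]", "", query)
    terms = cleaned.lower().split()
    return [t for t in terms if len(t) >= MIN_LENGTH and t not in STOPWORDS]

print(parse_query('dogs cats and "perl 6"'))  # quoted phrase falls apart
print(parse_query("dogs cats perl_6"))        # underscore keeps the pair together
```

Because the underscore counts as a word character, perl_6 passes through as a single term instead of being split into perl and a discarded 6.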

      Thanks for the feedback... I'll probably incorporate the optional "word search" feature in the next rev.

      Update: A partial word matching option has now been implemented...

      -Blake

        The problem is that if you turn off word search, you'll be doing something like "LIKE '%searchItem%'", right? I think that might make your load a lot worse...
        --
        Mike
        If you find that searching gets to be a performance bottleneck, one thing you can do is build custom indices for the pages in the database. You can build one index for each word in the text, and another for word pairs in the text. (This paragraph, for example, would have entries for "if you", "you find", "find that" and so on.) Searching for a phrase is then just a matter of splitting the phrase into pairs and searching for documents that match all the pairs. (It's generally good enough.)

        You might not have to do this--there are only 180K pages here, so full-text searches may very well not be a performance bottleneck at the moment.
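A minimal in-memory sketch of the word-pair index described above, in Python. The function names, the plain whitespace tokenizer, and the use of a dict of sets instead of database tables are all assumptions made for illustration; a real version would live in the database.

```python
from collections import defaultdict

def words(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_pair_index(docs):
    """Map each adjacent word pair to the set of document ids containing it.

    docs is a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        w = words(text)
        for pair in zip(w, w[1:]):
            index[pair].add(doc_id)
    return index

def search_phrase(index, phrase):
    """Return ids of documents that contain every adjacent pair of the phrase."""
    w = words(phrase)
    pairs = list(zip(w, w[1:]))
    if not pairs:
        return set()  # a single word has no pairs; use the one-word index instead
    return set.intersection(*(index.get(p, set()) for p in pairs))

docs = {
    1: "if you find that searching is slow",
    2: "find that you",
    3: "find that you find",
}
index = build_pair_index(docs)
print(search_phrase(index, "you find that"))
```

Document 3 is returned even though it never contains the exact phrase, because it holds both "you find" and "find that" somewhere in its text. That is the sense in which the approach is only "generally good enough"; an exact check on the candidate documents would weed out such cases.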

Re: Offsite Perlmonks Search Engine
by greywolf (Priest) on Jul 07, 2002 at 16:32 UTC
    Good work! Hopefully this doesn't become too much of a hassle for you. It will be great to see this completed. Now I can easily find all those PDF nodes I have been looking for.

    mr greywolf

      Did you try PDF? Part of the More HTML escaping work got rid of the "full text search" for simple (title) searches. Right now it only does a strict "and" search on the words entered, but I've just figured out how to correct that without suffering from the "worst case" problems that helped prompt the move to "full text search". I hope to have the enhancements available in the next couple of weeks.

              - tye (but my friends call me "Tye")
        Yes I did search on PDF. But I was using Super Search and I never thought of using the regular search, which gives me what I need. Sometimes I am such a dolt.

        mr greywolf
Re: Offsite Perlmonks Search Engine
by mojotoad (Monsignor) on Jul 08, 2002 at 01:18 UTC
    Is it possible to include some sort of stemming hash to match conjugated word forms, or classes of similar words?

    For example, searching for "parse" will not match "parsing", etc.

    I know it's extra overhead, but perhaps cached hash results from something like Lingua::Stem or Lingua::EN::Infinitive?

    rob_au sparked an interesting node on stemming earlier this year (Natural Language Index Stemming) that might be helpful as well.

    Thanks for the effort!

    Matt

      Stemming's a rather processor-intensive thing, and a pain to get right in general, but it could work on perlmonks, which has the advantage of a reasonably small data set and content pretty much exclusively in English. It'd be interesting to give it a shot, though.
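To make the cached-stemming idea concrete, here is a toy Python sketch. The suffix list and the `crude_stem` function are hypothetical and deliberately naive; this is not the algorithm used by Lingua::Stem or any real stemmer, and it will mis-stem plenty of English words. The point is only to show how cached stems let "parse" and "parsing" meet on a common key.

```python
from functools import lru_cache

# Toy rule set, checked in order; a real stemmer is far more careful.
SUFFIXES = ("ing", "ed", "es", "s", "e")

@lru_cache(maxsize=None)  # plays the role of the cached hash of stem results
def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def same_stem(a, b):
    """True if both words reduce to the same crude stem."""
    return crude_stem(a) == crude_stem(b)

print(crude_stem("parse"), crude_stem("parsing"))
print(same_stem("parse", "parsing"))
```

Stemming both the indexed title words and the query terms once, and caching the results, keeps the per-search cost down to a lookup.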

Node Type: monkdiscuss [id://179958]
Approved by rob_au
Front-paged by hsmyers