
Newest Super Search

by tye (Sage)
on Aug 02, 2002 at 21:21 UTC ( [id://187222] : monkdiscuss )

I've finally updated Super Search so it works well again. You can search for 1-, 2-, or 3-letter words again. I have plans for more improvements to it but it is pretty darn powerful right now.

A new feature, never offered before, is searching for replies only within certain sections.

The new search is very careful to not place a lot of load on the database server (a big problem with the previous two versions of Super Search) and to not take too long to return at least some results. So if the server is busy, you'll get your results returned in smaller batches and have to press "Search" again to continue. The more careful you are at composing your search, the more likely you'll get all of your matches in just one (or maybe two) submissions.
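
The batching behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Super Search code: the function name `search_batch`, the in-memory `nodes` list standing in for the database table, and the `time_budget` and `chunk_size` parameters are all assumptions made for the example.

```python
import time

def search_batch(nodes, term, start_id=0, time_budget=0.5,
                 chunk_size=1000, clock=time.monotonic):
    """Scan nodes with id >= start_id in small chunks until the time budget runs out.

    nodes: list of (node_id, text) tuples sorted by node_id (a stand-in for the table).
    Returns (matches, next_start_id); next_start_id is None once the scan is complete.
    """
    deadline = clock() + time_budget
    needle = term.lower()
    matches = []

    # Skip ahead to the first node at or after the requested starting ID.
    pos = 0
    while pos < len(nodes) and nodes[pos][0] < start_id:
        pos += 1

    while pos < len(nodes):
        # Each chunk is one short unit of work, so no single step runs for long.
        chunk = nodes[pos:pos + chunk_size]
        for node_id, text in chunk:
            if needle in text.lower():
                matches.append(node_id)
        pos += len(chunk)
        # Out of time with nodes left: hand back where the next request should resume.
        if pos < len(nodes) and clock() >= deadline:
            return matches, nodes[pos][0]

    return matches, None
```

In this sketch, pressing "Search" again corresponds to calling `search_batch` once more with `start_id` set to the value returned by the previous call, which is the role the "starting at node ID" field plays.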

I don't really consider it "finished". In particular, the "starting at node ID" field sucks. It is an important part of how the search is able to work, but the user needs a better interface to it. Still, it has reached a point where it is useful (and safe), so I'm releasing it into the wild.

I'd like to support searching "newest nodes first" but the MySQL optimizer will likely fight me on this.

I'd like to support searching users' Scratch Pads, but I'll wait to work on that until after the database changes related to Scratch Pads that will reduce server load and make a lot of other improvements easy (like an "edit" link right on your scratch pad, easier linking to scratch pads, a "d/l code" link for scratch pads, ...).

And some layout improvements and some help pages are also needed.

Have the appropriate amount of fun.</Larry>

Update: I had previously made PM Discussion nodes list before SoPW nodes in Newest Nodes because I was surprised at how often I was running into "regulars" who weren't aware of some recent site change that had been covered in a PM Discussion (and because there are a lot more SoPW nodes per day than PM Discussion nodes). But there haven't been any new PM Discussion nodes in the last few days, so this will likely be the first node where people notice this change. If you like, consider it an advanced XP whoring trick instead. ;)

        - tye (but my friends call me "Tye")

Replies are listed 'Best First'.
Re: Newest Super Search
by vladb (Vicar) on Aug 03, 2002 at 00:36 UTC
    For the uninitiated, could you tell me which Perl modules you are using to do the search and indexing of data? Whenever I have had to tackle similar search issues, I simply resorted to the DBIx::FullTextSearch module. It's a pretty versatile module and my thinking is it could even be used to implement PM search. This is only a suggestion, however. I'm also aware that for my 2 cents to be worth a dime, I'll have to show some sample code of how this module could be 'incorporated' into the PM engine. ;-)

    # Under Construction

      The only modules are DBI, CGI, and Everything. Most of the actual code is included below.

      This is a pretty specialized situation. For example, we can't allow any single query to run very long since MySQL was designed assuming some aspects of the threading model (light-weight processes) that aren't present on FreeBSD. So a single long-running query can nearly lock up access to the database for everyone.

Re: Newest Super Search
by hossman (Prior) on Aug 03, 2002 at 01:56 UTC
    Super Search seems almost too super now ... i'm definitely overwhelmed by all the options.

    the newest/oldest thing is definitely a little weird, especially since it seems like it always wants to start with node 1 ... even if you are starting with the newest (and even if you pick a high number, it still seems to count up)

    As for Discussions appearing at the top now, I've been wondering why people can't customize which order the sections appear in for themselves? (similar to customizing what order Nodelets appear in)

      Well, I tried to design it so that much of the time you can just fill in what you are searching for in that first box and then hit the button. But there needs to be some major work on how the results are presented after that.

      As I mentioned in the announcement, you can't search "newest first" yet. Those radio buttons are supposed to be "disabled" (they show greyed-out in my browser). I'll probably just remove them until (if) I get that part working. (And the "starting at node ID" field will probably disappear as well.)

      Yes, customizing order on Newest Nodes has been discussed and will probably happen at some point. (:

              - tye (but my friends call me "Tye")

        tye: Any idea if/when the 'Newest First' can/will be implemented? I find that I almost always want newest first, but I can't have it! Will it put too much load on the DB to stick the ORDER BY ??? DESC in there? If it will never be implemented, then at least take away the tantalizing 'greyed-out' Radio Buttons, so I don't have to constantly be reminded that my searches always come back in the wrong order...


Re: Newest Super Search
by dimmesdale (Friar) on Aug 03, 2002 at 19:57 UTC
    Very good.

    I have a suggestion, though. Would it be useful to have some special tag (say <keywords> or <index> or some such) to define a set of keywords to be used in indexing the node? Or maybe you could specify searching just that. I'm thinking of something like in (La)TeX where you can specify words/references to use in the index and it will auto-create one for you.

    What I'm getting at is something like the meta keywords/description allowed on web pages that search engines such as google might use.

    Or maybe there could be a set of certain keywords (e.g., a node might fall under Web -> CGI -> cookies), and then you might be able to have a separate page with links corresponding to those keywords (like a directory). And maybe if some people are interested, editors (or some other group) might have the ability to go around indexing nodes where the author didn't (past nodes, for example) so it's easier to search. (This would be something like the Google/Yahoo Directory pages.) The keywords would be pre-defined, but might be open to additions?

    Well, these are all just ideas I'm throwing out... maybe none of them are good? It just seems that if someone specifies what a node is about in a concrete way it will return better results than just hoping a certain word appears in it.

Re: Newest Super Search
by danichka (Hermit) on Aug 03, 2002 at 06:29 UTC
    How often are the User nodes cached? Just wondering because I search them when I get bored (which is pretty often).

    I am glad to see a checkbox for PMD too. I don't remember there being one on the old Super Search.

    use Your::Head;

      They aren't ever cached. The newest Super Search searches the live database directly. You can find anything no matter how recently it was added or changed.

              - tye (but my friends call me "Tye")
        Ok, I figured out what my problem was. Super Search will return results for things that are listed as comments in the HTML. Then I wouldn't see that text on the page and thought it was searching a cached version of the User pages. If I had actually taken the time to view the source before I would have realized this a while ago.

        use Your::Head;
Re: Newest Super Search
by tjh (Curate) on Aug 03, 2002 at 16:32 UTC
    This is excellent. Thank you!

    As an aside, I just searched for "IDE" (without quotes), " IDE " (sans quotes), and " IDE " (with quotes). Note the attempts to include leading/trailing spaces. In all cases the results included many (most) pages with no standalone IDE in them, but as word parts.

    I don't know anything about Everything :) so don't know what it looks like in the query, plus, as usual, I can get the data other ways (Google, different terms, etc.) but thought I'd point it out.

    Thanks again! It is very flexible and fast, much appreciated.

      To quote from Super Search:

      Match text containing [______________________________] (seperate strings with [__] -- default is spaces)
      and then, quoting from Super Search results after each of your above search attempts:

      where any text contains all of "IDE"

      where any text contains all of "IDE"

      where any text contains all of """, "IDE", """

      So separating search terms on space (the default) gives us search terms for each of your attempts of ('IDE'), ('', 'IDE', and ''), and ('"', 'IDE', and '"'). But, of course, we ignore empty search terms. That explains the results above. The code that does this is even posted in public[1]. (:

      So if you want to search on spaces, you have to make it so spaces don't separate search terms. In the second field just enter any character (or string) that doesn't appear in your single search string.

      There are no "special characters" when searching on strings. Backslashes, spaces, quotes, stars, parens, brackets, ampersands, etc. all just match what you type (case is ignored). You pick the separator you want. The default is space as people are used to listing several space-separated words. Searching for punctuation is important on a Perl site so the search doesn't tie up any characters as being "special" other than the delimiter string you choose.
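
      One way to get "no special characters" behaviour from a SQL substring match is to escape the LIKE wildcards before building the pattern. The sketch below is an assumption about how that could be done, not the actual Super Search query: it uses Python with an in-memory SQLite database rather than MySQL, and the `node` table, `doctext` column, and the helper names `like_literal` and `find` are made up for the example.

```python
import sqlite3

def like_literal(term, escape="\\"):
    """Escape LIKE wildcards so the term matches only itself, then wrap it for substring search."""
    # Escape the escape character first, then the two LIKE wildcards.
    for ch in (escape, "%", "_"):
        term = term.replace(ch, escape + ch)
    return "%" + term + "%"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (node_id INTEGER PRIMARY KEY, doctext TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?)", [
    (1, "print 100% of $_ here"),
    (2, "print 100 times"),
    (3, "my $x = $a_b;"),
])

def find(term):
    """Return the IDs of nodes whose text contains the term literally (ASCII case is ignored)."""
    rows = conn.execute(
        "SELECT node_id FROM node WHERE doctext LIKE ? ESCAPE '\\' ORDER BY node_id",
        (like_literal(term),),
    )
    return [row[0] for row in rows]
```

      Without the escaping step, a search for `$_` would treat `_` as "any one character" and also match `$x` and `$a`, which is exactly the kind of surprise a punctuation-heavy Perl site wants to avoid.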

      As I've already said, some help pages need to be added.

      I also hope to add support for some limited regular expressions. MySQL supports regular expressions and I had initially just assumed that they were rather minimal regular expressions that would be quick to run (this being a database server and all), but they are full-blown regular expressions from Henry Spencer. This means that entering a (perhaps intentionally) "bad" regular expression could burn a lot of SQL daemon CPU time (the strings being matched against are node contents which can be quite long leaving lots of room for pathological backtracking to get way out of hand).

      So to support regular expressions I'm going to have to pick a subset of regular expression features (or perhaps some other wildcarding scheme such as what glob uses, ...) that I can translate to MySQL regular expressions so that I can be sure that no one can enter a "pattern" that locks up the site.
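
      As a rough illustration of the "restricted subset" idea, here is a sketch in Python that accepts only two glob-style wildcards and treats every other character as a literal. It is not the planned implementation: the function name `glob_to_regex`, the `max_wildcards` limit, and the choice of `*` and `?` are assumptions, and the output targets Python's `re` syntax, so a real translator would have to emit MySQL's regex dialect instead.

```python
import re

def glob_to_regex(pattern, max_wildcards=5):
    """Translate a tiny glob dialect ('*' and '?') into a regex; all other characters are literal."""
    out = []
    wildcards = 0
    for ch in pattern:
        if ch == "*":
            wildcards += 1
            # Collapse runs of '*' so "a***b" does not stack quantifiers.
            if not out or out[-1] != ".*":
                out.append(".*")
        elif ch == "?":
            wildcards += 1
            out.append(".")
        else:
            # Regex metacharacters typed by the user are neutralised here.
            out.append(re.escape(ch))
    # Cap the number of wildcards to bound how much backtracking a pattern can cause.
    if wildcards > max_wildcards:
        raise ValueError("too many wildcards in pattern")
    return "".join(out)
```

      Because the user never supplies raw regex syntax, nested quantifiers such as `(a+)+$` arrive as plain text to match rather than as a pathological pattern. The wildcard cap limits the cost but is not a proof that every accepted pattern is cheap.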

      I don't know anything about Everything :) so don't know what it looks like in the query
      But that is easy enough to find out. Simply "view source" on the results page and search for "SELECT" and there you'll see (one of) the queries spelled out in full SQL. :)

      (Minor updates applied.)

              - tye (but my friends call me "Tye")

      [1] Leading and trailing spaces are stripped from the separator string (but not from the search string). And an empty separator string is interpreted as being a single space instead. This is because spaces don't really "show" on the web page.

      So you can't set your separator to be, for example ", " nor " + ", but you can search on those strings so long as your separator string doesn't conflict.
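
      The splitting rules described in this reply can be sketched as below. This is a Python illustration of the stated behaviour only (the real code is Perl and is not reproduced here); the function names `split_terms` and `matches_all` are invented for the example.

```python
def split_terms(search, separator=" "):
    """Split a search string into terms using the rules described above."""
    # Leading/trailing spaces are stripped from the separator, not from the search string.
    sep = separator.strip()
    # An empty separator is treated as a single space.
    if sep == "":
        sep = " "
    # Empty terms are ignored.
    return [term for term in search.split(sep) if term != ""]

def matches_all(text, terms):
    """True if the text contains every term as a substring, ignoring case."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)
```

      With the default separator, `split_terms(" IDE ")` yields just `["IDE"]`, which matches inside "consider". With a separator such as `|`, the single term `" IDE "` keeps its spaces and only matches the standalone word.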

        Perfect. I changed the separator to something else and searched for the <space>IDE<space> string and got quick and accurate responses.

        Thanks some more.

Re: Newest Super Search
by blakem (Monsignor) on Oct 12, 2002 at 08:56 UTC
    I didn't find this documented anywhere, so....

    In Super Search you can specify authors by their homenode id using the id://83485 form. This is the only way I found to distinguish between io and I0.
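
    A tiny sketch of how an author field could tell the two forms apart, in Python. The function name `parse_author` and the returned tuples are hypothetical, not how Super Search actually handles the field:

```python
import re

def parse_author(spec):
    """Classify an author field as a homenode ID (id://NNN) or a plain user name."""
    spec = spec.strip()
    match = re.fullmatch(r"id://(\d+)", spec)
    if match:
        # A numeric ID is unambiguous even when two names look alike on screen.
        return ("id", int(match.group(1)))
    return ("name", spec)
```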


      The former is eye oh and the latter is eye zero.

              - tye (not to be confused with Τу℮)
Re: Newest Super Search
by schumi (Hermit) on Aug 15, 2002 at 09:40 UTC
    I bow before thy work!

    I have a humble question, though. I just searched for something, got the first results, decided they looked promising, but wanted to go on searching. So I pressed "Next". For the next two search-parts (legs? attempts? How do you call these smaller batches?) I didn't get any results, only for the third and last attempt. These results looked promising as well, but - my initial results had gone.

    Is there an easy way in which the results of the part-searches can be kept? If not, one could always help oneself by opening the results in new browser-windows/-tabs before hitting "Next" again.


    There are nights when the wolves are silent and only the moon howls. - George Carlin