
Re^3: Running SuperSearch off a fast full-text index.

by creamygoodness (Curate)
on Jun 11, 2007 at 17:44 UTC [id://620544]


in reply to Re^2: Running SuperSearch off a fast full-text index.
in thread Running SuperSearch off a fast full-text index.

KinoSearch 0.20's RangeFilters are mostly implemented in C and are optimized for low cost over multiple searches.

The first time you search with a sort or range constraint on a particular field, there is a hit as a cache has to be loaded. The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI) and can keep the Searcher object around for reuse.
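
Under mod_perl, for example, the Searcher can be stashed in a variable that outlives the request, so each child pays the cache-loading cost once. A minimal sketch, using a hypothetical MyApp::Search package and a placeholder index path (the exact constructor arguments vary between the 0.1x and 0.20 APIs, so adjust for the release you run):

    package MyApp::Search;
    use strict;
    use warnings;
    use KinoSearch::Searcher;

    # One Searcher per child process, built lazily on first use and
    # then reused, so field caches are loaded once per child rather
    # than once per request.
    my $searcher;

    sub searcher {
        $searcher ||= KinoSearch::Searcher->new(
            invindex => '/path/to/invindex',    # placeholder path
        );
        return $searcher;
    }

    1;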

Once the cache is loaded, RangeFilter is extremely fast. There's an initial burst of disk activity while the numerical bounds are located, then the rest is all fetching values from the cache and if (locus < lower_bound) style C integer comparisons -- no matter how many docs match. There's hardly any overhead beyond what's required to match the rest of the query.
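
Roughly, usage looks like this -- the 'date' field and the bounds are made up for illustration, and the parameter names should be double-checked against the RangeFilter docs for the release you're running:

    use KinoSearch::Search::RangeFilter;

    # Constrain matches to docs whose 'date' field falls inside the range.
    my $range_filter = KinoSearch::Search::RangeFilter->new(
        field         => 'date',
        lower_term    => '2007-01-01',
        upper_term    => '2007-06-11',
        include_lower => 1,
        include_upper => 1,
    );

    # The filter is passed alongside the query; it restricts the result
    # set without contributing to scoring.
    my $hits = $searcher->search(
        query  => 'supersearch',
        filter => $range_filter,
    );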

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Re^4: Running SuperSearch off a fast full-text index.
by clinton (Priest) on Jun 11, 2007 at 17:53 UTC
    The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI)

    Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?

    And how do you handle cache updates across all the child processes (whether they're on the same machine or on different machines)?

    thanks

    Clint

      Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?

      Yes, and KinoSearch is not thread safe. The memory requirements can be significant for large indexes, even though the data structures are not Perl's and attempts have been made to keep things compact.

      And how do you handle cache updates across all the child processes (whether they're on the same machine or on different machines)?

      A Searcher instance represents a snapshot of the index in time. Until you manually reload by creating a new Searcher, changes to the index are not visible.

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
        So maybe a reasonable solution would be:
        • a separate mod_perl search server, which takes search requests from the web server and returns (e.g.) an XML or SOAP list of IDs
        • each child process checks (e.g.) a last_cache_update file once a minute to decide whether or not to reload the caches (sketched below)
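
        Something like this per-child check might do it (just a sketch; the touch-file path, index path, and Searcher constructor arguments are placeholders):

            use strict;
            use warnings;
            use KinoSearch::Searcher;

            my ( $searcher, $built_at );
            my $touch_file     = '/path/to/last_cache_update';    # placeholder
            my $check_interval = 60;    # seconds between mtime checks
            my $last_check     = 0;

            sub searcher {
                my $now = time;
                if ( !$searcher or $now - $last_check > $check_interval ) {
                    $last_check = $now;
                    my $mtime = ( stat $touch_file )[9] || 0;
                    # Rebuild the Searcher (and re-pay the cache load) only
                    # when the index has actually been updated.
                    if ( !$searcher or $mtime > $built_at ) {
                        $searcher = KinoSearch::Searcher->new(
                            invindex => '/path/to/invindex',    # placeholder
                        );
                        $built_at = $now;
                    }
                }
                return $searcher;
            }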

        Clint
