Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Search Engine function

by Stamp_Guy (Monk)
on Jul 02, 2001 at 04:33 UTC ( [id://93083]=perlquestion: print w/replies, xml ) Need Help??

Stamp_Guy has asked for the wisdom of the Perl Monks concerning the following question:

Ever notice how a search engine returns a limited number of matches at a time (usually 10 or more)? This is really a vague question, I know, but what are some ways to do that?

Replies are listed 'Best First'.
Re: Search Engine function
by Masem (Monsignor) on Jul 02, 2001 at 04:49 UTC
    Some SQL engines have the LIMIT BY keywords, eg:
    SELECT * FROM really_big_table LIMIT BY 0,25
    Which only returns the first 25 elements from the select call. Subsequent pages can be obtained by increasing the parameters to LIMIT BY. I know Mysql has this feature, dunno on others.


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Search Engine function
by $code or die (Deacon) on Jul 02, 2001 at 04:52 UTC
Re: Search Engine function
by mattr (Curate) on Jul 02, 2001 at 10:52 UTC
    Large search engines also may presearch queries for single keywords and then combine those when you ask for multiple keywords. It is possible that those precompiled indices would be easy to get at when looking for results n to n+pagesize, but whether pagination goes on in an SQL engine or a search cache depends on the implementation.

    Supposedly if you had a data structure based on fixed page size, say ten results per tree limb in a tree-type structure, you could avoid walking nodes which represent pages before or after the page you want. But often the user is given an opportunity to change the number of results per page, or the engine operator may reduce the number of results returned per click so as to reduce server load.

    Much processing power is usually spent on building indices to the word database. The code which returns matches to a user just traverses those indices efficiently. Depending on the amount of data to be searched, you may need to have a lot more processing done offline, or even use a C or C++ backend. But for smaller size data stores, Perl's own regex is quite powerful even without creating an index at all. One thing to remember is that if you do paging, you need to be able to guarantee the order of the results will be the same each time, otherwise you could lose results when calculating the paging. Databases are not supposed to be in any particular order, so you need to do sorting (in SQL, ORDER BY).

    One problem you may want to consider if your data is any significant size, is memory/disk performance. Calling sort many times can slow you down, and slurping up tons of disk without smart use of indices is deadly (not as dangerous if you use even a Perl DBM). The biggest problem might be just getting too many hits internally, before you have selected which hits to display for the current page (assuming you have not run presearches like Google does). I believe (could be wrong) that most medium size engines do compile indices but don't actually do presearching and caching of queries.

    An SQL query could easily fill up your available memory, so if you were intending to just fetch everything into an array all at once you definitely need to LIMIT. One way to save memory (and maybe speed up your search too) if you are unable to limit the number of results easily might be to only search for unique row id numbers and then select a range out of those after sorting them. You might be able to write a complex SQL query that does so (I'm thinking of PL/SQL language), or possibly your engine might automatically optimize and do it in the most efficient way (seems MySQL might do that). Mysql seems like one of the fastest ways to index a lot of read-only data and if you can watch out for filling up memory it might be a good solution.

Re: Search Engine function
by voyager (Friar) on Jul 02, 2001 at 05:48 UTC
    It also helps to have the most (likely) relevant hits at the top of list, which means you'll need some kind of ranking system.

    This will vary on the type of data returned. You may be able to do a simple sort on a particular column if you are returning results from a dababase.

    If it is a more general search, you'll need to do a variety of things, including: more search hits means greater relevance; hits of important words (this is domain-specific) count more than other words; etc.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://93083]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2024-03-29 12:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found