Some SQL engines have the LIMIT keyword, e.g.:
SELECT * FROM really_big_table LIMIT 0,25
which returns only the first 25 rows from the SELECT
call. Subsequent pages can be obtained by increasing the offset argument to LIMIT. I know MySQL has this feature; I don't know about others.
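A minimal sketch of how the offset advances page by page (MySQL syntax; the table name is taken from the example above):

    -- page 1: rows 0..24
    SELECT * FROM really_big_table LIMIT 0,25;
    -- page 2: rows 25..49
    SELECT * FROM really_big_table LIMIT 25,25;
    -- page 3: rows 50..74
    SELECT * FROM really_big_table LIMIT 50,25;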
Dr. Michael K. Neylon - mneylon-pm@masemware.com
"You've left the lens cap of your mind on again, Pinky" - The Brain
Large search engines may also presearch queries for
single keywords and then combine those result sets when you ask
for multiple keywords. Those precompiled indices would presumably
be easy to get at when looking for results n to n+pagesize,
but whether pagination happens in an SQL engine or in a search
cache depends on the implementation.
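A minimal sketch of that combining step, assuming each keyword's precomputed results are already on hand as a list of document ids (the data structures here are hypothetical):

    # Hypothetical precomputed postings: keyword => matching doc ids
    # (each list is assumed to hold unique ids).
    my %postings = (
        perl   => [ 1, 3, 5, 8, 13 ],
        search => [ 2, 3, 5, 8, 21 ],
    );

    # Intersect the per-keyword lists to answer a multi-keyword query.
    sub combine {
        my @keywords = @_;
        my %count;
        for my $kw (@keywords) {
            $count{$_}++ for @{ $postings{$kw} || [] };
        }
        # Keep only ids that appeared in every keyword's list.
        return grep { $count{$_} == @keywords } sort { $a <=> $b } keys %count;
    }

    my @hits = combine('perl', 'search');    # (3, 5, 8)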
Supposedly, if you had a data structure built around a fixed
page size, say ten results per limb in a tree-type
structure, you could avoid walking the nodes that represent
pages before or after the one you want. But often
the user is given an opportunity to change the number of
results per page, or the engine operator may reduce the number
of results returned per click so as to reduce server load.
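A minimal sketch of that idea, with a flat list of fixed-size buckets standing in for the tree (all names are hypothetical):

    my $PER_PAGE    = 10;
    my @all_results = (1 .. 95);    # stand-in for 95 matching doc ids

    # Pre-bucket the results: $pages[$n] holds exactly page $n.
    my @pages;
    push @pages, [ splice @all_results, 0, $PER_PAGE ] while @all_results;

    # Fetching a page is a direct index; other buckets are never walked.
    sub fetch_page {
        my ($n) = @_;
        return @{ $pages[$n] || [] };
    }

    my @page = fetch_page(3);       # results 31..40

Of course, the pre-bucketing only pays off as long as $PER_PAGE stays fixed, which is exactly the catch mentioned above.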
Much processing power is usually spent on building indices
into the word database; the code that returns matches to
a user just traverses those indices efficiently. Depending
on the amount of data to be searched, you may need to
have a lot more processing done offline, or even use a C
or C++ backend. But for smaller data stores, Perl's
own regexes are quite powerful even without creating an index
at all. One thing to remember is that if you do paging, you
need to be able to guarantee that the order of the results will
be the same each time, otherwise you could lose results when
calculating the pages. Databases are not required to return rows
in any particular order, so you need to sort
(in SQL, ORDER BY).
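A minimal sketch of a stable paging query (table and column names are hypothetical; the unique id in the ORDER BY breaks ties so consecutive pages never overlap or drop rows):

    SELECT id, title
    FROM documents
    WHERE body LIKE '%perl%'
    ORDER BY score DESC, id ASC
    LIMIT 50,25;    -- page 3 at 25 rows per page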
One problem you may want to consider, if your data is of any
significant size, is memory/disk performance. Calling
sort many times can slow you down, and slurping up tons of
disk without smart use of indices is deadly (less dangerous
if you use even a Perl DBM). The biggest problem might
be just getting too many hits internally, before you have
selected which hits to display for the current page (assuming
you have not run presearches the way Google does). I believe
(I could be wrong) that most medium-size engines do compile
indices but don't actually presearch and cache
queries.
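A minimal sketch of the DBM approach, assuming a word-to-ids index has already been built offline (the file name and record format are hypothetical):

    use Fcntl;
    use DB_File;

    # Tie the on-disk index; only the records we ask for are read,
    # so the whole index never has to fit in memory.
    tie my %index, 'DB_File', 'word_index.db', O_RDONLY, 0644
        or die "Can't open index: $!";

    my @doc_ids = split /,/, ($index{'perl'} || '');   # e.g. "3,5,8"
    untie %index;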
An SQL query could easily fill up your available memory, so if you
were intending to fetch everything into an array all at once,
you definitely need to LIMIT. One way to
save memory (and maybe speed up your search too), if you
are unable to limit the number of results easily, might
be to select only the unique row id numbers and then select
a range out of those after sorting them. You might be
able to write a complex SQL query that does so (I'm thinking
of the PL/SQL language), or possibly
your engine might automatically optimize and do it in the
most efficient way (it seems MySQL might do that). MySQL seems
like one of the fastest ways to index a lot of read-only
data, and if you can watch out for filling up memory it might
be a good solution.
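A minimal sketch of that two-step approach with DBI (the DSN, table, and column names are all hypothetical):

    use DBI;

    my $dbh = DBI->connect('dbi:mysql:searchdb', 'user', 'pass',
                           { RaiseError => 1 });

    # Step 1: pull only the sorted ids -- far smaller than full rows.
    my $ids = $dbh->selectcol_arrayref(
        'SELECT id FROM documents WHERE body LIKE ? ORDER BY id',
        undef, '%perl%');

    # Step 2: slice out the current page, then fetch full rows for
    # just those ids.
    my ($page, $per_page) = (2, 25);
    my @page_ids = grep { defined }
                   @{$ids}[ $page * $per_page .. ($page + 1) * $per_page - 1 ];

    my $rows = [];
    if (@page_ids) {
        my $in = join ',', ('?') x @page_ids;
        $rows = $dbh->selectall_arrayref(
            "SELECT id, title FROM documents WHERE id IN ($in) ORDER BY id",
            undef, @page_ids);
    }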
It also helps to have the most likely relevant hits at the top of the list, which means you'll need some kind of ranking system.
This will vary with the type of data returned. You may be able to do a simple sort on a particular column if you are returning results from a database.
If it is a more general search, you'll need to do a variety of things, including: more search hits in a document means greater relevance; hits on important words (this is domain-specific) count more than hits on other words; etc.
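A minimal sketch of that kind of scoring, assuming per-word hit counts and a hand-made table of important words (all weights and names are hypothetical):

    # Domain-specific weights; words not listed count as 1.
    my %weight = ( perl => 3, regex => 2 );
    my @query  = ( 'perl', 'search' );
    my %docs   = (
        1 => 'perl perl search engine',
        2 => 'search search search',
    );

    # Score a document: every hit counts, important words count more.
    sub score {
        my ($text) = @_;
        my $score = 0;
        for my $word (@query) {
            my $hits = () = $text =~ /\b\Q$word\E\b/gi;  # occurrences
            $score += $hits * ($weight{$word} || 1);
        }
        return $score;
    }

    # Rank documents best-first.
    my @ranked = sort { score($docs{$b}) <=> score($docs{$a}) } keys %docs;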