Database searches with persistent results via CGI

MrCromeDome has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Database searches with persistent results via CGI by JayBonci (Curate) on May 09, 2002 at 20:11 UTC
Good afternoon, MrCromeDome. It seems you are having the same sort of problem as do a lot of search engines... needing to save cycles on many repeat queries. There are a lot of approaches to it, and it really is going to depend on the result set size, what hardware you have it on, etc. I'm going to assume that you have mySql as a database engine, as these tips will hold true for many many DBMs out there. First off, consider this, what is the upper limit size on your result set. Are you looking at 100s, 1000s, more? Because at a certain point, these caching mechanisms are not really going to help you. You'll have to do some performance tweaking to see what size category the bulk your results end up. Secondly, consider how often your data changes? This is essential in determining when your cache should drop dead, and the next search should store new data. If you have rapidly changing data (a good amount entered new each day), consider a cached time of perhaps 4 hours, or do it on off cycles, such as overnight. In mySQL, whenever you request a large result set that has an "ORDER BY" statement in it, mySQL has to at least analyze the entire result set, then apply the limiter to it, to get the correct results. This can be quite tedious on the server, as you often pass over tons of results that you never pass back to the app. One way of caching is to get and hold all of the results that mySQL will come back to you with, and store them someplace fast to look them up in, such as another table. A proposed table would take the form: result_id = primary key int not null dropdead = datetime, a future time in which to kill the result searchstring = char(255) of search results. resultcache = blob / text (or largeblob, etc) Basically, you'd store the "SMITH JOHN" in the search string, and then form your result set in the resultcache, in some way that Storage caching mechanisms can take two forms. Binary column type, Storable, and freeze / thaw This method compresses the best, uncompresses fast, but is binary, and may be harder to debug. A text data type (see the blob link above), Data::Dumper, and Data::Dumper->Dump / eval to store and then revive the data. This approach will store fewer results, is not as fast other approaches, but is easier to debug. If you take a result set, and store it as an array, you can thaw (or eval), the entire result set and jump to a certain spot in the set. To look at this in psuedoperl: `my @list_of_cached_results; if($indatabase){ foreach my $result in ($search_result) { push @list_of_cached_results, result; } my $binary_cache = nfreeze(\@list_of_cached_results); #store lists of binary results } else { @list_of_cached_results = thaw($indatabase); } renderPage(@list_of_cached_results, $offset);` [download] Basically, if you use the perl persistance modules, and a database (such as the one you're already reading from), you can pick up a good amount of speed from picking up a pre-packaged set of results (a match to your search string), and then display it to the page, saving you the hell of finding all the results again. As long as you can manage results and not have them go "stale", then you'll be all set. Hope this made some sense to you. If I don't seem clear, I'd be happy to clarify anything in here. --jb	[reply] [d/l]
Re: Re: Database searches with persistent results via CGI by AidanLee (Chaplain) on May 10, 2002 at 13:04 UTC
Something to consider if drive space is an issue, and the entries returned by the search form all have unique ids is to serialize (by either of the methods JayBonci suggests) just the ids, rather than the entire data structure. I couldn't tell you up front, but it may either be Slower if it takes longer to deserlialize the ids and then call another select statement to get the actual records or faster, since ultimately you'd be transferring less data. First you'd only be transferring just the ids to your script, and selecting the range of ids for the current page of results. then you only have to transfer the full data for those specific records instead of for all results in the search. The other thing I'd suggest is keying the temporary table off an md5 hash or some such of the search parameters, rather than a session (for all the reasons perrin has mentioned). This way you'll still have unique ids but you won't have issues with multiple browsers (or even multiple computers...).	[reply]
Store results by search term, not session ID by perrin (Chancellor) on May 09, 2002 at 20:49 UTC
This is a mistake that many beginning web developers make. In fact, I have recently seen some outrageously paid Java consultants make the same mistake. Users can -- and do -- open multiple browser windows when working with your site. A session ID stored in a cookie is global to all browser windows. This means that if a user has two windows open with different searches going in them, paging trough the results will not work. One search will overwrite the other. The right way to cache results is to do it based on the actual search input. Previous posters discussed how to do it, so I'm just adding a why. Of course you will probably need to find a new search method soon if you're using a SQL "LIKE" clause to implement it, but at least caching will make paging through results reasonably fast.	[reply]
Re: Store results by search term, not session ID by MrCromeDome (Deacon) on May 09, 2002 at 21:21 UTC
If drive space is not an issue on the database server, is there harm in making a temp table in the following format? - search criteria - result column 1 - result column 2 . . . The reason I ask being that my search results will always have the same column names and ordering. And, as for storing the criteria in the table, I imagine I can consolidate it into some sort of string like: type=name;criteria1=this;criteria2=that;criteria3=blah then index that field, consolidate a query into that format, and see if I get a match in the database? If not, cache some more data? Thanks for the help thus far :) MrCromeDome	[reply]
Re: Re: Store results by search term, not session ID by perrin (Chancellor) on May 10, 2002 at 03:38 UTC
Hmmm... I had a reply node here and it seems to have been erased when I replied to ask below in the same thread. Very strange. Anyway, the gist of it was that you can use the table structure you suggested or serialize the whole result data structure into a blob column using Storable. Also, make if your search criteria are in a hash make sure you sort them before you use them as the result key. Hashes can return their keys in a different order, even when the keys are identical.	[reply]
Re: Store results by search term, not session ID by ask (Pilgrim) on May 10, 2002 at 03:11 UTC
I am amazed that noone else suggested that. Sites using session ids to make searches faster sucks. Blah! - ask -- ask bjoern hansen, http://ask.netcetera.dk/ !try; do();	[reply]
Re: Re: Store results by search term, not session ID by perrin (Chancellor) on May 09, 2002 at 21:47 UTC
Yeah, when these Java consultants did that in a search they built for the company I work for, I was amazed. I called them on their mistake and they first claimed it wasn't a problem and then claimed that it was our fault for not specifying in the contract that they had to support multiple browser windows! (The old "spec was bad" defense.) I think the problem in their case came from how much the Java servlet session interface looks like a cache. It's very tempting to just cache things in it rather than build your own proper cache. This is something to be wary of with Apache::Session too. (Incidentally, some of the other posters did mention the idea of using the search terms instead of the session. They just weren't as clear about why.)	[reply]
Re: Database searches with persistent results via CGI by dsheroh (Monsignor) on May 09, 2002 at 19:40 UTC
I can think of two options, offhand: 1) Since you're already going to be storing some session state (mapping the cookie to the temp table's name, if nothing else), just add the last search terms to the retained state. 2) Instead of saving the search terms with the session info, you could also store it with the temp table (maybe even using the search terms as part of the table name, with appropriate precautions). #1 is more straightforward, #2 allows you to reuse the search results for anyone else who does that same search (such as all the monks looking for SMITH JOHN right now). Either way, though, you probably want to maintain this on the server side instead of relying on the client to tell you which set of saved search results to use, since the client can lie to you quite easily.	[reply]
Re: Re: Database searches with persistent results via CGI by Anonymous Monk on May 10, 2002 at 13:50 UTC
I don't see any reason to restrict your cache to a specific user. Isn't it likely that a day when one person searches for "Yasser Arafat" someone else will too? Maybe not, but if this is a feature of your application how about this solution? Every search for a name can bre recorded. The first step when performing a new search is to check if the cache helps. Periodically, a cron job removes the least recently used results. This will work better if what you cache is not the result of a complex query - which is less likely to be reusable - but the result of an atomic search, with a single match criterion. It will be like having a partial index into your database.	[reply]
Re: Re: Re: Database searches with persistent results via CGI by mr_mischief (Monsignor) on May 10, 2002 at 18:26 UTC
This of course is great under certain circumstances. One master table to store the most recent searches, with a value of last searched time, replacing them in LRU fashion is a good solution for this. Each of the records in this table would then point to a results table that would be used as a cache. This would speed things wonderfully for the best case. However, if there is a high rate of insertion into the database, there needs to be a reasonably short timeout on the cache so that a search includes all or nearly all of the matches a fresh search would generate. There are ways around this, too, of course. If one keeps a table of recent insertions, such as the last 50, 500, 5000, or whatever number makes sense, then one can add those to the search easily enough if they match. One could also be more picky and redo the search if the search was done before the oldest of the rows in one's recent insertions table. This shouldn't require any changes to existing tables, and that's a Good Thing(tm). On a similar note, if one keeps with one's cached results table or with the table listing search strings the range over which the last search was made, and can prepare a new search to cover only any additional rows, then that is a major Win(sm). This also shouldn't require changes to existing tables. Christopher E. Stith Do not try to debug the program, for that is impossible. Instead, try only to relaize the truth. There is no bug. Then you will find that it is not the program that bends, but only the list of features.	[reply]
Re: Database searches with persistent results via CGI by Anonymous Monk on May 09, 2002 at 22:52 UTC
I have written a similar search to the one you describe, this is the approach I took. 1. build an indexer which takes all the words used in the search and put them into a table with these columns. word, record_id, word_count, word_in_title word == the word itself record_id == where the word points to word_count == frequency of word in document word_in_title == does word exist in title searching for terms than goes like this( notice that I only put the % at the end of the like so it will still use an index! ) select record_id, sum(word_count), max(word_in_title) from TABLE where word like '$word%' group by record_id if the user searches for multiple words than run the above statement multiple times and sum them up by record_id. Spit them back out by the max word_count. My example site: www.historychannel.com ( does no caching of results, still very quick )	[reply]
Re: Re: Database searches with persistent results via CGI by elwarren (Priest) on May 10, 2002 at 17:23 UTC
This is exactly what I was about to suggest, I love PerlMonks! I would take it one step further and build the table ahead of time. Put all the names in this table and run your search against it. This way you limit the amount of data you chug through inthe search and then just follow the pointer to the bigger records you want to return. So instead of searching 1.5m rows x 20 columns, you'll be searching .90m rows x 2 columns. You could do this with a foreign key or you could do a split($name) and store the results in the first column. This would allow you to remove the LIKE and replace it with a regular match. When a user searches for 'JOHN%' the query would look kind of like this: `select record_id from lookup where name = 'JOHN'` [download] and the table: `NAME RECORD_ID ---- --------- JOHN 1 SMITH 1 BILL 2 SMITH 2 JOHN 3 DEER 3 JOHN 4 DOE 4` [download] and your result set would be 1,3,4. Say the user wanted all of the John Smiths out of the db, your query would look like: `select record_id from lookup where name in ('JOHN', 'SMITH') group by record_id` [download] and your result would be 1, the group by only returns the one match because they are both the same record. Mucho faster than the big table. You can index both columsn and improve your speed again. If you're running Oracle then the name column is an ideal candidate for using a bitmap index. The record_id column would obviously perform better as your normal btree style index. Please please please do not create and drop temporary tables. If it's part of your design then make a table and insert and delete your records out of it. There is much more overhead in creating a table than there is inserting rows. Once you application is designed there should be no reason to be issuing DDL statements in production. Besides, we (the DBAs) hate going in and cleaning out temp tables because we need to figure why the developer left them all over and whether they still need them or not. If they're in your spec then they're exactly where they need to be. HTH!	[reply] [d/l] [select]
Re: Database searches with persistent results via CGI by joealba (Hermit) on May 10, 2002 at 00:52 UTC
You've got some good ideas in your solution, but you may want to attack the weakest part of your search first -- the name search. DBIx::TextIndex might be a bit difficult to implement, but it looks quite powerful. My only other suggestion (aside from the other wise monks' posts) is that you should store your saved search results all in one table with a key id, rather than separate tables for each search. If keyed properly, it should be just about as fast in searching -- and much easier to maintain/purge.	[reply]
Re: Re: Database searches with persistent results via CGI by Anonymous Monk on May 10, 2002 at 12:10 UTC
1.5M records is not that much data. Indeed you should attack the weakness of the problem which is the search. You probably need to make an index on the search columns though. Yes, indexes help with LIKE searches too if the database is sophisticated enough. PostgreSQL is an example of such an RDBMS. You may need to tune other DB parameters to give the backend more resources to work with. If the DB is swapping out to virtual memory alot that makes for a HUGE performance hit. With an RDBMS the goal is to have the table(s) cached in RAM, not read from disk. Perhaps a few more bars of memory and/or some configuration changes and a couple of well place indexes are all you need. You may want to look into the use of "OFFSET" and "LIMIT" clauses in your SQL statement. The backend still has to run the whole query, but you only get the rows you need back. If you are fetching the entire table in Perl just to output the last 25 rows you are wasting a LOT of CPU, let the DB do that for you because it's optimized to do so. If by chance you are on PostgreSQL . . . Have you done a "VACCUM ANALYZE" which allows the query optimizer to gather statistics on past searches which it uses to find better search paths?	[reply]
Re: Database searches with persistent results via CGI by samgold (Scribe) on May 10, 2002 at 20:16 UTC
It would really help to know what database you are using. At the company that I was working at, until I was laid-off :( , we were using an Oracle database and we were dealing with public records. I remember when we were designing a website we were having problems with the performace of doing a search on smith. We ended up using function based indexes and limiting the number of records that were returned. First we ran all of our names through a program that would CASS certify them. Then we used a function called stripstring() which would remove all spaces,', and - and upper case the new string. Then we would index that value using a function based index. When the end user entered the name John Smith we would convert that to JOHNSMITH and then use that value to query the database. We also limited the search to 200 records and it only took a few seconds to return the results. I hope this helps. Sam	[reply]
Re: Database searches with persistent results via CGI by MrCromeDome (Deacon) on May 10, 2002 at 22:26 UTC
Most excellent advice from everyone :) I love this place :) The site when originally developed started with MSSQL7, and has since migrated to MSSQL2K. While I wasn't too impressed with early incarnations of the product, I do have to admit that it's getting better. Would still love to try on Oracle sometime. . . but I digress. Following one suggestion, we tweaked the indexes on the master name table. We have been able to squeeze some more performance out of the initial name search, and are starting to look at the rest of your suggestions on how to best cache and redisplay the results of user queries. Currently, the data set grows once a week by several thousand names, but will likely be updated on a daily basis in the not-so-distant future. Being more forward-thinking this time, we're going to plan for daily updates NOW so something like this doesn't bite us in the rear again ;) Thank you all for the time and effort of responding to this. If my little bit of additional information adds to/changes any suggestions, I'd love to hear those as well. Respectfully (and thankfully!) yours, MrCromeDome	[reply]


Come for the quick hacks, stay for the epiphanies.
	PerlMonks