Beefy Boxes and Bandwidth Generously Provided by pair Networks vroom
Just another Perl shrine
 
PerlMonks  

Database searches with persistent results via CGI

by MrCromeDome (Deacon)
on May 09, 2002 at 19:23 UTC ( #165472=perlquestion: print w/ replies, xml ) Need Help??
MrCromeDome has asked for the wisdom of the Perl Monks concerning the following question:

I've got myself into a bit of a predicament ;) A web search that I wrote some time ago works fine in most instances, but my name searches are taking an eternity (as I allow for some flexibility in searching, I'm forced to use an SQL LIKE statement). The results are paginated, so subsequent searches on the same criteria must perform the same search for each page of data displayed. Obviously, this is not terribly efficent, and has become less so with 1.5 million names now in the database. I would like to remedy this.

Here's what I have cooked up currently:

Currently, there are no logins or stateful mechanisms in place. I figure that if I introduce state, I can then build temp tables in the database with the results of certain queries. While the awful name searches will still be awful, they'll only have to be performed once, and the code that displays the results can work off of the temp table.

The session table I have devised looks something like this:
- session_id
- last_updated
- result_temp_table

I figured the following process may work over the course of a typical session:
- User fills in search criteria, hits submit.
- If we have a cookie for this already, make sure the session still exists. If either of these fail, generate a new session ID, cookie, and table entry.
- If the named temp table exists, and the search criteria hasn't changed from the last search, display results from the temp table.
- If the search criteria has changed, drop the temp table (if it exists). Create a new table with a new result set.

Sessions would be deleted every 24 hours or so.

The other part of my problem is with how the search was designed. The first time a search is performed, only a form is shown. When criteria is given and the user presses submit, the criteria are submitted to the same script. . . the first page of results is shown, followed by the search form again (with the criteria filled back in). At the bottom of the result table are links (not buttons) to each of the following pages. They pass certain parameters back to the query search to perform a "fake" submit of sorts. (to get an idea of what I'm talking about, go to http://www.stclaircountyrecorder.com, click Begin Search, and do a name search for SMITH JOHN).

In order to make this all work, I need to have some way of determining if the search criteria has changed from the last time the script was run. Seeing as how I already have a fake submit mechanism in place, I figured the easiest way of accomplishing this is to tack an extra parameter on to the page navigation links. The parameter would be absent if the user pressed the submit button, and a new temp table would be created.

My question to you all is then: does this sound reasonable? If not, what have I missed in my design? While it all seems sound to me, I'm still a relative newcomer to web development, and this is the first time I've needed to introduce state into one of my projects. While I normally believe in trying different things until I find the best way of doing it, I'm a bit short on time this time around, and don't think I'll have the time to re-engineer things if I make a critical mistake early on.

As always, your help and insights are appreciated. Thanks in advance,
MrCromeDome

Comment on Database searches with persistent results via CGI
Re: Database searches with persistent results via CGI
by dsheroh (Parson) on May 09, 2002 at 19:40 UTC
    I can think of two options, offhand:

    1) Since you're already going to be storing some session state (mapping the cookie to the temp table's name, if nothing else), just add the last search terms to the retained state.

    2) Instead of saving the search terms with the session info, you could also store it with the temp table (maybe even using the search terms as part of the table name, with appropriate precautions).

    #1 is more straightforward, #2 allows you to reuse the search results for anyone else who does that same search (such as all the monks looking for SMITH JOHN right now). Either way, though, you probably want to maintain this on the server side instead of relying on the client to tell you which set of saved search results to use, since the client can lie to you quite easily.

      I don't see any reason to restrict your cache to a specific user. Isn't it likely that a day when one person searches for "Yasser Arafat" someone else will too? Maybe not, but if this is a feature of your application how about this solution? Every search for a name can bre recorded. The first step when performing a new search is to check if the cache helps. Periodically, a cron job removes the least recently used results. This will work better if what you cache is not the result of a complex query - which is less likely to be reusable - but the result of an atomic search, with a single match criterion. It will be like having a partial index into your database.
        This of course is great under certain circumstances. One master table to store the most recent searches, with a value of last searched time, replacing them in LRU fashion is a good solution for this. Each of the records in this table would then point to a results table that would be used as a cache. This would speed things wonderfully for the best case.

        However, if there is a high rate of insertion into the database, there needs to be a reasonably short timeout on the cache so that a search includes all or nearly all of the matches a fresh search would generate. There are ways around this, too, of course.

        If one keeps a table of recent insertions, such as the last 50, 500, 5000, or whatever number makes sense, then one can add those to the search easily enough if they match. One could also be more picky and redo the search if the search was done before the oldest of the rows in one's recent insertions table. This shouldn't require any changes to existing tables, and that's a Good Thing(tm).

        On a similar note, if one keeps with one's cached results table or with the table listing search strings the range over which the last search was made, and can prepare a new search to cover only any additional rows, then that is a major Win(sm). This also shouldn't require changes to existing tables.

        Christopher E. Stith
        Do not try to debug the program, for that is impossible. Instead, try only to relaize the truth. There is no bug. Then you will find that it is not the program that bends, but only the list of features.
Re: Database searches with persistent results via CGI
by JayBonci (Curate) on May 09, 2002 at 20:11 UTC
    Good afternoon, MrCromeDome. It seems you are having the same sort of problem as do a lot of search engines... needing to save cycles on many repeat queries. There are a lot of approaches to it, and it really is going to depend on the result set size, what hardware you have it on, etc. I'm going to assume that you have mySql as a database engine, as these tips will hold true for many many DBMs out there.

    1. First off, consider this, what is the upper limit size on your result set. Are you looking at 100s, 1000s, more? Because at a certain point, these caching mechanisms are not really going to help you. You'll have to do some performance tweaking to see what size category the bulk your results end up.
    2. Secondly, consider how often your data changes? This is essential in determining when your cache should drop dead, and the next search should store new data. If you have rapidly changing data (a good amount entered new each day), consider a cached time of perhaps 4 hours, or do it on off cycles, such as overnight.
    In mySQL, whenever you request a large result set that has an "ORDER BY" statement in it, mySQL has to at least analyze the entire result set, then apply the limiter to it, to get the correct results. This can be quite tedious on the server, as you often pass over tons of results that you never pass back to the app. One way of caching is to get and hold all of the results that mySQL will come back to you with, and store them someplace fast to look them up in, such as another table.

    A proposed table would take the form:
    • result_id = primary key int not null
    • dropdead = datetime, a future time in which to kill the result
    • searchstring = char(255) of search results.
    • resultcache = blob / text (or largeblob, etc)
    Basically, you'd store the "SMITH JOHN" in the search string, and then form your result set in the resultcache, in some way that

    Storage caching mechanisms can take two forms.
    • Binary column type, Storable, and freeze / thaw This method compresses the best, uncompresses fast, but is binary, and may be harder to debug.
    • A text data type (see the blob link above), Data::Dumper, and Data::Dumper->Dump / eval to store and then revive the data. This approach will store fewer results, is not as fast other approaches, but is easier to debug.
    If you take a result set, and store it as an array, you can thaw (or eval), the entire result set and jump to a certain spot in the set. To look at this in psuedoperl:
    my @list_of_cached_results; if($indatabase){ foreach my $result in ($search_result) { push @list_of_cached_results, result; } my $binary_cache = nfreeze(\@list_of_cached_results); #store lists of binary results } else { @list_of_cached_results = thaw($indatabase); } renderPage(@list_of_cached_results, $offset);
    Basically, if you use the perl persistance modules, and a database (such as the one you're already reading from), you can pick up a good amount of speed from picking up a pre-packaged set of results (a match to your search string), and then display it to the page, saving you the hell of finding all the results again. As long as you can manage results and not have them go "stale", then you'll be all set.

    Hope this made some sense to you. If I don't seem clear, I'd be happy to clarify anything in here.

        --jb

      Something to consider if drive space is an issue, and the entries returned by the search form all have unique ids is to serialize (by either of the methods JayBonci suggests) just the ids, rather than the entire data structure. I couldn't tell you up front, but it may either be

      • Slower if it takes longer to deserlialize the ids and then call another select statement to get the actual records
      • or faster, since ultimately you'd be transferring less data. First you'd only be transferring just the ids to your script, and selecting the range of ids for the current page of results. then you only have to transfer the full data for those specific records instead of for all results in the search.

      The other thing I'd suggest is keying the temporary table off an md5 hash or some such of the search parameters, rather than a session (for all the reasons perrin has mentioned). This way you'll still have unique ids but you won't have issues with multiple browsers (or even multiple computers...).

Store results by search term, not session ID
by perrin (Chancellor) on May 09, 2002 at 20:49 UTC
    This is a mistake that many beginning web developers make. In fact, I have recently seen some outrageously paid Java consultants make the same mistake.

    Users can -- and do -- open multiple browser windows when working with your site. A session ID stored in a cookie is global to all browser windows. This means that if a user has two windows open with different searches going in them, paging trough the results will not work. One search will overwrite the other.

    The right way to cache results is to do it based on the actual search input. Previous posters discussed how to do it, so I'm just adding a why.

    Of course you will probably need to find a new search method soon if you're using a SQL "LIKE" clause to implement it, but at least caching will make paging through results reasonably fast.

      If drive space is not an issue on the database server, is there harm in making a temp table in the following format?

      - search criteria
      - result column 1
      - result column 2
      . . .

      The reason I ask being that my search results will always have the same column names and ordering.

      And, as for storing the criteria in the table, I imagine I can consolidate it into some sort of string like:
      type=name;criteria1=this;criteria2=that;criteria3=blah

      then index that field, consolidate a query into that format, and see if I get a match in the database? If not, cache some more data?

      Thanks for the help thus far :)
      MrCromeDome

        Hmmm... I had a reply node here and it seems to have been erased when I replied to ask below in the same thread. Very strange.

        Anyway, the gist of it was that you can use the table structure you suggested or serialize the whole result data structure into a blob column using Storable.

        Also, make if your search criteria are in a hash make sure you sort them before you use them as the result key. Hashes can return their keys in a different order, even when the keys are identical.

      I am amazed that noone else suggested that. Sites using session ids to make searches faster sucks. Blah!

       - ask

      -- 
      ask bjoern hansen, http://ask.netcetera.dk/   !try; do();
      
        Yeah, when these Java consultants did that in a search they built for the company I work for, I was amazed. I called them on their mistake and they first claimed it wasn't a problem and then claimed that it was our fault for not specifying in the contract that they had to support multiple browser windows! (The old "spec was bad" defense.)

        I think the problem in their case came from how much the Java servlet session interface looks like a cache. It's very tempting to just cache things in it rather than build your own proper cache. This is something to be wary of with Apache::Session too.

        (Incidentally, some of the other posters did mention the idea of using the search terms instead of the session. They just weren't as clear about why.)

Re: Database searches with persistent results via CGI
by Anonymous Monk on May 09, 2002 at 22:52 UTC
    I have written a similar search to the one you describe, this is the approach I took.

    1. build an indexer which takes all the words used in the search and put them into a table with these columns.

    word, record_id, word_count, word_in_title

    word == the word itself
    record_id == where the word points to
    word_count == frequency of word in document
    word_in_title == does word exist in title

    searching for terms than goes like this( notice that I only put the % at the end of the like so it will still use an index! )

    select record_id, sum(word_count), max(word_in_title) from TABLE where word like '$word%' group by record_id

    if the user searches for multiple words than run the above statement multiple times and sum them up by record_id.

    Spit them back out by the max word_count.

    My example site: www.historychannel.com ( does no caching of results, still very quick )

      This is exactly what I was about to suggest, I love PerlMonks!

      I would take it one step further and build the table ahead of time. Put all the names in this table and run your search against it. This way you limit the amount of data you chug through inthe search and then just follow the pointer to the bigger records you want to return. So instead of searching 1.5m rows x 20 columns, you'll be searching .90m rows x 2 columns.

      You could do this with a foreign key or you could do a split($name) and store the results in the first column. This would allow you to remove the LIKE and replace it with a regular match. When a user searches for 'JOHN%' the query would look kind of like this:
      select record_id from lookup where name = 'JOHN'
      and the table:
      NAME RECORD_ID ---- --------- JOHN 1 SMITH 1 BILL 2 SMITH 2 JOHN 3 DEER 3 JOHN 4 DOE 4
      and your result set would be 1,3,4. Say the user wanted all of the John Smiths out of the db, your query would look like:
      select record_id from lookup where name in ('JOHN', 'SMITH') group by record_id
      and your result would be 1, the group by only returns the one match because they are both the same record.

      Mucho faster than the big table. You can index both columsn and improve your speed again. If you're running Oracle then the name column is an ideal candidate for using a bitmap index. The record_id column would obviously perform better as your normal btree style index.

      Please please please do not create and drop temporary tables. If it's part of your design then make a table and insert and delete your records out of it. There is much more overhead in creating a table than there is inserting rows. Once you application is designed there should be no reason to be issuing DDL statements in production. Besides, we (the DBAs) hate going in and cleaning out temp tables because we need to figure why the developer left them all over and whether they still need them or not. If they're in your spec then they're exactly where they need to be.

      HTH!
Re: Database searches with persistent results via CGI
by joealba (Hermit) on May 10, 2002 at 00:52 UTC
    You've got some good ideas in your solution, but you may want to attack the weakest part of your search first -- the name search. DBIx::TextIndex might be a bit difficult to implement, but it looks quite powerful.

    My only other suggestion (aside from the other wise monks' posts) is that you should store your saved search results all in one table with a key id, rather than separate tables for each search. If keyed properly, it should be just about as fast in searching -- and much easier to maintain/purge.
      1.5M records is not *that* much data. Indeed you should attack the weakness of the problem which is the search.

      You probably need to make an index on the search columns though. Yes, indexes help with LIKE searches too if the database is sophisticated enough. PostgreSQL is an example of such an RDBMS. You may need to tune other DB parameters to give the backend more resources to work with. If the DB is swapping out to virtual memory alot that makes for a HUGE performance hit. With an RDBMS the goal is to have the table(s) cached in RAM, not read from disk. Perhaps a few more bars of memory and/or some configuration changes and a couple of well place indexes are all you need.

      You may want to look into the use of "OFFSET" and "LIMIT" clauses in your SQL statement. The backend still has to run the whole query, but you only get the rows you need back. If you are fetching the entire table in Perl just to output the last 25 rows you are wasting a LOT of CPU, let the DB do that for you because it's optimized to do so.

      If by chance you are on PostgreSQL . . . Have you done a "VACCUM ANALYZE" which allows the query optimizer to gather statistics on past searches which it uses to find better search paths?

Re: Database searches with persistent results via CGI
by samgold (Scribe) on May 10, 2002 at 20:16 UTC
    It would really help to know what database you are using.
    At the company that I was working at, until I was laid-off :( , we were using an Oracle database and we were dealing with public records. I remember when we were designing a website we were having problems with the performace of doing a search on smith. We ended up using function based indexes and limiting the number of records that were returned. First we ran all of our names through a program that would CASS certify them. Then we used a function called stripstring() which would remove all spaces,', and - and upper case the new string. Then we would index that value using a function based index. When the end user entered the name John Smith we would convert that to JOHNSMITH and then use that value to query the database. We also limited the search to 200 records and it only took a few seconds to return the results. I hope this helps.

    Sam
Re: Database searches with persistent results via CGI
by MrCromeDome (Deacon) on May 10, 2002 at 22:26 UTC
    Most excellent advice from everyone :) I love this place :)

    The site when originally developed started with MSSQL7, and has since migrated to MSSQL2K. While I wasn't too impressed with early incarnations of the product, I do have to admit that it's getting better. Would still love to try on Oracle sometime. . . but I digress.

    Following one suggestion, we tweaked the indexes on the master name table. We have been able to squeeze some more performance out of the initial name search, and are starting to look at the rest of your suggestions on how to best cache and redisplay the results of user queries.

    Currently, the data set grows once a week by several thousand names, but will likely be updated on a daily basis in the not-so-distant future. Being more forward-thinking this time, we're going to plan for daily updates NOW so something like this doesn't bite us in the rear again ;)

    Thank you all for the time and effort of responding to this. If my little bit of additional information adds to/changes any suggestions, I'd love to hear those as well.

    Respectfully (and thankfully!) yours,
    MrCromeDome

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://165472]
Approved by derby
Front-paged by VSarkiss
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2014-04-17 03:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (439 votes), past polls