Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Use Placeholders. For SECURITY and (sometimes) for PERFORMANCE

by jZed (Prior)
on Nov 14, 2003 at 20:37 UTC ( #307205=perlmeditation: print w/replies, xml ) Need Help??

If you have not already read Use Placeholders for SECURITY, read that first, it is a more important point. But if you do serious database work, you should probably consider my point below *after* fully implementing the SECURITY stuff.

Many (but not all) RDBMSs support pre-parsing and/or pre-optimizing of SQL statements. That means that the RDBMS has to do a certain amount of work every time it prepares a statement. In situations where you are going to execute the same statement with varying values, you can (sometimes immensely) increase performance with a prepare-once, execute-many approach. In all cases, read the DBD and RDBMS docs on placeholders and performance for your DBD/RDBMS, but in general here's an example of how to use placeholders for performance:

RIGHT: my $sth = $dbh->prepare( "UPDATE foo SET bar=7 WHERE baz=?" ); for my $val(@values) { $sth->execute($val); } WRONG: for my $val(@values) { $val = $dbh->quote($val); my $sth = $dbh->do( "UPDATE foo SET bar=7 WHERE baz=$val" ); }
The WRONG code does a prepare() and an execute() i.e. do() every time through the loop. The RIGHT code does the prepare() once at the top of the loop, and then does an execute() each time through the loop.

Replies are listed 'Best First'.
Re: Use Placeholders. For SECURITY and (sometimes) for PERFORMANCE
by etcshadow (Priest) on Nov 14, 2003 at 21:33 UTC
    Yes, but it can drastically degrade performance (in very rare circumstances which I'll describe below).

    Don't get me wrong, the default position should always be to use binds. Not only for the reasons that have been mentioned by others, but also because it can improve database performance in other, less direct ways. For example (speaking from experience with Oracle, in particular here, but I imagine that other RDBMSs have similar issues): even if you do not cache your statement handles, or use prepare_cached (and there are very good reasons not to do so, sadly), the database server, itself, caches various information based on the sql text. When it reencounters the same (textually) sql that it has seen before, it just looks up things like the parse-tree and the execution path, instead of recomputing them. Thus, if you bind your sql, and even if you throw away the statement handle... the database server doesn't, so not all of the prepare step is repeated. (Of course, some of the work is repeated, so it's still better to reuse your statement handles if you can.)

    Anyway, I promised an example of drastically degraded performance with bound variables, and here it is: cost-based optimization. Say you've got a table ("thing") with a column called "type", and a million rows. Of those million rows, 50 are "type='special'" and all of the rest are "type='general'. You also have an index on thing.type. Now, run the query:

    select * from thing where type=?
    What is the proper execution plan? Should you use the index or not? Who knows? It depends on the value of the bind-parameter. If its value is 'general' then, hell no, you should not use the index. If its value is 'special', then hell yes, you should use the index. In fact, if it is anything other than 'general' you should. A CBO with histogram statistics would know the proper plan, if the value weren't bound.

    Anyway, that's an overly simplistic example, but the thing is: this can and does happen. Just this week, I was optimizing a query that executed in about a minute with all of its values unbound, but took essentially an infinite amount of time when the values were bound. Same exact query.

    Truth is, yes, start from the position of binding all of your literal values, but serious DB tuners can, and do, find situations where bind values can bite you in the ass.

    Not an editor command: Wq
      Oh, and I forgot to mention, under the "other, less direct ways" that binds improve performance...

      Because of the fact that the RDBMS maintains a cache by sql-text... the more distinct sql that you pipe into your database, the bigger this cache has to be. (In oracle, you can investigate this cache by querying the v$sqlarea system view.) As this cache grows bigger, its hash buckets spill over, it consumes valuable shared pool memory, and various other Bad Things occur.

      Many people new to database development (and not using bound values), eventually hit a wall where their database seems to fall apart because of these problems. This is perhaps one of the least well understood causes of RDBMSs failing to scale.

      Anyway, just one more piece of fuel for the "use binds" fire.

      Not an editor command: Wq
      There are usually ways to trick the RDBMS query optimizer into doing the "right" thing, even with border case queries like the one you show. I don't know about Oracle, but for Sybase the optimizer will use a generic weighting when it doesn't know the SARGS (search arguments) or when the SARGS are the result of functions or local variables (as is the case for bound variables via placeholders). If the index histogram is reasonable then the generated query plan is probably going to be quite reasonable - after all it's the same thing that happens when using stored procedures, and they work really well (with Sybase).


      PS - we're drifting quite a long way from perl here...

        ...for Sybase the optimizer will use a generic weighting when it doesn't know the SARGS...

        Well, yes, and that's exactly the problem. Generic weighting simply tells you that you have a highly-unselective index, and, thus, the index will never be used.

        This is actually a common case with columns like "status" on task-tracking or ticket-tracking systems. If you've got a database with a million tasks, and nearly all of them are "COMPLETED", but the ones that you are typically interested in are "OPEN" (a small minority of the total population), then, bang, it's the exact issue I showed. You RDBMS will say "I don't know what value they want... but I know that the average value occupies half of the table. Accessing half of a table by index is a bad idea, so I'll full-table-scan."

        And, yeah, we are getting away from perl somewhat, but the original post was really about database performance tuning, so I don't feel so bad answering in kind. Also, there are a lot of "grey areas" around perl that folks on this site deal with a lot: CGI, HTML, shell/command-line, file-system, and databases. If perl is commonly used as a glue between these pieces... the line starts to blur a little bit with respect to what is fair game for discussion.

        Not an editor command: Wq
        In Oracle
        analyze table <table name> compute statistics for all indexed columns +size 254;
        with generate a histogram of weightings for each range of values that are in the indexed columns. Parse trees can be generated on a case by case basis. In the latest versions you can "fake" the stats to get the optimizer to behave properly.


      Since the query plan is done at prepare time, queries like this can cause a problem:
      select * from foo where bar = ? and baz like ?
      If there are two indexes, one on bar, and one on baz, which should be used? If baz is '%' (nothing but a wildcard), then it would be the wrong index to use, but if it's 'some long prefix%' then it's probably a good index to use, but there's no way to tell at prepare time. Trick: if you know baz is always the wrong index in this case but the database picks that index anyway (and your database does not support hints), you can fool it into not using the index (and hopefully using the correct one) by doing something like this (example also includes how to not use an index on a numeric column):
      select * from foo where bar = ? and baz || '' like ? and some_number + 0 = ?
        Yeah, there are lots of good examples... I was just trying to come up with the very simplest query that would demonstrate the problem.

        Honestly, single-table queries aren't likely to be all that problematic, no matter what. You do a full table-scan on a million-row table... well, that's 5 seconds you wish you hadn't spent, but it's not gonna totally kill you, unless it is being hit with great frequency. It's really when you get into large, complicated plans involving lots of joins and various other nasties, that tuning gets to be really critical (except, of course, for very frequently accessed queries).

        Somebody could write a whole book on database tuning... oh, wait... several people already have :-P

        Not an editor command: Wq
Re: Use Placeholders. For SECURITY and (sometimes) for PERFORMANCE
by hardburn (Abbot) on Nov 14, 2003 at 21:12 UTC

    Don't forget about prepare_cached(). This can really help under mod_perl.

    This code will do little more than suck up more memory:

    for my $val(@values) { $val = $dbh->quote($val); my $sth = $dbh->prepare_cached( "UPDATE foo SET bar=7 WHERE baz=$val" ); $sth->execute(); }

    This code takes up a little more memory, but also avoids having to re-prepare what is really the same statement with different input:

    for my $val(@values) { my $sth = $dbh->prepare_cached( "UPDATE foo SET bar=7 WHERE baz=?" ); $sth->execute($val); }

    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      Errm... if I may - that's not a very good example of using prepare_cached(): just moving the prepare() out from the for loop and using placeholders would have worked fine (and yes, I'm fairly sure that you know this - just pointing this out for others reading this thread).

      prepare_cached() is really for the situation where you are likely to call a particular query more than once, possibly from different parts of your program, but not sequentially. In that situation DBI and the RDBMS will keep a copy of the query (and its query plan) on hand and re-use it when it is requested. You should keep in mind when using it that prepare_cached() will consume resources on the database server, because it will keep all of the queries that each client requests on hand/in cache until the clients disconnect. In some cases these query plans can be shared between clients (i.e. two different clients executing the same query), but not always (in particular I don't think that Sybase and/or MS-SQL will share query plans for prepared queries that use placeholders).

      These are all items that need to be kept in mind - as with most things that pertain to database tuning/optimization the advisability of using a particular solution "depends" on the local circumstances.


        that's not a very good example of using prepare_cached(): . . .

        That's why I mentioned mod_perl, which, when combined with Apache::DBI, will allow you to keep prepare_cached() statements around until Apache is shut down.

        I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
        -- Schemer

        : () { :|:& };:

        Note: All code is untested, unless otherwise stated

Re: Use Placeholders. For SECURITY and (sometimes) for PERFORMANCE
by pg (Canon) on Nov 14, 2003 at 21:14 UTC

    A very good point, and something that everyone serious about database performance should know.

    A prepared statement is the same as a piece of compiled code. At the time you prepare it, it gets compiled and cached. This is much faster than execute a SQL statement on fly.

    Have said this, I would like to mention another trick related to "compiled" stuffs in database. Many times, you can improve performance of queries against big tables, by creating a view on top of the table. By doing this, the view get "compiled". Later instead of querying the table, you just query the view.

    It sounds like nothing, and only difference is that you moved some of you where clauses from your query statement to the create view statement. But actually there is a big difference, as at the time you create the view, your where clauses get "compiled" as part of the create view statement.

    I recently had an application's execution time cut 96%, by creating view, along with other tunings.

    Database (or poor database design and tuning) is one of the major performance bottle necks you see from time to time nowadays.

Re: Use Placeholders. For SECURITY and (sometimes) for PERFORMANCE
by Coruscate (Sexton) on Nov 15, 2003 at 05:42 UTC

    Just a TIMTOWTDI thought. I do not have time to benchmark now, but will get to it later if nobody does it before I get the chance. The "right" way example presented by jZed prepares the statement once then executes multiple times for each value. In such situations, I tend to do this another way. I form a statement that will do all changes in one blow, then run that through. My method of doing it the "right" way is as such:

    my $sta = 'UPDATE foo SET bar=? WHERE '; $sta .= (join ' OR ', ('baz=?') x @values); my $sth = $dbh->prepare($sta); $sth->execute(7, @values);

      Making fewer database calls when possible is good. You can also do (if there are not too many @values):
      my $sta = 'UPDATE foo SET bar = ? WHERE baz in (", join(',', "?" x @va +lues), ")";
      But I think jZed was just offering a generic example, and on these sorts of updates, I think you often want to set 'bar' equal to something different for each 'baz', so the multiple execution version is more appropriate in that case.
      This can be a good trick... but it is not generally applicable... since it won't work, in general, for multiple inserts. I say "in general" because there are cases where you can do things like:
      insert into table1 (val1,val2,val3) select val1,val2,val3 from table2 where id in (?,?,?,?,...)
      but... again, that will only be an available option some of the time.

      Not an editor command: Wq
      That might work reasonably well for a fairly limited number of arguments. If your @values array is large (500+, say), then hitting the database 500 times may be faster than having a query with 500 ORs in the WHERE clause, because the optmizer is going to have a hell of a time trying to figure out the best access path, and because there may be a limit to the number of placeholders that a single statement can have (AFAIK for Sybase this limit is 256, btw).


Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://307205]
Approved by Elian
Front-paged by Courage
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (9)
As of 2023-03-22 10:29 GMT
Find Nodes?
    Voting Booth?
    Which type of climate do you prefer to live in?

    Results (60 votes). Check out past polls.