Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Of Mysql, DBI and PK best practices

by jimbus (Friar)
on Apr 17, 2007 at 13:32 UTC ( #610518=perlquestion: print w/ replies, xml ) Need Help??
jimbus has asked for the wisdom of the Perl Monks concerning the following question:

I've been using a method suggested in the here early in my tenure at the forum, but recently I've been told it's bad, but I haven't been able to get a straight answer on what the "correct" way is...

Basically, I want to insert a record unless it violates the primary key. If it violates the PK, I want to update the row with the new information. To do this, I've been using something like:

eval { $insert_stmt->execute($pk,$what, $ever, $else) } if ($@) { $update_stmt->execute($what, $ever, $else, $pk) }

I understand that the eval traps any error, PK or otherwise and tries the update... I've even struggled with debugging malformed mysql statements until I remembered this, but I don't have a better way to do it.

This is all about processing log files, and I doubt I'm the only person in the world that has data for a primary key's record potentially coming more than one "processing session". Can someone help me with a best practice for this?

thanks,

--Jimbus aka Jim Babcock
Wireless Data Engineer and Geek Wannabe
jim-dot-babcock-at-usa-dot-com

Comment on Of Mysql, DBI and PK best practices
Download Code
Re: Of Mysql, DBI and PK best practices
by dragonchild (Archbishop) on Apr 17, 2007 at 13:42 UTC
    It sounds like you have more than one piece of information for a given PK. In that case, you should be using two tables. Think about it in terms of invoices and lineitems. You go buy 3 things from CDW. You get one invoice indicating the date of purchase, the payer, the shipto address, CDW's invoice ID, etc. That's a record in the invoices table. But, you need to indicate the three lineitems. So, you have a lineitems table which has a FK back to the invoices table.

    Same idea here.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Dragonchild,

      I was deliberately being 'data neutral' to keep focus on the mechanics... The PK is a date-time based on yyyy-mm-dd hh::00:00 and the information comes from multiple nodes from one system. In the end, I need a report on the usage of all of the nodes and the total for the system based on time. Additionally, log files are 'line count' based, so hours often overlap the files which are processed every 15 minutes.

      However, thanks for taking the time :)


      --Jimbus aka Jim Babcock
      Wireless Data Engineer and Geek Wannabe
      jim-dot-babcock-at-usa-dot-com
Re: Of Mysql, DBI and PK best practices
by derby (Abbot) on Apr 17, 2007 at 13:43 UTC

    Your approach is somewhat reasonable and I can think of a few ways to improve it but why do people say it's *bad*? I would do the following to improve the situation:

    • look at $@ and make a decision based on PK error or another error.
    • wrap the second execute in an eval also
    • if the second execute fails, write that to a log

    -derby
Re: Of Mysql, DBI and PK best practices
by herveus (Parson) on Apr 17, 2007 at 14:23 UTC
    Howdy!

    The alternative is to first select the PK from the table. If you get a result, then you do the update, otherwise do the insert. That means that errors will be real errors. That means you always execute two statements, but your normal condition is that there are no errors.

    yours,
    Michael

      If there's only ever one client, that's fine. Otherwise, this contains a race condition that the OP does not have. Imagine two clients select, both see nothing, both try to insert, one fails. At that point, the one that failed to insert should update, exactly like what the OP is already doing (but without the select).

      Update: It also occurs to me that if you're not concerned about a race condition, you could do something like this without the select. First try to update, then check to see how many rows were affected. If there weren't any, do the insert. As I say, this has a race condition also, but it will sometimes be one statement instead of always being two.

Re: Of Mysql, DBI and PK best practices
by RMGir (Prior) on Apr 17, 2007 at 14:31 UTC
    Note that you can combine any of the solutions suggested above with "%seenKey" hash, so you can at least know if _this instance_ of the loader has already seen a particular key and go directly to the update branch in that case.

    Mike
Re: Of Mysql, DBI and PK best practices
by jimbus (Friar) on Apr 17, 2007 at 14:32 UTC

    I parallel posted this to MySQL's forum and it was suggested that I was looking for the MySQL syntax "INSERT ... ON DUPLICATE KEY UPDATE".

    I'd never heard of such a thing and a brief google search showed it is probably what I'm looking for (I'm a pretty big believer that with reasonable usage, the db is usually going to be more efficient that anything I write). I spent hours searching Google for options on what I was seeing: "MySQL primary key violation on insert" and got no where.


    --Jimbus aka Jim Babcock
    Wireless Data Engineer and Geek Wannabe
    jim-dot-babcock-at-usa-dot-com
      Hi jimbus,

      I've used ON DUPLICATE KEY UPDATE in MySQL and it pretty much does exactly what it says on the tin. Seems to work well.

      A more database-neutral alternative is to select the key from the table, and if it doesn't exist then insert, otherwide update.

      You'll need to wrap this in a transaction to avoid the race condition kyle mentions above (unless you are the only process working on the db). In MySQL this means you either need to get a table lock, or switch your table type to InnoDB and do a row-level transaction as described on this page: InnoDB locking reads.

      By the way, ON DUPLICATE KEY UPDATE will lock the auto-increment table if you have an auto-increment field in your table, so this situation is one that can cause locking problems whichever solution you choose, if you have a lot of concurrent inserts/updates.
      (update: will only lock the auto-increment table if an insert actually takes place, I think - still, a potential problem)

      As I see it, the problem with the original solution (use insert, check for error, use insert if update didn't work) is, what if the error wasn't a duplicate key error? To do this properly, I think you'd need to check the error type.

      That way you could try again with an insert if the error was a duplicate key error, or if it was a different error then report the error and exit as usual. You'd need to ensure that RaiseError was switched off on your DBI handle.

      HTH!

      Best wishes, andye

Re: Of Mysql, DBI and PK best practices
by ptum (Priest) on Apr 17, 2007 at 14:36 UTC

    Personally, I tend to do two things differently:

    • First, I select a full list of primary keys and put it in a hash. Before I try to do an insert, I check to see if the primary key for that record exists in the hash. If it does, I do an update instead of an insert. If the primary key isn't in the hash, then I do an insert, and add it to the hash if the insert was successful. This seems to be a little cleaner and (potentially) halves my number of SQL executions.
    • Second, I tend to use the RaiseError option in my database connection so that my program doesn't die when DBI encounters an error. I examine $DBI::errstr to specifically identify the nature of the error (in Oracle, you get error codes) so that I cover my bases for the unexpected. There's no sense in attempting an update if the table doesn't exist or your SQL is malformed. I'm not a MySQL guy, but the use of eval and $@ seems less than precise -- you might react to what you think is a duplicate insert error but is really something much more severe.

    Update: As noted by kyle above, however, if this code is executed by multiple users/processes, you could end up with a race condition, such that your inserts might still fail because some other guy just added that record since the time that you selected all the primary keys.

      Not unless you do your work within a transaction.

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Of Mysql, DBI and PK best practices
by Errto (Vicar) on Apr 17, 2007 at 18:01 UTC
    It would appear that MySQL has native support for this, albeit in a non-standard way. I know that Oracle does as well through the MERGE statement. A more standard-respecting way to do it would be to try the update first and then, if the update returns no rows, try the insert, ie.
    my $rowcount = $update_stm->execute(@args); if ($rowcount == 0) { $insert_stm->execute(@other_args); }
Re: Of Mysql, DBI and PK best practices
by roboticus (Canon) on Apr 17, 2007 at 19:07 UTC
    jimbus:

    For the "general" solution, I'd suggest that you let the database engine handle it, so you can avoid race conditions, etc. You can give it to the database engine by writing an appropriate stored procedure, or a simple if statement. Most database engines I'm aware of will let you do something similar to:

    if exists (select * from table where PK='value') update table set col1=val1, col2=val2 where PK='value' else insert table (PK, col1, col2) values ('value', val1, val2)
    ...roboticus

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://610518]
Approved by Joost
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2014-10-02 10:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    What is your favourite meta-syntactic variable name?














    Results (54 votes), past polls