DBI vs Bulk Loading

by jimbus (Friar)
on Sep 14, 2005 at 21:02 UTC
jimbus has asked for the wisdom of the Perl Monks concerning the following question:

Brethren,

Here's something I posted at forums.mysql.com. It's more of a mysql thing, but I am scripting it in perl and you guys are infinitely more helpful than they are and a lot cooler, too. :)

I'm processing log files in perl and I'm using the timestamp as the primary key, but the files are segregated by another field, so timestamps can potentially be spread over more than one file... Originally, I was using the DBI interface to query on the timestamp: if it didn't exist, I inserted it; if it did, I summed the data and updated the record. That worked, but was too slow.
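
Roughly, the per-record logic looks something like this (the stats table and its column names here are just stand-ins for my real schema):

    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'secret',
                           { RaiseError => 1 });

    my $sel = $dbh->prepare('SELECT op100, op700, total FROM stats WHERE tstamp = ?');
    my $ins = $dbh->prepare('INSERT INTO stats (tstamp, op100, op700, total) VALUES (?, ?, ?, ?)');
    my $upd = $dbh->prepare('UPDATE stats SET op100 = ?, op700 = ?, total = ? WHERE tstamp = ?');

    # one digested record per timestamp from the current log file
    my @records = ( { tstamp => '2005-09-14 21:00:00', op100 => 3, op700 => 1, total => 4 } );

    for my $rec (@records) {
        $sel->execute($rec->{tstamp});
        if (my $row = $sel->fetchrow_hashref) {          # key already there: sum and update
            $upd->execute($row->{op100} + $rec->{op100},
                          $row->{op700} + $rec->{op700},
                          $row->{total} + $rec->{total},
                          $rec->{tstamp});
        }
        else {                                           # new key: plain insert
            $ins->execute(@{$rec}{qw(tstamp op100 op700 total)});
        }
        $sel->finish;
    }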

So I googled a bit on MySQL tuning and performance and found that the fastest way to insert, speed-wise, is to write the digested data to a CSV file and bulk load it. What I need help with is how to duplicate my PK-violation logic with this method.

One thought I had was that I could write the file, run the bulk load, have it place error-raising records in another file (I believe it will do that), and use my perl logic to process the second, significantly smaller file, but that seems like a hack... so I thought I would post and ask if anyone had a more elegant solution.

Thanks,

Jimbus

Never moon a werewolf!

Re: DBI vs Bulk Loading
by LanceDeeply (Chaplain) on Sep 14, 2005 at 22:03 UTC
    how much of your process is updating records vs inserting new records?

    if it's mostly updates:
  • select out all the data into your perl program
  • sum logfiles against data structure in memory by timestamp key
  • write summed values to csv file
  • empty the table (delete all rows)
  • bulk load csv file

    if it's mostly inserting new records:
  • select out all the data into your perl program
  • sum logfiles against in memory data structure by timestamp key
  • run updates for timestamps that exist
  • write summed values to csv file for new timestamps
  • bulk load csv file

    I don't know what kind of volume you are processing, but if the full table is too much for your process to handle, you can slice it up by day/hour. A rough sketch of the "mostly inserts" path is below.
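
    A minimal sketch, assuming the digest is already in a hash keyed by timestamp and borrowing the stats/tstamp/op100 names used elsewhere in the thread:

        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'secret',
                               { RaiseError => 1 });

        # %sum is the in-memory digest: $sum{$tstamp} = [ op100, op700, total ]
        my %sum = ( '2005-09-14 21:00:00' => [ 3, 1, 4 ] );

        # which timestamps are already in the table?
        my %seen = map { $_->[0] => 1 }
                   @{ $dbh->selectall_arrayref('SELECT tstamp FROM stats') };

        my $upd = $dbh->prepare(
            'UPDATE stats SET op100 = op100 + ?, op700 = op700 + ?, total = total + ?
             WHERE tstamp = ?');

        open my $csv, '>', '/tmp/new_rows.csv' or die $!;
        for my $ts (keys %sum) {
            if ($seen{$ts}) {
                $upd->execute(@{ $sum{$ts} }, $ts);                  # existing key: update in place
            }
            else {
                print {$csv} join(',', $ts, @{ $sum{$ts} }), "\n";   # new key: queue for bulk load
            }
        }
        close $csv;

        $dbh->do(q{LOAD DATA LOCAL INFILE '/tmp/new_rows.csv'
                   INTO TABLE stats FIELDS TERMINATED BY ','
                   (tstamp, op100, op700, total)});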
    -HTH
Re: DBI vs Bulk Loading
by InfiniteLoop (Hermit) on Sep 14, 2005 at 22:06 UTC
    jimbus, here is one more hack (rough sketch below):
    • Use the IGNORE construct with the LOAD DATA SQL
    • Select the data in the table out to a csv file
    • Compare it with your original data, and add in the duplicate data
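
    A minimal sketch of those three steps, again with made-up table/column names:

        use DBI;
        my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'secret', { RaiseError => 1 });

        # 1. bulk load; rows whose tstamp already exists are silently skipped
        $dbh->do(q{LOAD DATA LOCAL INFILE 'digest.csv'
                   IGNORE INTO TABLE stats FIELDS TERMINATED BY ','
                   (tstamp, op100, op700, total)});

        # 2. dump what is now in the table (file is written on the MySQL server side)
        $dbh->do(q{SELECT tstamp, op100, op700, total FROM stats
                   INTO OUTFILE '/tmp/loaded.csv' FIELDS TERMINATED BY ','});

        # 3. diff /tmp/loaded.csv against digest.csv to find the skipped
        #    duplicates, then sum/update just those rows with DBI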
Re: DBI vs Bulk Loading
by pboin (Deacon) on Sep 15, 2005 at 12:30 UTC

    I'm not sure exactly what you mean by 'summing the data' when you get the second hit on a key, but here's my first thought:

    I'd consider loading up a hash, always 'summing up' the values. Do that for all of your log files. Then, you'll have a hash all populated and ready to go. You can either use DBI to insert into mySql, or you can write out to CSV and load later. (Is this a regular or a one-time event?)
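
    Something along these lines, assuming (purely as an illustration) one timestamp, opcode, count triple per log line:

        # build the digest in memory, summing repeated timestamps as they arrive
        my %sum;
        while (my $line = <>) {                     # all log files given on the command line
            chomp $line;
            my ($tstamp, $opcode, $count) = split /,/, $line;
            $sum{$tstamp}{$opcode} += $count;
            $sum{$tstamp}{total}   += $count;
        }
        # %sum is now ready to feed DBI inserts, or to dump to CSV for a bulk load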

    PS: If you want performance, make sure you understand how your transactions work. It can make a huge difference, especially in my personal preference: SQLite.
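
    For instance, with DBD::SQLite (table name and data made up just to show the shape of it):

        use DBI;
        my $dbh = DBI->connect('dbi:SQLite:dbname=stats.db', '', '', { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS stats (tstamp TEXT PRIMARY KEY, total INTEGER)');

        my @rows = ( [ '2005-09-15 00:00:00', 4 ], [ '2005-09-15 00:00:15', 7 ] );
        my $ins  = $dbh->prepare('INSERT INTO stats (tstamp, total) VALUES (?, ?)');

        $dbh->begin_work;                  # one transaction around the whole batch...
        $ins->execute(@$_) for @rows;
        $dbh->commit;                      # ...instead of an implicit commit per row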

Re: DBI vs Bulk Loading
by jimbus (Friar) on Sep 15, 2005 at 14:51 UTC

    By summing, I mean that if I had all of the records in a table without tstamp as a primary key, but wanted to create a new table where it could be the primary key, I would:

    Create table new_table as select tstamp, sum(op100), sum(op700), sum(total) from old_table group by tstamp

    And then add my primary key rule
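
    In SQL terms, that follow-up step would be something like:

        ALTER TABLE new_table ADD PRIMARY KEY (tstamp);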

    The script runs every fifteen minutes. The data is from a web services broker that appears to put only one of 8 or 9 opcode types per file, so you'll see a file with 4-5k WAP 1.x requests, followed by a bunch of files that have 2 or 4 (no k) ldap updates or other such maintenance traffic, followed by a file with 2-4k WAP 2.0 requests... All in all, I end up with a couple thousand files per day, and I'm pulling around 800-850k lines of data and digesting it down to a max of 86,400. I'm thinking that I'm running about 1500 pk violations per day, which is why I came to the conclusion that my original algorithm was too inefficient.

    I don't think IGNORE is a valid option, because I don't want to lose any data. And it appears I was remembering Oracle when I thought it would write erring records to a file... or it was just wishful thinking.

    Since I know my range of timestamps, and they should be limited to a query size of ~900 per run, I think LanceDeeply's suggestion is my best option.

    Thanks!

    jimbus

    Never moon a werewolf!
Re: DBI vs Bulk Loading
by eric256 (Parson) on Sep 15, 2005 at 16:29 UTC

    What about having two tables? One would be a kind of staging ground: you process all the records into it and store them there. It has its own unique primary key instead of the timestamp. Then use that table to load up your summed-by-timestamp table, and clear out the temp table. When inserting records you could load multiple records per insert statement (I think I settled on about 100 records per insert for a table I have with 20 columns, so you could probably manage many more than that). Whether you really need the extra table or not will depend on the quantity of records you are processing every 15 minutes.
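
    A rough sketch of the batched insert into a staging table (the staging table and its columns are made up here):

        use DBI;
        my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'secret', { RaiseError => 1 });

        # each row is [ tstamp, op100, op700, total ] from the digested logs
        my @rows = ( [ '2005-09-15 00:00:00', 3, 1, 4 ] );

        while (my @chunk = splice @rows, 0, 100) {       # ~100 rows per INSERT statement
            my $values = join ',',
                map { '(' . join(',', map { $dbh->quote($_) } @$_) . ')' } @chunk;
            $dbh->do("INSERT INTO staging (tstamp, op100, op700, total) VALUES $values");
        }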

    Yet another option would be to use the INSERT ... ON DUPLICATE KEY UPDATE syntax to have it sum them as it adds them. You could even write all those inserts into a text file and then load the text file in mysql.
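
    With the same $dbh as in the sketch above, the ON DUPLICATE KEY UPDATE version would look roughly like this, reusing the stats/tstamp names from earlier in the thread:

        my $sth = $dbh->prepare(q{
            INSERT INTO stats (tstamp, op100, op700, total)
            VALUES (?, ?, ?, ?)
            ON DUPLICATE KEY UPDATE op100 = op100 + VALUES(op100),
                                    op700 = op700 + VALUES(op700),
                                    total = total + VALUES(total)
        });
        $sth->execute('2005-09-15 00:15:00', 3, 1, 4);   # duplicate tstamp? the sums roll in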

    Good luck.


    ___________
    Eric Hodges
Re: DBI vs Bulk Loading
by revdiablo (Prior) on Sep 15, 2005 at 17:34 UTC
    "... the best way to insert, speed-wise, is to write the digested data to a CSV and bulk load it. What I need help with is how to duplicate my PK violation logic with this method."

    Er, um, the bulk loading method is for, well, bulk loading. This usually means loading a dumpfile from somewhere else (be it a backup of the database, a big chunk of data from another system, etc). The bulk loading facilities are fast because they avoid a lot of the work a normal insert does.

    But most of the time, you want that work to be done. You want to let the database do this for you, so you don't have to. Having to do all the stuff you're skipping might end up costing as much as you're saving with the bulk loading, and it creates a lot of work, more potential bugs, and all manner of headaches. In short, only use bulk loading facilities when you're actually bulk loading.
