http://www.perlmonks.org?node_id=781071


in reply to scripts stops running .. almost

Some databases don't write the data out properly without an explicit commit, and instead fill up temporary structures. This is to ensure read consistency.

Maybe there is some form of commit-like statement that will flush this database buffer, release the locks, relock, and start again with the next block of data.
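DB_File doesn't expose a transaction commit as such, but the object returned by tie does have a sync method that flushes its buffers to the file. A rough sketch of the "flush every so often" idea, assuming tab-separated key/value pairs on STDIN and a B-tree file; the file name and the 500-record interval are made up for illustration:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to a Berkeley DB B-tree file; keep the returned object
    # so we can call sync() on it.
    my %h;
    my $db = tie %h, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot open data.db: $!";

    my $count = 0;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        $h{$key} = $value;

        # Flush dirty pages to disk every 500 inserts so the in-memory
        # buffers don't grow without bound.
        $db->sync if ++$count % 500 == 0;
    }

    $db->sync;    # final flush
    undef $db;    # drop the extra reference before untie
    untie %h;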

Re^2: scripts stops running .. almost
by Marshall (Canon) on Jul 17, 2009 at 16:37 UTC
    I've heard about this problem happening. I think tweetiepooh could be on to something important here. I was reading more about Berkeley DB here:
    http://www.oracle.com/technology/documentation/berkeley-db/db/gsg_txn/C/index.html
    There is a lot of bookkeeping involved in keeping track of a transaction, and if you are in a situation where, say, two hours of inserts form one transaction which in theory could be aborted with no change to the DB, there's a lot of overhead there! A commit would say, "I'm finished with this one." I am not a DB guru, but I'm also wondering if there aren't some options that circumvent some of the normal transaction rollback and journaling for the case of a single user doing the initial DB create from scratch? I don't know. Just wondering if this initial build is somehow handled differently than the "online use" of the thing once built?

    Update: I would leave output unbuffered until you get this working, but you should be aware that there is a significant performance penalty for that. In this case, we could be talking hours of difference! Get it working, then turn buffering back on and see what happens. Right now I am suspecting that tweetiepooh's idea of committing every 100 or whatever adds is gonna do something impressive.
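    For the unbuffered-output part, a minimal sketch (this only affects Perl's own output buffering, not anything the DB does):

        # Disable output buffering on STDOUT while debugging, so progress
        # prints show up immediately even if the script later stalls.
        $| = 1;

        # Equivalent per-handle form:
        use IO::Handle;
        STDOUT->autoflush(1);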

Re^2: scripts stops running .. almost
by mzedeler (Pilgrim) on Jul 17, 2009 at 21:23 UTC

    The original Berkeley DB didn't support transactions. Even though the newer versions do support transactions, it doesn't seem that DB_File supports them.

    If it works the way I remember, every change is written straight to the file as it is made, but you can't rely on the file size as a safe measure of every write.

    A different way to speed up the load is to randomize the order of the keys (or to apply a pseudo-random map to the keys themselves, such as MD5). I know it sounds odd, but if you are using B-tree storage and the keys are sorted, you get very long load times because the tree is constantly being rebalanced.
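    A sketch of the pseudo-random mapping, assuming records arrive on STDIN as tab-separated key/value pairs; the file name is a placeholder:

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;
        use Digest::MD5 qw(md5_hex);

        tie my %db, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
            or die "Cannot tie data.db: $!";

        # Store each record under the MD5 of its key rather than the key
        # itself. Sorted input then looks random to the B-tree, and
        # lookups still work because md5_hex($key) is deterministic.
        while (my $line = <STDIN>) {
            chomp $line;
            my ($key, $value) = split /\t/, $line, 2;
            $db{ md5_hex($key) } = $value;
        }

        # Later: my $value = $db{ md5_hex($wanted_key) };
        untie %db;

    Note that with this mapping you can only fetch records by key; iterating the B-tree gives you the MD5 digests, not the original keys.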

    My suggestion with regard to trying hash storage still stands. Try that first.
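    For reference, the switch to hash storage is a one-word change in the tie call (sketch; the file name and mode are placeholders):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # $DB_HASH selects Berkeley DB's hash access method instead of
        # $DB_BTREE, so there is no tree to rebalance during the load
        # (at the cost of losing sorted traversal of the keys).
        tie my %h, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie data.db: $!";

        $h{some_key} = 'some value';    # an ordinary insert into the tied hash
        untie %h;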

        Yes. It does now. I didn't claim otherwise.