Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Re (tilly) 2: Performance quandary

by tilly (Archbishop)
on Feb 24, 2002 at 20:43 UTC ( #147209=note: print w/ replies, xml ) Need Help??

in reply to Re: Performance quandary
in thread Performance quandary

That MySQL comment bothers me.

With a properly designed BTree, the access time should scale logarithmically with the size of the table. The n*log(n) should be the overall time to build up the whole table (n operations, each of which is log(n)).

Typical data sets range between, say, 1000 and 10**9 rows. Over that range log(n) only changes by a factor of 3, so for all intents and purposes it is just some constant. In real databases the access time is dominated by the time to seek to disk, and so (as my link in my previous post pointed out) the locality of reference of a BTREE can be a good win over hashing because it is more likely to hit pages in cache. Sure, asymptotically that log(n) is infinitely worse than any possible constant, but real life doesn't tend to reach that asymptotic nightmare. (Besides which BTREEs have a good worst-case guaranteed performance. Hashes don't.)

On to other issues.

Using a RAM disk does a lot to remove filesystem overhead. That should, among other things, catch most of the possible speed wins from extracting parallelism in commits (through insert buffers and other tricks). Come to think of it, most of the rest of the filesystem overhead can be removed just by naming your file undef, which tells Berkeley DB that it is dealing with an in memory database, removing the OS's filesystem abstraction layers from the picture.

You do have a good point about doing too much Perl in the processs. Personally I strongly suspect that just changing to a better data structure will win the day. Alternately, if you can, using native hashes is better than pulling in Berkeley DB. But given Perl's memory overhead, I would be concerned about hitting the 32-bit addressing barrier, even if you have enough RAM in the machine to do it. (A wall we will be hearing about more and more often as time goes by...)

Comment on Re (tilly) 2: Performance quandary
Re: Re (tilly) 2: Performance quandary
by Xanatax (Scribe) on Feb 25, 2002 at 01:14 UTC
    it seems to me you are wanting an in-memory data structure, but you are tying the data to a database structure. if you are going to go this route, i would suggest looking into the index optimizations of BDB as tilly suggested with an 'undef' file, MySQL HEAP tables, and pure-perl structures like DBD::AnyData. A good index system will make or break your seek time, and you are seeking alot aren't you? i'm not sure how much overhead a connection to BDB or MySQL would add, but if it is significant DBD::AnyData in-memory table may be less by virtue of running as a perl module and not connecting to anything at all.

    if you are doing concurrent reads/writes the locking is of importance to you as well. you probably want to avoid anything that locks more than just the relevent rwo. if you can guarentee integrity another way, or 100% accuracy is unnecessary, maybe you can avoid locking the table at all.

    if you aren't doing concurrent processing, since you are farming the data-storage out to BDB you might be able to get away with (A) building the trees you have to make in a child process so several can be done concurrently, and (B) splitting your tables so you can do concurrent reads/writes. this probably won't get you much, but it would allow you to avoid some locking problems if your tables do lock, and it would pottentially allow you to run a separate DB and/or tree building on a second proceesor/machine to share the load.

    you also want to avoid anything resembling transaction logging as that inevitably is designed to trade speed for data integrity. i don't know BDB well, but in MySQL you would want to avoid InnoDB tables just for this reason.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://147209]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (10)
As of 2015-02-27 13:30 GMT
Find Nodes?
    Voting Booth?

    On my keyboard, Caps lock is:

    Results (444 votes), past polls