Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Re: Database Comparison

by ForgotPasswordAgain (Deacon)
on Jul 04, 2011 at 22:45 UTC ( #912721=note: print w/replies, xml ) Need Help??

in reply to Database Comparison

I haven't worked with a billion, but 100 million yes. I think you need to have detailed knowledge of how your database (engine) works, your hardware, and use specific knowledge of your database tables; not a general out-of-the-box solution. I (based on ideas/code of smarter colleagues :) have used a similar strategy to that of sundialsvc4, which is also like what mk-table-sync (from Maatkit for MySQL) uses. That is written in Perl, incidentally, and according to that page started partly from a Perlmonks discussion. Basically the idea is to "chunk" your table, take md5sums and normalize the width of the data (LEFT, RIGHT, HEX), and BITXOR to get a quick checksum of the chunk. (I don't know the Oracle equivalents.) This way you determine which chunks are different, then you do a similar thing for the rows.

There are lots of details, though; for example, how do you handle floating point comparison? Are your primary keys integers? Single-column PK, or multi-column? Are they densely or sparsely distributed? Is your content fat (wide text) or a few numeric columns?

And where are the real bottlenecks? What davido seems to be suggesting is that the network is one, but maybe not. On an internal network, it can be fast to go from RAM of the database, across a network socket, into RAM of your Perl script. On the other hand, it's generally hideously slow to read/write things from/to disk (so we avoid big temporary tables, filesorts, in database queries, for example; it can lead to counter-intuitive stategies, like preferring to SELECT 100k rows and group in Perl, rather than use a GROUP BY in the database, where it might create temporary tables on disk. But with a billion rows, you're probably not going to have that already all in RAM ;).

Sorry if what I wrote is a bit incoherent, basically stream of thought.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://912721]
[Corion]: On the upside, I spend a lot of time thinking this weekend about how to actually implement rate limiting for futures, and if things work out, maybe even loading a configuration from an external file makes sense
[Corion]: I've also found some interesting invariants that I have to think/write about more. A simple rate limiter will never change the order of the input, while a limiter that allows for parallel execution will change the order. But my API currently allows for bo
[Corion]: ... for both, and I'm not sure if I want to add the cruft from the parallel API (a token that you need to hold on to while you hold the lock) to the rate limiting API too, to allow seamless up/downgrades, or not.
[Corion]: Also, rate limiting will look great with await: my $token = await $limiter-> limit($hostname); instead of my $f = $limiter->limit( $hostname )->then(sub { my( $token)=@_; ... });

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (9)
As of 2017-10-23 08:25 GMT
Find Nodes?
    Voting Booth?
    My fridge is mostly full of:

    Results (277 votes). Check out past polls.