PerlMonks  

Re: Database Comparison

by ForgotPasswordAgain (Priest)
on Jul 04, 2011 at 22:45 UTC ([id://912721])


in reply to Database Comparison

I haven't worked with a billion rows, but with 100 million, yes. I think you need detailed knowledge of how your database engine works, of your hardware, and of your specific tables, rather than a general out-of-the-box solution. I (based on ideas/code of smarter colleagues :) have used a strategy similar to that of sundialsvc4, which is also like what mk-table-sync (from Maatkit for MySQL) uses. That tool is written in Perl, incidentally, and according to that page it started partly from a PerlMonks discussion.

Basically the idea is to "chunk" your table, take md5sums and normalize the width of the data (LEFT, RIGHT, HEX), and BIT_XOR the results to get a quick checksum of each chunk. (I don't know the Oracle equivalents.) This way you determine which chunks are different, then you do a similar thing for the rows inside those chunks.
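To make the chunk-then-row idea concrete, here is a minimal sketch. It is in Python only to keep it short and runnable; the same shape ports to Perl with DBI. Everything in it is an assumption for illustration: the function names are made up, rows are plain tuples with an integer primary key first, and the hashing runs in the client. In MySQL you would push the MD5 and the XOR into the query so that only one checksum per chunk crosses the wire. This is not the actual mk-table-sync code.

```python
import hashlib

def row_checksum(row):
    """Return a 64-bit integer derived from the MD5 of a row's column values."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing the same;
    # a marker keeps NULL apart from the empty string.
    joined = "\x1f".join("\x00NULL" if v is None else str(v) for v in row)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)  # keep the left 16 hex digits (64 bits)

def chunk_checksums(rows, chunk_size):
    """Map chunk number -> XOR of the row checksums in that chunk."""
    sums = {}
    for row in rows:
        chunk = row[0] // chunk_size  # assumes an integer PK in column 0
        sums[chunk] = sums.get(chunk, 0) ^ row_checksum(row)
    return sums

def differing_chunks(rows_a, rows_b, chunk_size):
    """Chunk numbers whose checksum differs, or that exist on one side only."""
    a = chunk_checksums(rows_a, chunk_size)
    b = chunk_checksums(rows_b, chunk_size)
    return sorted(c for c in a.keys() | b.keys() if a.get(c) != b.get(c))

def differing_keys(rows_a, rows_b, chunk_size, chunk):
    """Within one chunk, primary keys whose row checksum differs."""
    def pick(rows):
        return {r[0]: row_checksum(r) for r in rows if r[0] // chunk_size == chunk}
    a, b = pick(rows_a), pick(rows_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# Hypothetical data: 1000 rows, one changed and one missing in the copy.
source = [(i, f"name{i}", i * 10) for i in range(1, 1001)]
target = list(source)
target[499] = (500, "name500", 9999)  # row with PK 500 modified
del target[899]                       # row with PK 900 missing

bad_chunks = differing_chunks(source, target, chunk_size=100)
bad_keys = [k for c in bad_chunks
            for k in differing_keys(source, target, 100, c)]
```

With a chunk size of 100, only chunks 5 and 9 get flagged, so the row-level pass looks at about 200 rows instead of 1000. XOR is order-independent, which is why the chunk checksum does not need an ORDER BY, but it also means two identical rows would cancel out; including the primary key in the hashed columns avoids that.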

There are lots of details, though; for example, how do you handle floating point comparison? Are your primary keys integers? Single-column PK, or multi-column? Are they densely or sparsely distributed? Is your content fat (wide text) or a few numeric columns?
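On the floating point question, one possible approach (my assumption, not something settled in this thread) is to format floats to a fixed number of decimal places before hashing, so that two values that differ only in the last few bits produce the same row checksum. A small Python sketch with made-up helper names:

```python
import hashlib

def normalize(value, places=6):
    """Render a column value as text, with floats fixed to `places` decimals."""
    if value is None:
        return "\x00NULL"  # marker so NULL does not collide with ""
    if isinstance(value, float):
        return f"{value:.{places}f}"
    return str(value)

def row_digest(row, places=6):
    """MD5 hex digest of a row after normalizing each column."""
    joined = "\x1f".join(normalize(v, places) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The same logical row, with the float computed two different ways.
row_a = (42, "widget", 0.1 + 0.2)
row_b = (42, "widget", 0.3)
same_after_normalizing = row_digest(row_a) == row_digest(row_b)
same_raw = repr(row_a[2]) == repr(row_b[2])
```

Here the raw values differ in their last digits but the normalized digests match. The number of places is a judgment call that depends on your data, and two values that straddle a rounding boundary can still come out different, so this reduces false mismatches rather than eliminating them.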

And where are the real bottlenecks? What davido seems to be suggesting is that the network is one, but maybe not. On an internal network, it can be fast to go from the RAM of the database, across a network socket, into the RAM of your Perl script. On the other hand, it's generally hideously slow to read/write things from/to disk, so we avoid big temporary tables and filesorts in database queries, for example. That can lead to counter-intuitive strategies, like preferring to SELECT 100k rows and group them in Perl rather than use a GROUP BY in the database, where it might create temporary tables on disk. But with a billion rows, you're probably not going to have all of that in RAM already ;)
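As a sketch of the "SELECT the rows and group in the client" strategy, here is a small Python example against an in-memory SQLite database (in Perl this would be a hash filled from a DBI fetch loop). The `orders` table and the helper name are hypothetical, and SQLite is only a stand-in so the example runs anywhere. Whether this actually beats GROUP BY depends on your engine, indexes and memory, so measure it on your own setup.

```python
import sqlite3
from collections import defaultdict

def group_in_client(conn, query):
    """Stream (key, amount) rows and keep a running count and sum per key."""
    totals = defaultdict(lambda: [0, 0])
    for key, amount in conn.execute(query):
        entry = totals[key]
        entry[0] += 1       # row count for this key
        entry[1] += amount  # running sum for this key
    return {k: tuple(v) for k, v in totals.items()}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"c{i % 10}", i) for i in range(1000)],
)

# Plain SELECT, aggregated in the client.
client_side = group_in_client(conn, "SELECT customer, amount FROM orders")

# The same aggregate done by the database, for comparison.
server_side = {
    customer: (count, total)
    for customer, count, total in conn.execute(
        "SELECT customer, COUNT(*), SUM(amount) FROM orders GROUP BY customer"
    )
}
```

Both dictionaries hold the same counts and sums per customer. The client-side version only needs memory proportional to the number of distinct keys, not the number of rows, which is what makes it workable when the row stream is large but the set of groups is small.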

Sorry if what I wrote is a bit incoherent, basically stream of thought.
