I haven't worked with a billion, but 100 million yes. I think you need to have detailed knowledge of how your database (engine) works, your hardware, and use specific knowledge of your database tables; not a general out-of-the-box solution. I (based on ideas/code of smarter colleagues :) have used a similar strategy to that of sundialsvc4, which is also like what mk-table-sync (from Maatkit for MySQL) uses. That is written in Perl, incidentally, and according to that page started partly from a Perlmonks discussion. Basically the idea is to "chunk" your table, take md5sums and normalize the width of the data (LEFT, RIGHT, HEX), and BITXOR to get a quick checksum of the chunk. (I don't know the Oracle equivalents.) This way you determine which chunks are different, then you do a similar thing for the rows.
There are lots of details, though; for example, how do you handle floating point comparison? Are your primary keys integers? Single-column PK, or multi-column? Are they densely or sparsely distributed? Is your content fat (wide text) or a few numeric columns?
And where are the real bottlenecks? What davido seems to be suggesting is that the network is one, but maybe not. On an internal network, it can be fast to go from RAM of the database, across a network socket, into RAM of your Perl script. On the other hand, it's generally hideously slow to read/write things from/to disk (so we avoid big temporary tables, filesorts, in database queries, for example; it can lead to counter-intuitive stategies, like preferring to SELECT 100k rows and group in Perl, rather than use a GROUP BY in the database, where it might create temporary tables on disk. But with a billion rows, you're probably not going to have that already all in RAM ;).
Sorry if what I wrote is a bit incoherent, basically stream of thought.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.
| & || & |
| < || < |
| > || > |
| [ || [ |
| ] || ] ||