Database Comparison

by aartist (Pilgrim)
on Jul 01, 2011 at 16:37 UTC

aartist has asked for the wisdom of the Perl Monks concerning the following question:

What Perl tools are available for comparing two databases for given queries? I have Oracle databases with very large volumes (more than a billion rows). The databases are hosted on two different servers. I would like detailed reports of the mismatches. I could extract the data to CSV files and make the comparison that way, but knowledge of existing tools/modules would help.

Thanks.

Replies are listed 'Best First'.
Re: Database Comparison
by davido (Cardinal) on Jul 01, 2011 at 17:36 UTC

    If fetching 1000 rows with DBI takes 1 second, and comparing 1000 rows using Data::Compare takes a half-second, and then you sleep for two seconds to avoid grinding your network and servers to a halt, it will take you about 40 days to compare a billion rows (if my guesstimations are anywhere near correct). If fetching 1000 rows only takes half that long, and comparing only takes half that long, and you sleep for only one second between iterations, you're down to 20 days. :)

    DBIx::Compare exists for some databases. It's probably not guaranteed to be the fastest solution, but it's ready to use. On a quick look-through of its documentation I didn't happen across any throttling suggestions, but you'll probably want to find some way of preventing excessive load.
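
    A minimal sketch of that chunked loop, assuming both tables share a dense numeric key named id; the DSNs, credentials, table name, and chunk size below are placeholders, not anything DBIx::Compare provides:

        use strict;
        use warnings;
        use DBI;
        use Data::Compare;

        # Placeholder connections -- adjust DSNs, credentials, and table/key names.
        my $dbh_a = DBI->connect('dbi:Oracle:DB1', 'user', 'pass', { RaiseError => 1 });
        my $dbh_b = DBI->connect('dbi:Oracle:DB2', 'user', 'pass', { RaiseError => 1 });

        my $sql   = 'SELECT * FROM big_table WHERE id BETWEEN ? AND ? ORDER BY id';
        my $sth_a = $dbh_a->prepare($sql);
        my $sth_b = $dbh_b->prepare($sql);

        my $chunk = 1000;
        for (my $lo = 0; $lo < 1_000_000_000; $lo += $chunk) {
            my $hi = $lo + $chunk - 1;
            $sth_a->execute($lo, $hi);
            $sth_b->execute($lo, $hi);
            my $rows_a = $sth_a->fetchall_arrayref;
            my $rows_b = $sth_b->fetchall_arrayref;
            print "mismatch somewhere in id range $lo-$hi\n"
                unless Compare($rows_a, $rows_b);
            sleep 2;    # crude throttle, per the estimate above
        }

    That only tells you which id ranges disagree; you would still diff the two chunks (or rerun with a smaller range) to pinpoint the offending rows.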


    Dave

Re: Database Comparison
by sundialsvc4 (Abbot) on Jul 01, 2011 at 21:16 UTC

    When you reach this volume of data, especially when it is spread across two different servers (hence... a network of some kind will now be involved...), you're simply going to have to devise some strategy that will enable you to “divide and conquer.”

    There are various possibilities to consider, depending on exactly what kind of comparisons you want to make. For example, I once worked on a similar task where we pulled row counts and MD5 sums of blocks of records, then compared those results, gradually zeroing in on the differences. (These queries focused on just one server at a time.)
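
    A rough sketch of that block-checksum idea, one server at a time. The original used MD5 sums; this substitutes Oracle's built-in ORA_HASH as the per-row hash purely for illustration, and the table, key, and column names are invented. A summed hash is only a weak checksum, but it is cheap to compute close to the data:

        use strict;
        use warnings;
        use DBI;

        # Placeholder connections; run the same block query against each server
        # in turn and compare the (count, checksum) pairs per block.
        my $dbh_a = DBI->connect('dbi:Oracle:DB1', 'user', 'pass', { RaiseError => 1 });
        my $dbh_b = DBI->connect('dbi:Oracle:DB2', 'user', 'pass', { RaiseError => 1 });

        my $sql = q{
            SELECT COUNT(*), SUM(ORA_HASH(col1 || '|' || col2 || '|' || col3))
              FROM big_table
             WHERE id BETWEEN ? AND ?
        };

        for my $block (0 .. 9) {    # first ten million-row blocks, as an example
            my ($lo, $hi) = ($block * 1_000_000, $block * 1_000_000 + 999_999);
            my ($cnt_a, $sum_a) = $dbh_a->selectrow_array($sql, undef, $lo, $hi);
            my ($cnt_b, $sum_b) = $dbh_b->selectrow_array($sql, undef, $lo, $hi);
            print "block $lo-$hi differs\n"
                if $cnt_a != $cnt_b or ($sum_a // 0) != ($sum_b // 0);
        }

    Blocks that disagree can then be split into smaller blocks, or compared row by row.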

Re: Database Comparison
by JavaFan (Canon) on Jul 04, 2011 at 22:59 UTC

    Run a select * from table order by some-key against both tables (assuming they have a unique index...), then repeatedly fetch rows from either side, reporting mismatches. Unless your database is cold, the network ought to be the bottleneck here - assuming your client libraries negotiate packet size on your behalf. If the database is cold, reading from disk may be the bottleneck (it depends a bit on the type of index), but there will be no way around that. If it's got to be read from disk, it's got to be read from disk.
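
    A sketch of that merge-style walk over two ordered cursors, assuming a single numeric unique key named id; the connections and column list are placeholders:

        use strict;
        use warnings;
        use DBI;

        # Placeholder connections; the key (id) must be the first column selected
        # so it can drive the merge.
        my $dbh_a = DBI->connect('dbi:Oracle:DB1', 'user', 'pass', { RaiseError => 1 });
        my $dbh_b = DBI->connect('dbi:Oracle:DB2', 'user', 'pass', { RaiseError => 1 });

        my $sql = 'SELECT id, col1, col2 FROM big_table ORDER BY id';
        my ($sth_a, $sth_b) = map { my $s = $_->prepare($sql); $s->execute; $s }
                              ($dbh_a, $dbh_b);

        my $row_a = $sth_a->fetchrow_arrayref;
        my $row_b = $sth_b->fetchrow_arrayref;

        while ($row_a or $row_b) {
            if (!$row_b or ($row_a and $row_a->[0] < $row_b->[0])) {
                print "only on A: id $row_a->[0]\n";
                $row_a = $sth_a->fetchrow_arrayref;
            }
            elsif (!$row_a or $row_b->[0] < $row_a->[0]) {
                print "only on B: id $row_b->[0]\n";
                $row_b = $sth_b->fetchrow_arrayref;
            }
            else {
                print "differs:   id $row_a->[0]\n"
                    if join("\0", map { $_ // '' } @$row_a)
                    ne join("\0", map { $_ // '' } @$row_b);
                $row_a = $sth_a->fetchrow_arrayref;
                $row_b = $sth_b->fetchrow_arrayref;
            }
        }
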
Re: Database Comparison
by TomDLux (Vicar) on Jul 04, 2011 at 14:02 UTC

    Are there backups / file copies you can work with?

    As Occam said: Entia non sunt multiplicanda praeter necessitatem (entities should not be multiplied beyond necessity).

Re: Database Comparison
by ForgotPasswordAgain (Priest) on Jul 04, 2011 at 22:45 UTC

    I haven't worked with a billion, but 100 million, yes. I think you need detailed knowledge of how your database (engine) works, your hardware, and the specifics of your database tables; a general out-of-the-box solution won't cut it. I (based on ideas/code of smarter colleagues :) have used a strategy similar to sundialsvc4's, which is also like what mk-table-sync (from Maatkit for MySQL) uses. That is written in Perl, incidentally, and according to that page it started partly from a PerlMonks discussion. Basically the idea is to "chunk" your table, take MD5 sums, normalize the width of the data (LEFT, RIGHT, HEX), and BITXOR them to get a quick checksum of each chunk (roughly as in the sketch below). (I don't know the Oracle equivalents.) This way you determine which chunks are different, then you do a similar thing for the rows.
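
    A rough Perl/DBI sketch of that chunk-checksum query in MySQL flavour (the column names, chunk bounds, and connection details are made up, and the Oracle equivalents would differ):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=db;host=server1', 'user', 'pass',
                               { RaiseError => 1 });

        # Per-chunk row count plus a BIT_XOR of a width-normalized slice of each
        # row's MD5; run the same query on both servers and compare the pairs.
        my $sth = $dbh->prepare(q{
            SELECT COUNT(*),
                   BIT_XOR(CAST(CONV(LEFT(MD5(CONCAT_WS('#', id, col1, col2)), 16), 16, 10)
                                AS UNSIGNED))
              FROM big_table
             WHERE id BETWEEN ? AND ?
        });

        $sth->execute(1, 100_000);    # one chunk of the key space
        my ($cnt, $crc) = $sth->fetchrow_array;
        print "chunk 1-100000: rows=$cnt checksum=$crc\n";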

    There are lots of details, though; for example, how do you handle floating point comparison? Are your primary keys integers? Single-column PK, or multi-column? Are they densely or sparsely distributed? Is your content fat (wide text) or a few numeric columns?

    And where are the real bottlenecks? What davido seems to be suggesting is that the network is one, but maybe not. On an internal network, it can be fast to go from the RAM of the database, across a network socket, into the RAM of your Perl script. On the other hand, it's generally hideously slow to read/write things from/to disk, so in database queries we avoid big temporary tables and filesorts, for example. That can lead to counter-intuitive strategies, like preferring to SELECT 100k rows and group them in Perl rather than use a GROUP BY in the database, where it might create temporary tables on disk (a sketch of that follows below). But with a billion rows, you're probably not going to have it all in RAM already ;).
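
    A tiny sketch of that "group in Perl" idea, with a made-up status column and placeholder connection:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Oracle:DB1', 'user', 'pass', { RaiseError => 1 });

        # Stream one chunk of rows and count per status in a Perl hash, instead of
        # asking the database to GROUP BY (which might spill a temp table to disk).
        my $sth = $dbh->prepare('SELECT status FROM big_table WHERE id BETWEEN ? AND ?');
        $sth->execute(1, 100_000);

        my %count;
        while (my ($status) = $sth->fetchrow_array) {
            $count{$status}++;
        }
        printf "%-12s %d\n", $_, $count{$_} for sort keys %count;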

    Sorry if what I wrote is a bit incoherent, basically stream of thought.
