
Database Comparison

by aartist (Monk)
on Jul 01, 2011 at 16:37 UTC ( #912377=perlquestion )
aartist has asked for the wisdom of the Perl Monks concerning the following question:

What Perl tools are available for comparing two databases for given queries? I have Oracle databases with very large volumes (more than a billion rows). The databases are hosted on two different servers. I would like detailed reports of the mismatches. I can extract the data to CSV files and make the comparison myself, but knowledge of existing tools/modules would help.


Replies are listed 'Best First'.
Re: Database Comparison
by davido (Archbishop) on Jul 01, 2011 at 17:36 UTC

    If fetching 1000 rows with DBI takes 1 second, and comparing 1000 rows using Data::Compare takes a half-second, and then you sleep for two seconds to avoid grinding your network and servers to a halt, it will take you about 40 days to compare a billion rows (if my guesstimations are anywhere near correct). If fetching 1000 rows only takes half that long, and comparing only takes half that long, and you sleep for only one second between iterations, you're down to 20 days. :)

    DBIx::Compare exists for some databases. It's probably not guaranteed to be the fastest solution, but it's ready to use. On a quick look-through of its documentation I didn't happen across any throttling suggestions, but you'll probably want to find some way of preventing excessive load.
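    The throttled, chunked loop davido describes can be sketched in plain Perl. The chunk-comparison logic below is pure Perl so it can be exercised without a database; the DBI hookup in the trailing comment is a hypothetical sketch (DSNs, table, and query are made up), not tested code.

```perl
use strict;
use warnings;

# Compare two result streams chunk by chunk, sleeping between chunks to
# throttle load on the servers and the network. Each fetcher is a coderef
# that returns an arrayref of rows (arrayrefs), or undef when exhausted,
# so the same loop works against DBI statement handles or test fixtures.
sub compare_chunked {
    my ($fetch_a, $fetch_b, $throttle) = @_;
    my $mismatches = 0;
    while (1) {
        my $a = $fetch_a->();
        my $b = $fetch_b->();
        last if !defined $a && !defined $b;        # both sides exhausted
        if (!defined $a || !defined $b) {          # one side ran out early
            $mismatches++;
            next;
        }
        # Serialise each row with an unlikely separator, compare chunk-wise.
        my $flat_a = join "\n", map { join "\x1f", map { $_ // '' } @$_ } @$a;
        my $flat_b = join "\n", map { join "\x1f", map { $_ // '' } @$_ } @$b;
        $mismatches++ if $flat_a ne $flat_b;
        sleep $throttle if $throttle;              # be kind to the servers
    }
    return $mismatches;
}

# Helper to turn an in-memory table into a chunked fetcher (handy for tests).
sub array_fetcher {
    my ($rows, $chunk) = @_;
    my @copy = @$rows;
    return sub { return undef unless @copy; [ splice @copy, 0, $chunk ] };
}

# Hooking it to DBI might look like this (hypothetical, untested sketch):
#   my $sth = $dbh->prepare('SELECT id, col1 FROM big_table ORDER BY id');
#   $sth->execute;
#   my $fetch = sub {
#       my $r = $sth->fetchall_arrayref(undef, 1000);
#       return (@$r ? $r : undef);
#   };
```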


Re: Database Comparison
by sundialsvc4 (Abbot) on Jul 01, 2011 at 21:16 UTC

    When you reach this volume of data, especially when they are on two different servers (hence... a network of some kind will now be involved ...), you're simply going to have to devise some strategy that will enable you to “divide and conquer.”

    There are various possibilities to consider, depending on exactly what kind of comparisons you want to make.   For example, I once worked on a similar task where we pulled row-counts and MD5-sums of blocks of records, then compared those results, gradually zeroing-in on the differences.   (These queries focused on just one server at a time.)
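    The row-count-plus-MD5 idea above can be sketched with the core Digest::MD5 module. This is a minimal client-side version, assuming the rows for one key-range block have already been fetched; a real implementation would push the hashing into the database where possible.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute a (row_count, md5) signature for one block of rows, e.g. all
# rows whose key falls in a given range. If the signatures computed on
# both servers match, the whole block is presumed identical and skipped;
# only blocks whose signatures differ need a row-by-row comparison.
sub block_signature {
    my ($rows) = @_;                       # arrayref of row arrayrefs
    my $md5 = Digest::MD5->new;
    for my $row (@$rows) {
        # NULLs and column boundaries must be encoded unambiguously,
        # or ('ab', 'c') and ('a', 'bc') would hash the same.
        $md5->add(join("\x1f", map { defined $_ ? $_ : '\N' } @$row), "\n");
    }
    return (scalar @$rows, $md5->hexdigest);
}
```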

Re: Database Comparison
by JavaFan (Canon) on Jul 04, 2011 at 22:59 UTC
    Do a select * from table order by some-key on both tables (assuming they have a unique index...), then repeatedly fetch rows from either side, reporting mismatches. Unless your database is cold, the network ought to be the bottleneck here - assuming your client libraries negotiate packet size on your behalf. If the database is cold, reading from disk may be the bottleneck (depends a bit on the type of index), but there will be no way around that. If it's got to be read from disk, it's got to be read from disk.
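    The sorted-merge comparison described above can be sketched as follows. The iterators stand in for the two statement handles; keys are compared as strings here, which is an assumption you would adjust to the actual key type.

```perl
use strict;
use warnings;

# Merge-compare two streams already sorted by a unique key (as produced
# by "ORDER BY some-key" on each server). Each iterator returns a row
# arrayref [$key, @cols], or undef when exhausted. $report is called with
# ('only_in_a', $key), ('only_in_b', $key), or ('differs', $key).
sub merge_compare {
    my ($next_a, $next_b, $report) = @_;
    my ($a, $b) = ($next_a->(), $next_b->());
    while (defined $a || defined $b) {
        if (!defined $b || (defined $a && $a->[0] lt $b->[0])) {
            $report->('only_in_a', $a->[0]);
            $a = $next_a->();
        }
        elsif (!defined $a || $b->[0] lt $a->[0]) {
            $report->('only_in_b', $b->[0]);
            $b = $next_b->();
        }
        else {                             # same key on both sides
            my $ra = join "\x1f", map { $_ // '' } @$a[1 .. $#$a];
            my $rb = join "\x1f", map { $_ // '' } @$b[1 .. $#$b];
            $report->('differs', $a->[0]) if $ra ne $rb;
            ($a, $b) = ($next_a->(), $next_b->());
        }
    }
}
```

    Because each side is consumed strictly in key order, memory use stays constant no matter how many rows are involved, which is what makes this workable at a billion rows.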
Re: Database Comparison
by TomDLux (Vicar) on Jul 04, 2011 at 14:02 UTC

    Are there backups / file copies you can work with?

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: Database Comparison
by ForgotPasswordAgain (Deacon) on Jul 04, 2011 at 22:45 UTC

    I haven't worked with a billion rows, but 100 million, yes. I think you need detailed knowledge of how your database (engine) works and of your hardware, and you need to use specific knowledge of your database tables; not a general out-of-the-box solution. I (based on ideas/code of smarter colleagues :) have used a strategy similar to sundialsvc4's, which is also like what mk-table-sync (from Maatkit, for MySQL) uses. That tool is written in Perl, incidentally, and according to its page started partly from a Perlmonks discussion. Basically the idea is to "chunk" your table, take md5sums and normalize the width of the data (LEFT, RIGHT, HEX), and BITXOR them to get a quick checksum of the chunk. (I don't know the Oracle equivalents.) This way you determine which chunks are different, then you do a similar thing for the rows.

    There are lots of details, though; for example, how do you handle floating point comparison? Are your primary keys integers? Single-column PK, or multi-column? Are they densely or sparsely distributed? Is your content fat (wide text) or a few numeric columns?
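    On the floating-point question, one common answer is to compare with a combined absolute/relative tolerance rather than exact equality. A minimal sketch (the default tolerances here are arbitrary and would need tuning to your data):

```perl
use strict;
use warnings;

# Two floats from different servers (or different client drivers) can
# differ in the last few bits even when the stored value is "the same".
# Treat them as equal if they are within an absolute tolerance, or within
# a relative tolerance scaled by the larger magnitude.
sub floats_match {
    my ($x, $y, $abs_tol, $rel_tol) = @_;
    $abs_tol //= 1e-9;
    $rel_tol //= 1e-9;
    my $diff  = abs($x - $y);
    my $scale = abs($x) > abs($y) ? abs($x) : abs($y);
    return $diff <= $abs_tol || $diff <= $rel_tol * $scale;
}
```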

    And where are the real bottlenecks? What davido seems to be suggesting is that the network is one, but maybe not. On an internal network, it can be fast to go from the database's RAM, across a network socket, into the RAM of your Perl script. On the other hand, it's generally hideously slow to read/write things from/to disk, so in database queries we avoid big temporary tables and filesorts, for example. That can lead to counter-intuitive strategies, like preferring to SELECT 100k rows and group them in Perl rather than use a GROUP BY in the database, where it might create temporary tables on disk. But with a billion rows, you're probably not going to have it all in RAM already. ;)

    Sorry if what I wrote is a bit incoherent, basically stream of thought.

Node Type: perlquestion [id://912377]
Front-paged by Arunbear