
Re^6: Comparing tables over multiple servers

by rnahi (Curate)
on Oct 27, 2005 at 21:28 UTC

in reply to Re^5: Comparing tables over multiple servers
in thread Comparing tables over multiple servers

Just think about this: to move a table across the network you can either (a) make a physical copy, which includes data, indexes, triggers, transaction caches, and whatever else the DBMS keeps in store for that table, or (b) copy just the data, as SQL statements or as exportable text, but then the receiving DBMS must rebuild the indexes.

Calculating a CRC, by contrast, requires only a table scan.

Therefore you should compare the time needed to calculate a SHA twice per record (once for the record itself, and once to fold that result into a global SHA) against the time needed to push all of the above-mentioned stuff over the network.
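A rough sketch of that per-record scheme in Perl, using the core Digest::SHA module (the rows here are hypothetical stand-ins; a real version would fetch them via a DBI loop over an ORDER BY primary-key scan, so both servers hash rows in the same order):

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Stand-in rows; in practice these come from fetchrow_arrayref.
my @rows = ( [ 1, 'alice' ], [ 2, 'bob' ] );

my $global = Digest::SHA->new(256);
for my $row (@rows) {
    # One SHA for the record itself, joining fields with a
    # separator that cannot occur in the data (ASCII unit sep)...
    my $record_sha = sha256_hex( join "\x1f", @$row );
    # ...and one application of it to the running global SHA.
    $global->add($record_sha);
}
print $global->hexdigest, "\n";
```

Two tables then compare equal iff their row streams match byte for byte, in the same order and encoding, so only one short digest crosses the wire.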

I put my bet on SHA. :)


Re^7: Comparing tables over multiple servers
by BrowserUk (Pope) on Oct 28, 2005 at 00:26 UTC

    You would? Not me.

    Did you take a good look at all the data conversions and substrings and stuff going on in that SQL? SQL can be very good at performing comparisons, that's its bread-and-butter work, but those kinds of data manipulations and conversions are not its strong suit.

    I attempted to verify my suspicions, but about half of the syntax in that article isn't valid on the only SQL database I have available. Even so, I'm betting (a coffee:) that it ain't quick on any platform.

    I would hazard that dumping the table using the export facility and running a dedicated binary digest(or) program over it would be considerably faster.
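    For instance, the core Digest::MD5 module can stream a dump file without slurping it into memory (the file name here is a placeholder for whatever the export facility wrote):

```perl
use strict;
use warnings;
use Digest::MD5;

# 'table.dump' is hypothetical: whatever the DBMS export produced.
open my $fh, '<', 'table.dump' or die "open: $!";
binmode $fh;    # digest the raw bytes, not a text translation
my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
close $fh;
print "$digest\n";
```

    Comparing that one digest against the same computation on the other server tells you whether the dumps differ at all, at sequential-read speed.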

    Either way, once the difference has been identified, you still have to correct it, and that means transmitting the data. Easier, surer, and possibly quicker, I think, to just zip up the dumped table and send it.

    Unless the data involved is already compressed binary--jpgs or similar--then the 100GB would probably reduce to 25% or so, and transmitting 25GB at 100Mb/s will take 34 minutes, assuming no contention.
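    The transfer estimate works out like this (the 4:1 compression ratio and the uncontended 100Mb/s link are the assumptions above):

```perl
use strict;
use warnings;

my $bytes    = 25 * 1e9;     # 100GB compressed ~4:1 -> 25GB
my $bits_sec = 100 * 1e6;    # 100Mb/s link, no contention
my $seconds  = $bytes * 8 / $bits_sec;
printf "%.0f seconds, roughly %.0f minutes\n", $seconds, $seconds / 60;
```

    That comes out at 2000 seconds, a shade over half an hour, which is the ballpark quoted above.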

    Running a dedicated md5 executable on 1GB takes around 20 seconds, so around half an hour for 100GB, but that is calculating a single hash from one contiguous data stream.

    You're suggesting calculating two hashes for every record, retrieved in itty-bitty chunks, with all the math done in SQL?

    In the absence of evidence to the contrary, my money would be on the transmission finishing long before the checksumming.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
