Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
go ahead... be a heretic
 
PerlMonks

Remote MySQL table comparison

by gmax (Abbot)
 | Log in | Create a new user | The Monastery Gates | Super Search | 
 | Seekers of Perl Wisdom | Meditations | PerlMonks Discussion | 
 | Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | Tutorials | 
 | Poetry | Recent Threads | Newest Nodes | Donate | What's New | 

on Aug 08, 2004 at 15:38 UTC ( #381053=perlmeditation: print w/ replies, xml ) Need Help??

In the current Agust 2004 issue of SysAdmin there is an article about one of my favorite subjects, i.e. remote table comparison applied to MySQL databases.

The source code is actually working code (not as documented as I would have liked it, though, but space was not unlimited) with which you can compare two tables in two remote databases and see wether they differ at all, and if they do, there are method to find out which rows are different, without copying the data from a host to the other.

How this works is minutely explain in the article, so I won't repeat it here. However, since the article deals mostly with database issues, I would like to spend some words on the supporting Perl code, which does the magic of creating the not-so-obvious SQL queries to carry out the task. I won't examine the most complex script, the one finding the detailed differences between tables, because I'd need to introduce too much background knowledge to explain them in full. Therefore, I would like to present a smaller function that will tell you if two tables are different at all, so you can take further action.

This is a slightly modified version of the source code for the same task published in the magazine. I am thinking about making a CPAN module with the whole thing, but not right now.

Let's start with the algorithm used.

In short, if you want to compare two large records, you can make a signature of the whole record, by joining together their fields and applying a CRC function, such as MD5 or SHA1.

Comparing a whole table is trickier, because in standard SQL, without stored procedures and cursors, obtaining the CRC of a range of records is far from trivial. One possibility, though, is to make a CRC for each record and then sum them up to get a result that can be consider the table signature.

Here's the code.

sub get_table_CRC { my $dbh = shift; my $tablename = shift; my $fields = get_fields($dbh, $tablename); my $check_table = qq[ SELECT COUNT(*) AS cnt, CONCAT(SUM(CONV( SUBSTRING(\@CRC:=MD5( CONCAT_WS('/##/',$fields)),1,8),16,10 )), SUM(CONV(SUBSTRING(\@CRC, 9,8),16,10)), SUM(CONV(SUBSTRING(\@CRC,17,8),16,10)), SUM(CONV(SUBSTRING(\@CRC,25,8),16,10)) ) AS sig FROM $tablename ]; # uncomment the following line to see the full query # print $check_table,$/; my ($count, $crc); eval { ($count, $crc) = $dbh->selectrow_array($check_table)}; if ($@) { return undef; } return [$count, $crc]; }

The SQL part is quite complicated by the fact that MySQL can't handle arithmetic operations with the kind of number that can result from a MD5 signature. To get over this problem, I split the MD5 string into four chunks, using the SUBSTRING function, and convert them from base 16 to base 10 using CONV. The result is a simple number that is passed to SUM. The MD5 is calculated once per record and assigned to a global MySQL variable (@CRC). It is the signature of a string composed by all fields in the record, with some adjustments to avoid NULL values to come into the equation.

Perl's biggest involvement in all this, apart form making the query, which is a bitch to create manually, is the get_fields function that reads the table structure and creates the list of fields to be be passed to MD5. If one field is nullable, a call to the COALESCE function will be used instead of its bare name.

sub get_fields { my ($dbh, $tablename) = @_; my $sth = $dbh->prepare(qq[describe $tablename]); $sth->execute(); my @fields=(); while (my $row = $sth->fetchrow_hashref()) { my $field ="`$row->{Field}`"; # backticks # if the field is nullable, # then a COALESCE function is used # to prevent the whole CONCAT from becoming NULL if (lc $row->{Null} eq 'yes') { $field = qq[COALESCE($field,"#NULL#")]; } push @fields, $field; } return join ",", @fields; }

With these two functions ready, we can actually run a test on two tables in two different hosts.

To be sure that the function works as advertised, this script will actually create the tables in both hosts. The first time this script runs, the newly created tables have the same contents, and the result will be "no differences". If you run it a second time, one record will be altered, just barely, and the result will be different.

#!/usr/bin/perl -w use strict; use DBI; # for this test, we create the database handlers directly # in the script my $dbh1 = DBI->connect('dbi:mysql:test;host=localhost', 'localuser', 'localpassword', {RaiseError => 1}); my $dbh2 = DBI->connect('dbi:mysql:test;host=192.168.2.33;port=13330', + 'remoteuser', 'remotepassword', {RaiseError => 1}); # this is the table to be created in both hosts my $tablename = 'testcrc'; my $create = qq{create table if not exists $tablename (i int not null, j int, a char(1), b float )}; my ($table_exists) = $dbh1->selectrow_array( qq{SHOW TABLES LIKE '$tablename'}); if ($table_exists) { # table exists. Let's make a tiny change $dbh1->do(qq{update $tablename set j = j-1 where i = 50}); } else # table does not exists. Create and populate { # create both tables $dbh1->do($create); $dbh2->do($create); my $insert = qq{insert into $tablename values (?, ?, ?, ?)}; my $sth1 = $dbh1->prepare($insert); my $sth2 = $dbh2->prepare($insert); # populates both tables with the same values for ( 1 .. 100 ) { $sth1->execute($_, $_ * 100, chr(ord('A') + $_), 1 / 3 ); $sth2->execute($_, $_ * 100, chr(ord('A') + $_), 1 / 3 ); } } my %probes; # gets the local table record count and CRC $probes{'local'} = get_table_CRC($dbh1, $tablename) or die "wrong info: $DBI::errstr\n"; # gets the remote table record count and CRC $probes{'remote'} = get_table_CRC($dbh2, $tablename) or die "wrong info: $DBI::errstr\n"; # Checks the result and displays print "LOCAL : @{$probes{'local'}}\nREMOTE: @{$probes{'remote'}}\n"; if ( ($probes{'local'}->[0] != $probes{'remote'}->[0]) or ($probes{'local'}->[1] ne $probes{'remote'}->[1]) ) { print "there are differences\n"; } else { print "NO DIFFERENCES\n"; }

The first run gives this result (the query is displayed only if you uncomment the relative print statement).

SELECT
    COUNT(*) AS cnt,
    CONCAT(SUM(CONV(
        SUBSTRING(@CRC:=MD5(
            CONCAT_WS('/##/',`i`,
                COALESCE(`j`,"#NULL#"),
                COALESCE(`a`,"#NULL#"),
                COALESCE(`b`,"#NULL#"))),1,8),16,10 )),
    SUM(CONV(SUBSTRING(@CRC, 9,8),16,10)),
    SUM(CONV(SUBSTRING(@CRC,17,8),16,10)),
    SUM(CONV(SUBSTRING(@CRC,25,8),16,10))
    ) AS sig
FROM
    testcrc

LOCAL : 100 211184068521202404576502196746869468230923726643
REMOTE: 100 211184068521202404576502196746869468230923726643
NO DIFFERENCES

The first number is a simple record count. The second one is a concatenated string from the four sums calculated on the MD5 chunks.

When we run the script for the second time, column "j" in record nr. 50 is increased by 1. This displays the following result:

LOCAL : 100 208246712599204490057913196197913536230317654094
REMOTE: 100 211184068521202404576502196746869468230923726643
there are differences

For the third run, we modify the source code and change "set j = j+1" to "set j = j-1". Again, the script reports that there are no differences.

You should notice that what we actually get from the get_table_CRC function is a few dozen bytes, even if the tables contain a million records. The database server may have some number crunching to perform, but you don't need to send gigabytes of data through the network.

This technique can save you hours of searching if you need to know if two remote tables have differences. More granularity is provided by the other functions described in the article.

Comments welcome

 _  _ _  _  
(_|| | |(_|><
 _|   

Comment on Remote MySQL table comparison
Select or Download Code
Re: Remote MySQL table comparison
by deliria (Hermit) on Aug 09, 2004 at 14:49 UTC

    One of the projects i've worked on the last 2 years involved a large (60m+ records) database which got replicated via a timed schedule in a master-slave setup. Sometimes for some reason the slave database would go out of sync.

    Having had this tool handy at the time would have saved me so many headaches. It's funny how simple a solution to a problem can be after it hit you full in the face.

    Thanks for pointing out this one!

    Kind regards,
    Deliria

[reply]
Re: Remote MySQL table comparison
by fab (Initiate) on Aug 25, 2004 at 12:07 UTC
    I've also read this paper with great interest.

    Although there are a few potential bugs in the algorithms and in the implementations presented, the idea is both simple and efficient.

    Thus I've implemented a new version which hopefully solves some of the weaknesses I found and has a better theoretical behavior. It is dedicated to PostgreSQL, but may be adapted to other databases.

    It is called pg_comparator, a tool for network and time efficient database table content comparison.

    see http://www.coelho.net/pg_comparator/ for the perl implementation.

[reply]
Reaped: Remote MySQL table comparison
by NodeReaper (Curate) on Jul 14, 2008 at 05:53 UTC
[reply]

Back to Meditations


Login:
Password
remember me
What's my password?
Create A New User

Node Status
node history
Node Type: perlmeditation [id://381053]
Approved by jmcnamara
Front-paged by PodMaster
help
Community Ads
Chatterbox
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users
Others drinking their drinks and smoking their pipes about the Monastery: (8)
GrandFather
wfsp
atcroft
herveus
Eyck
biohisham
lamprecht
gnosti
As of 2009-11-21 09:06 GMT
Sections
The Monastery Gates
Seekers of Perl Wisdom
Meditations
PerlMonks Discussion
Categorized Q&A
Tutorials
Obfuscated Code
Perl Poetry
Cool Uses for Perl
Perl News
Information
PerlMonks FAQ
Guide to the Monastery
What's New at PerlMonks
Voting/Experience System
Tutorials
Reviews
Library
Perl FAQs
Other Info Sources
Find Nodes
Nodes You Wrote
Super Search
List Nodes By Users
Newest Nodes
Recently Active Threads
Selected Best Nodes
Best Nodes
Worst Nodes
Saints in our Book
Leftovers
The St. Larry Wall Shrine
Offering Plate
Awards
Craft
Snippets Section
Code Catacombs
Quests
Editor Requests
Buy PerlMonks Gear
PerlMonks Merchandise
Planet Perl
Perlsphere
Use Perl
Perl.com
Perl 5 Wiki
Perl Jobs
Perl Mongers
Perl Directory
Perl documentation
CPAN
Random Node
Voting Booth

Future historians will find that the material characteristic of the current era is...

Aluminium
Plastic
Oil
Water
Carbon dioxide
Copper
Iron
Silicon
Salt
Uranium
Hydrogen
Other

Results (729 votes), past polls