Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

In the current Agust 2004 issue of SysAdmin there is an article about one of my favorite subjects, i.e. remote table comparison applied to MySQL databases.

The source code is actually working code (not as documented as I would have liked it, though, but space was not unlimited) with which you can compare two tables in two remote databases and see wether they differ at all, and if they do, there are method to find out which rows are different, without copying the data from a host to the other.

How this works is minutely explain in the article, so I won't repeat it here. However, since the article deals mostly with database issues, I would like to spend some words on the supporting Perl code, which does the magic of creating the not-so-obvious SQL queries to carry out the task. I won't examine the most complex script, the one finding the detailed differences between tables, because I'd need to introduce too much background knowledge to explain them in full. Therefore, I would like to present a smaller function that will tell you if two tables are different at all, so you can take further action.

This is a slightly modified version of the source code for the same task published in the magazine. I am thinking about making a CPAN module with the whole thing, but not right now.

Let's start with the algorithm used.

In short, if you want to compare two large records, you can make a signature of the whole record, by joining together their fields and applying a CRC function, such as MD5 or SHA1.

Comparing a whole table is trickier, because in standard SQL, without stored procedures and cursors, obtaining the CRC of a range of records is far from trivial. One possibility, though, is to make a CRC for each record and then sum them up to get a result that can be consider the table signature.

Here's the code.

sub get_table_CRC { my $dbh = shift; my $tablename = shift; my $fields = get_fields($dbh, $tablename); my $check_table = qq[ SELECT COUNT(*) AS cnt, CONCAT(SUM(CONV( SUBSTRING(\@CRC:=MD5( CONCAT_WS('/##/',$fields)),1,8),16,10 )), SUM(CONV(SUBSTRING(\@CRC, 9,8),16,10)), SUM(CONV(SUBSTRING(\@CRC,17,8),16,10)), SUM(CONV(SUBSTRING(\@CRC,25,8),16,10)) ) AS sig FROM $tablename ]; # uncomment the following line to see the full query # print $check_table,$/; my ($count, $crc); eval { ($count, $crc) = $dbh->selectrow_array($check_table)}; if ($@) { return undef; } return [$count, $crc]; }

The SQL part is quite complicated by the fact that MySQL can't handle arithmetic operations with the kind of number that can result from a MD5 signature. To get over this problem, I split the MD5 string into four chunks, using the SUBSTRING function, and convert them from base 16 to base 10 using CONV. The result is a simple number that is passed to SUM. The MD5 is calculated once per record and assigned to a global MySQL variable (@CRC). It is the signature of a string composed by all fields in the record, with some adjustments to avoid NULL values to come into the equation.

Perl's biggest involvement in all this, apart form making the query, which is a bitch to create manually, is the get_fields function that reads the table structure and creates the list of fields to be be passed to MD5. If one field is nullable, a call to the COALESCE function will be used instead of its bare name.

sub get_fields { my ($dbh, $tablename) = @_; my $sth = $dbh->prepare(qq[describe $tablename]); $sth->execute(); my @fields=(); while (my $row = $sth->fetchrow_hashref()) { my $field ="`$row->{Field}`"; # backticks # if the field is nullable, # then a COALESCE function is used # to prevent the whole CONCAT from becoming NULL if (lc $row->{Null} eq 'yes') { $field = qq[COALESCE($field,"#NULL#")]; } push @fields, $field; } return join ",", @fields; }

With these two functions ready, we can actually run a test on two tables in two different hosts.

To be sure that the function works as advertised, this script will actually create the tables in both hosts. The first time this script runs, the newly created tables have the same contents, and the result will be "no differences". If you run it a second time, one record will be altered, just barely, and the result will be different.

#!/usr/bin/perl -w use strict; use DBI; # for this test, we create the database handlers directly # in the script my $dbh1 = DBI->connect('dbi:mysql:test;host=localhost', 'localuser', 'localpassword', {RaiseError => 1}); my $dbh2 = DBI->connect('dbi:mysql:test;host=;port=13330', + 'remoteuser', 'remotepassword', {RaiseError => 1}); # this is the table to be created in both hosts my $tablename = 'testcrc'; my $create = qq{create table if not exists $tablename (i int not null, j int, a char(1), b float )}; my ($table_exists) = $dbh1->selectrow_array( qq{SHOW TABLES LIKE '$tablename'}); if ($table_exists) { # table exists. Let's make a tiny change $dbh1->do(qq{update $tablename set j = j-1 where i = 50}); } else # table does not exists. Create and populate { # create both tables $dbh1->do($create); $dbh2->do($create); my $insert = qq{insert into $tablename values (?, ?, ?, ?)}; my $sth1 = $dbh1->prepare($insert); my $sth2 = $dbh2->prepare($insert); # populates both tables with the same values for ( 1 .. 100 ) { $sth1->execute($_, $_ * 100, chr(ord('A') + $_), 1 / 3 ); $sth2->execute($_, $_ * 100, chr(ord('A') + $_), 1 / 3 ); } } my %probes; # gets the local table record count and CRC $probes{'local'} = get_table_CRC($dbh1, $tablename) or die "wrong info: $DBI::errstr\n"; # gets the remote table record count and CRC $probes{'remote'} = get_table_CRC($dbh2, $tablename) or die "wrong info: $DBI::errstr\n"; # Checks the result and displays print "LOCAL : @{$probes{'local'}}\nREMOTE: @{$probes{'remote'}}\n"; if ( ($probes{'local'}->[0] != $probes{'remote'}->[0]) or ($probes{'local'}->[1] ne $probes{'remote'}->[1]) ) { print "there are differences\n"; } else { print "NO DIFFERENCES\n"; }

The first run gives this result (the query is displayed only if you uncomment the relative print statement).

    COUNT(*) AS cnt,
                COALESCE(`b`,"#NULL#"))),1,8),16,10 )),
    SUM(CONV(SUBSTRING(@CRC, 9,8),16,10)),
    ) AS sig

LOCAL : 100 211184068521202404576502196746869468230923726643
REMOTE: 100 211184068521202404576502196746869468230923726643

The first number is a simple record count. The second one is a concatenated string from the four sums calculated on the MD5 chunks.

When we run the script for the second time, column "j" in record nr. 50 is increased by 1. This displays the following result:

LOCAL : 100 208246712599204490057913196197913536230317654094
REMOTE: 100 211184068521202404576502196746869468230923726643
there are differences

For the third run, we modify the source code and change "set j = j+1" to "set j = j-1". Again, the script reports that there are no differences.

You should notice that what we actually get from the get_table_CRC function is a few dozen bytes, even if the tables contain a million records. The database server may have some number crunching to perform, but you don't need to send gigabytes of data through the network.

This technique can save you hours of searching if you need to know if two remote tables have differences. More granularity is provided by the other functions described in the article.

Comments welcome

 _  _ _  _  
(_|| | |(_|><

In reply to Remote MySQL table comparison by gmax

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (2)
As of 2023-10-03 23:26 GMT
Find Nodes?
    Voting Booth?

    No recent polls found