Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Verify database consistency after merge

by andreas1234567 (Vicar)
on Oct 16, 2006 at 19:40 UTC ( #578584=perlquestion: print w/replies, xml ) Need Help??
andreas1234567 has asked for the wisdom of the Perl Monks concerning the following question:

This may seem a bit off-topic, but as I'm thinking perl when I try to find a way to solve this, here's the question:

I have two (MySQL) database tables containing financial transaction records. I wish to merge them into a single table, mainly for ease of use and maintenance reasons, but also since there's no (longer any) good reason this information should be split into two tables at all. There are about 40M records in each table. I need to be able to determine that the conversion preserved consistency, i.e. that the information logically remains exactly the same.

Has anybody faced similar challenges? I'm thinking in terms of script(s) involving Test::Harness which would magically spit out ok/nok would be very nice.

-- Two original tables create table a ( id int(11) unsigned not null, account_no bigint(26) unsigned not null, unique key id_idx (id) ); create table b ( id int(11) unsigned not null, balance double not null, unique key id_idx (id) ); -- New table create table c ( id int(11) unsigned not null, account_no bigint(26) not null, balance float not null, unique key id_idx (id) ); insert into a values (1,1); insert into a values (2,2); insert into b values (1,100.0); insert into b values (2,200.0); -- Populate c with values from a and b insert into c ( id, account_no, balance) select, a.account_no, b.balance from a, b where; select * from c; +----+------------+---------+ | id | account_no | balance | +----+------------+---------+ | 1 | 1 | 100 | | 2 | 2 | 200 | +----+------------+---------+ 2 rows in set (0.00 sec)
By consistency I mean that for every id the queries below should return the same result set.
select, a.account_no, b.balance from a, b where; select id, account_no, balance from c where id=X;
Do you still consider advice to leave it as is? Andreas

Replies are listed 'Best First'.
Re: Verify database consistency after merge
by graff (Chancellor) on Oct 17, 2006 at 02:03 UTC
    Following in the same vain as astaines, you should (ideally) have fairly tight definitions for "merge" and "preserve consistency". What sort of arithmetic are you expecting from the merger?

    • The two tables are storing slightly different information on the exact same set of 40M financial transactions -- that is, they contain the same number of rows, and so will the merged table.


    • The two tables are storing the same information on two different sets of transactions -- they probably have different row counts, and the row count of the new table will be the sum of row counts from the two original tables.


    • The two tables store the same information, but there is some overlap in their content, so the row count of the new table will be the sum of: row count of records that are common to both input tables, plus row count of records unique to table 1, plus row count of records unique to table 2.

    Any other situation (i.e. table structure is different in the two tables, and some or all records are unique to each table) will be a mess for merging, unless you can somehow compute, deduce, recover, or otherwise fill in the fields that are unique to table 1 for rows that come from table 2, and vice-versa.

    In the simplest case (table structures are identical), this is not a perl question. You either pick one table to be the "keeper" (or define a new table to be the "union"), and insert into that table from the other(s), using the extra clause "ON DUPLICATE KEY UPDATE ..." at the end of the insert statement.

    In the less simple case (table structures are different, but the two input tables are storing the exact same set of transactions), it's still a pure SQL operation -- create a new "merge" table with the union of all columns from the other two, and do a join of those two in order to select into the new one.

    In the nasty case (different structures and different sets of transactions)... well, good luck with that. I'd say you would need some sort of documented, critiqued and approved statement of work, which will (a) justify your conclusion that "there's no (longer any) good reason" to maintain two tables, and (b) clearly define the conditions (procedures and test protocols) that constitute a successful merge. Perl scripting is bound to help for actually getting the work done, once the task is fully specified.

      Excellent analysis and summary. If it is the nasty case, then the suggestion to do nothing seems the right option.
Re: Verify database consistency after merge
by astaines (Curate) on Oct 16, 2006 at 20:57 UTC

    It depends on what you mean by consistency. Possible ideas include the following :-

    • Completely reversible - you can get out what you put in
    • All exisitng queries give exactly the same answer with the new structure
    • The two data structures are formally equivalent to the old one in the sense that all true (and false) statements about the original two files remain true (and false) for the new file

    This is tricky to say the least, and requires a lot of consideration. My advice would be simple, unless you have to do this, don't!

    -- Anthony Staines
Re: Verify database consistency after merge
by tcf03 (Deacon) on Oct 17, 2006 at 03:33 UTC
    you could just get md5s before and after. I do something similar with two databases that are supposed to be the same. Unfortunately one is informix and one is sql server

    so I bind the columns to hashes, and feed the hashref and the db connection off to a sub which gives me hashes sums

    ...snip lots of other code...
    use strict; # the obligitory use warnings; # use strict and warnings use Digest::MD5; my $makedigest = \&digestit; my %ldb = $makedigest->(\%lemployee, $lsth); my %rdb = $makedigest->(\%remployee, $rsth);
    ...snip lots of other code...
    sub digestit { my $dbhash = shift; my $sth = shift; my %newdbhash = (); while ( $sth->fetchrow_arrayref ) { my $keydata = Digest::MD5->new; for my $key ( keys %{$dbhash} ) { %{$dbhash}->{$key} = '' unless ( defined(%{$dbhash}->{$key}) ); } $keydata->add(%{$dbhash}->{$_}) for @key; my $digest = $keydata->hexdigest; $newdbhash{$digest} = {%{$dbhash}}; } return %newdbhash; }
    My main reason for doing it this way is that I have NO unique key, so I rely on several columns together to provide me with a unique key. Because the above code is incomplete its probably useless, but the concept is there. Once you have your two hashes ( before and after ) compare the sums.
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
Re: Verify database consistency after merge
by jeanluca (Deacon) on Oct 16, 2006 at 20:32 UTC
    If the tables have a simular structure, just use a merge-table!

Re: Verify database consistency after merge
by bduggan (Pilgrim) on Oct 17, 2006 at 13:26 UTC
    I'm faced with a similar task -- a radical transformation of the way data is represented that in the end should yield equivalent result sets from very different queries. I've found Test::Differences to be very handy. (In fact one of the first examples for that module is comparing result sets.)
Re: Verify database consistency after merge
by nmerriweather (Friar) on Oct 17, 2006 at 15:34 UTC
    Two IMPORTANT things you should note:

    a- Take this time to consider migrating to Postgres . I would never trust anything financial or critical to MySQL. Their integrity blows.

    b- If you are staying in MySQL, MAKE SURE that you are using AT LEAST v5.0 so you can enable 'TRADITIONAL' mode.

    TRADITIONAL mode does some crazy things-- like require text to be valid for the field you stuff it in. Otherwise, mysql will start doing things like trimming and inserting the largest substring of your field it can fit into a column, without calling an error. ( real databases call fatal errors on stuff like that ). Mysql will also change data types without telling you, along with a long list of horrendous things. It was built for speed and widespread use-- not for correctness of data.
      Thanks for the advice regarding Postgres.

      I'm aware that there are many issues with MySQL. But as with any tool, used appropriately, it works fine for this particular application. I perform analysis on the data and are much more dependent on small data size and raw speed than transactional behavior. MySQL has proven ideal for this.
Re: Verify database consistency after merge
by arc_of_descent (Hermit) on Oct 17, 2006 at 17:44 UTC

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://578584]
Approved by Corion
Front-paged by astaines
[marto]: Morning, looks like our client office will be closed due to bad weather

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (8)
As of 2018-01-17 07:51 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (196 votes). Check out past polls.