http://www.perlmonks.org?node_id=494094

A couple of weeks ago I released Data::Sync. The basic idea behind the module is to construct a simple open source metadirectory. This makes 'copy from one database to another' applications very quick and easy to develop. An example is set out below:

You have two databases as follows:

Database1:

NAME VARCHAR(30), POSTALADDRESS VARCHAR(50), TELEPHONE VARCHAR(15), CUSTOMERNO NUM

Database2:

FULLNAME VARCHAR(30), ADDRESS VARCHAR(100), HOMEPHONE VARCHAR(20), OFFICEPHONE VARCHAR(20)

and you want to copy name, address and phone number for those entries from database1 whose customer number is less than 1000. The code would look something like this (assuming SQLite for convenience):

use strict;
use Data::Sync;
use DBI;

my $db1 = DBI->connect("DBI:SQLite:dbname=db1");
my $db2 = DBI->connect("DBI:SQLite:dbname=db2");

my $sync = Data::Sync->new();

$sync->source($db1, {select => "SELECT NAME,POSTALADDRESS,TELEPHONE FROM sourcetable WHERE CUSTOMERNO < 1000"});
$sync->target($db2, {table => "targettable", index => "FULLNAME"});
$sync->mappings(NAME          => "FULLNAME",
                POSTALADDRESS => "ADDRESS",
                TELEPHONE     => "HOMEPHONE");
$sync->run();

This will copy all records matching the select statement from the source database to the target database. If the entry exists, it will be updated - if it doesn't exist it will be created.
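The behaviour is the familiar update-or-insert pattern. As a rough illustration (this is not a paste from the module's internals, just the hand-rolled DBI equivalent), assuming $fullname, $address and $phone hold one mapped record:

# Try to update an existing row keyed on the index field; if no row
# was affected, insert a new one instead.
my $rows = $db2->do(
    "UPDATE targettable SET ADDRESS=?, HOMEPHONE=? WHERE FULLNAME=?",
    undef, $address, $phone, $fullname
);
# do() returns '0E0' (true but numerically zero) when no rows matched
if (defined $rows and $rows == 0) {
    $db2->do(
        "INSERT INTO targettable (FULLNAME, ADDRESS, HOMEPHONE) VALUES (?,?,?)",
        undef, $fullname, $address, $phone
    );
}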

Nothing is done until the 'run' method is called - that calls the read, mappings, buildattributes, transforms, and write methods in turn (see below for transforms and buildattributes).

But perhaps the name attribute is formatted differently between the databases. Perhaps db1 has "Firstname Lastname", whereas db2 has "Lastname, Firstname". You can overcome this with a transformation:

$sync->transformations(FULLNAME=>'s/(\w+)\s+(\w+)/$2, $1/');
(You can pass a single-quoted string representation of a regex, a coderef, or a plain string here - plain strings name predefined transformations like 'stripnewlines' and 'stripspaces'; see the perldocs. Note that transformations act on the MAPPED names of the attributes - i.e. the names in the target, not the names in the source.)
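For example, the same rewrite as a coderef (a sketch - this assumes the coderef is handed the attribute value and returns the transformed value):

$sync->transformations(
    FULLNAME => sub {
        # "Firstname Lastname" -> "Lastname, Firstname"
        my ($first, $last) = split /\s+/, shift, 2;
        return "$last, $first";
    }
);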

You can also specify an LDAP handle rather than a DBI handle for source or target, in which case the syntax changes to reflect the difference. The perldocs detail this (and it's more specialised, so I've omitted discussion here), but there's one point that's worth illustrating:

Writing to an LDAP directory requires a DN (Distinguished Name) to identify the record, and an objectclass. Reading from DBI to LDAP, you're unlikely to have either, so the 'buildattributes' function allows you to create them:

$sync->buildattributes(
    dn          => 'cn=%NAME%, ou=container,dc=testorg,dc=net',
    objectclass => 'organizationalPerson'
);

Once created, you can use transformations on this attribute like any other, so more complex processing can be done.
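For example, reusing the quoted-regex form from earlier, you could tidy the spacing in the generated DN:

# Normalise 'cn=%NAME%, ou=container,...' to 'cn=%NAME%,ou=container,...'
$sync->transformations(dn => 's/,\s+/,/g');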

Rationale

This is all analogous to a number of commercial products: Critical Path IMD, Maxware DSE, and IBM Metamerge, among others. Implementing these types of systems is how I normally earn my keep. I figured that, given my fondness for Perl, I was going to end up implementing something like this eventually, so I decided to do it while I have free time between projects - neatly avoiding implementing it within a project and hence being unable to release or reuse it. The upshot is that, unlike many CPAN modules, this code, although thoroughly tested with Test, is not in live use. AFAIK, anyway - someone might have adopted it in the last couple of weeks, I suppose. There are undoubtedly criticisms and comments that can be raised against the code as well as the syntax. All comments are welcome.

I'm currently at work on a GUI for job definition, and a standalone "run this saved job" script, to allow sync jobs to be created and used without writing any Perl code. At this point, though, I'm polling for opinions and feedback - is there any functionality people would like to see included, or any other comments you may have?

--------------------------------------------------------------

$perlquestion=~s/Can I/How do I/g;

Re: RFC: Data::Sync
by dragonchild (Archbishop) on Sep 22, 2005 at 16:00 UTC
    A few notes:
    • Your distribution doesn't provide a MANIFEST, META.yml, or Build.PL. Since Module::Build will create a standalone Makefile.PL for you using the create_makefile_pl => 'traditional' option, there's no reason not to use it.
    • If someone has set RaiseError on their $dbh, you will throw all those errors. Wouldn't it make more sense to wrap everything in an eval block and check $@? (See the sketch after this list.)
    • You cannot depend on the return value of $sth->execute(). According to the DBI documentation, "The execute method does not return the number of rows that will be returned by the query (because most databases can't tell in advance), it simply returns a true value." You explicitly check for '0E0' where you should be checking for simple truth.
    • You make the same mistake regarding the return value of execute() in your write function.
    • The cost of doing an UPDATE and then an INSERT can become prohibitive. You should look at DBD-specific options, such as REPLACE for MySQL.
    • Every RDBMS in existence has a way of doing bulk-pulls and bulk-loads from some xSV file. It might be worth your while to exploit that if you know your source and destination are both RDBMSes. My experience has been that bulk-loads perform over 100x faster than DBI queries. You can always use File::Temp for temporary filehandles.
    • Your code currently just makes sure that everything in A is in B. You don't validate the other way around. This may be by design, but "synchronization", to me, means that A and B will provide the same information. Something akin to MySQL's replication would be more synchronization, to me.
    • A design note - I would have gone about it a different way:
      1. Define an API that every class wrapping a datastore must adhere to. write(), read(), temporary files, etc.
      2. Provide a class that implements that API for every type of datastore supported. For example, LDAP, DBI::generic, DBI::mysql, etc.
      This would simplify testing and implementation. Plus, others can provide optimized implementations for their favorite datastore and you don't have to change the controller. In fact, I could write one for my specific needs and not have to give it back to the community (if, for instance, my employer is an @$$hole who doesn't believe in contributing to OSS but doesn't mind using it). Currently, you have to modify the main module, which is fraught with danger. (A rough sketch of such an API follows below.)
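
    To illustrate the RaiseError and execute() points above, a minimal defensive sketch (variable names are hypothetical):

    # Trap exceptions in case the caller's $dbh was created with RaiseError,
    # and cope with plain undef-on-error handles as well.
    my $rows = eval { $sth->execute(@bindvalues) };
    if ($@ or not defined $rows) {
        warn "execute failed: " . ($@ || $sth->errstr);
        return;
    }
    # $rows is true on success - possibly '0E0', which is true but
    # numerically zero - so test for truth, not for the string '0E0'.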
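
    And a rough sketch of the per-datastore API suggested in the design note (all names here are hypothetical, not part of Data::Sync):

    package Data::Sync::Store;                 # abstract datastore interface
    sub new   { my ($class, %args) = @_; return bless {%args}, $class }
    sub read  { die ref(shift) . " must implement read()" }
    sub write { die ref(shift) . " must implement write()" }

    package Data::Sync::Store::DBI;            # one concrete implementation
    our @ISA = ('Data::Sync::Store');

    sub read {
        my $self = shift;
        my $sth  = $self->{dbh}->prepare($self->{select});
        $sth->execute;
        return $sth->fetchall_arrayref({});    # arrayref of hashrefs
    }

    sub write {
        my ($self, $records) = @_;
        # upsert each record into $self->{table} here
    }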

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: RFC: Data::Sync
by jZed (Prior) on Sep 22, 2005 at 14:15 UTC
    Looks very useful, g0n++. Am I right in thinking that this only applies to DBI? If so, I'd really suggest a name in the DBIx::* hierarchy, since Data::, in my mind, applies to all kinds of data, not just DBI. I suggest you drop a note to dbi-users@perl.org mentioning the URL of this node so the folks over there can get in on the RFC. Also, you should make explicit what happens in these cases: a) when you try to merge a column of one type in one db with a column of another type in the second db, and b) when the length of the data in one db differs from that in the second, e.g. you try to put a VARCHAR(50) value into a VARCHAR(10) column.

    Also, is this meant to work only on two databases in the same RDBMS, or is it meant to be able to go, e.g., from a MySQL database to a Pg database?

      thanks :-)

      It doesn't only apply to DBI: it currently works only with DBI & Net::LDAP handles (and the two are handled differently by the code). I did think about a DBIx::* namespace, but as well as supporting LDAP it's intended to develop to cover as many data source types as possible. I'm open to rename suggestions, but I wanted to use as descriptive and general a name as possible.

      /me slaps forehead - behaviour isn't defined for length and type mismatches. Those are going on the TODO list now and will be addressed ASAP.

      In theory you should be able to go from one RDBMS type to another - I've run from SQLite->LDAP, LDAP->SQLite, and LDAP->DBD::CSV so far.

      --------------------------------------------------------------

      $perlquestion=~s/Can I/How do I/g;

Re: RFC: Data::Sync
by diotalevi (Canon) on Sep 22, 2005 at 14:46 UTC

    I skimmed your code and wanted to pick some nits with you. None are serious. The rest of your code was too sparse to skim and I couldn't read it.


    If you plan to be backwards compatible with 5.005, you do not use warnings and you use vars instead of our(). If you are going to be 5.6+, you may use both. Doing only one is strange and it looks like you don't understand the reasons for doing things one way or the other.

    use warnings;               # requires perl 5.6+
    use vars qw($VERSION);      # pre-5.6 idiom; on 5.6+ you'd write our $VERSION
    $VERSION = "0.02";

    Your warning is technically correct but irrelevant. The order of keys() always matches the order of values(). Even in 5.8.x+ with key order randomization, this is true. This can be rewritten as two simple statements.

    # DANGER! this initially used keys %$line, but order can't be guaranteed.
    # my (@keys,@values);
    # for (keys %$line)
    # {
    #     push @keys,$_;
    #     push @values,$$line{$_}
    # }

    my @keys   = keys   %$line;
    my @values = values %$line;

    Instead of transforming a list of indexes into a list of "?" characters, it's easier to just create that many "?" in one operation. It's faster and wastes less memory too. The parentheses are important: they mean you get back a list of "?" elements instead of a single "????????" string.

    # map { "?" } (0..scalar @values-1) ( "?" ) x @values
Re: RFC: Data::Sync
by diotalevi (Canon) on Sep 22, 2005 at 14:28 UTC

    This is completely missing timestamp checks. The database can be timestamped so you know whether to look deeper or not. Same with the tables; same with the rows. Then the values are marked with sequence numbers so you know which ones have been updated and which haven't. This gives you the possibility of handling a field-level change to the same row in multiple copies.

    You'd need to make notes somewhere about stuff that'd been deleted. More timestamps.

    Further, using timestamps allows a person to work on only the changed portion. This is of prime importance when the data set is too large to do a complete compare in a reasonable way or when it is too large to comfortably hold in memory.

    So what's your method for comparing? Something approximating

    foreach ( @src ) { $tgt = fetch_this( $_->id ); update() if $_ ne $tgt }

    is clearly not good enough for anything except very small data sets, or people with memory and bandwidth to burn.
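
    For contrast, a timestamp-driven incremental pull might look roughly like this (the last_modified column and the helpers are hypothetical):

    # Read only the rows changed since the last run, instead of comparing
    # the full source and target data sets.
    my $last_run = load_last_run_time();       # hypothetical: persisted marker
    my $sth = $dbh->prepare(
        "SELECT NAME, POSTALADDRESS, TELEPHONE FROM sourcetable
         WHERE last_modified > ?"
    );
    $sth->execute($last_run);
    while (my $row = $sth->fetchrow_hashref) {
        flow_to_target($row);                  # hypothetical: map and upsert
    }
    save_last_run_time(time());                # marker for the next delta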

Re: RFC: Data::Sync
by xdg (Monsignor) on Sep 22, 2005 at 14:31 UTC

    Definitely interesting and worthwhile. The LDAP capability is interesting -- I suspect most other code like this is really specialized towards databases. These are the questions/comments that come to mind:

    • How does this compare to other modules of this type? (Quick CPAN search reveals DBIx::Migrate and Class::DBI::DataMigration)
    • I agree that "Data::Sync" doesn't seem quite right. First, the name is extremely general, whereas the module is specific to databases and LDAP, not any type of data. Second, it does a one-way merge-update, which is only a narrow, almost colloquial meaning for "sync". For "sync" I'd imagine something that was bi-directional and helped manage conflicts.
    • What kind of performance does it have for large datasets?

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: RFC: Data::Sync
by nothingmuch (Priest) on Sep 22, 2005 at 22:32 UTC
Re: RFC: Data::Sync
by mantadin (Beadle) on Apr 25, 2006 at 14:53 UTC
    Hi g0n, this seems to be a very useful thing.

    As far as I understand it, this module helps to copy complete entries from one database to another - now I wonder if it can be used (or extended) to update a database from the changelogs of another one.

    Consider this scenario:
    1. I have two LDAP directories with different schemas
    2. and want to sync changes from one to the other
    3. the "source ldapd" writes LDIF files containing the changes made to the entries of its DIT
    4. I want to apply the changes (and only the changes) to the target LDAP directory (doing the schema mapping on the entries before applying them to the target)
    5. so that would yield a connector, reading the changelogs with tail -f and constantly feeding the mapped changes, as a daemon, to the target directory
    Before I learned of your module, I had found that Net::LDAP can read LDIF, including the format for changing entries - so currently I create Net::LDAP::Entry objects with Net::LDAP::LDIF, apply my mapping rules to them, and finally write them to the target directory.
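
    In trimmed form (host, bind details and the mapping hook are placeholders), that currently looks something like:

    use Net::LDAP;
    use Net::LDAP::LDIF;

    my $target = Net::LDAP->new('target.example.org') or die $@;
    $target->bind('cn=admin,dc=example,dc=org', password => 'secret');

    # Read change records (changetype: add/modify/delete) from the changelog.
    my $ldif = Net::LDAP::LDIF->new('changes.ldif', 'r', onerror => 'warn');
    until ($ldif->eof) {
        my $entry = $ldif->read_entry or next;
        apply_schema_map($entry);              # placeholder: remap attributes
        my $mesg = $entry->update($target);    # replay the change on the target
        warn $mesg->error if $mesg->code;
    }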

    I wonder if your module can be used for something like that? Did you think of such a use case?

      I didn't implement direct changelog reading because there are multiple approaches to changelogs: OpenLDAP uses a changelog file, whereas SunONE/iPlanet and various others implement 'cn=changelog' objects in the directory root, and their formats vary.

      As far as I know, the recommended approach to change detection in LDAP these days (since no two directory vendors ever managed to agree on a changelog standard) is persistent searching.

      There are two ways of doing this using Data::Sync as it stands:

      • Run a persistent search against your source DSA, detecting those changes you're interested in and flowing them.
      • Run a periodic search of all the objects you are interested in, and hash them for changes within Data::Sync.

      The latter approach has been tested, but carries a hefty search overhead. The former works in theory, but hasn't been tested (I have only very recently got a fully v3-compliant DS set up for testing).
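
      For the record, the former would look roughly like this (untested, as I say - the host, base, filter and callback body are placeholders):

      use Net::LDAP;
      use Net::LDAP::Control::PersistentSearch;

      my $ldap = Net::LDAP->new('source.example.org') or die $@;
      $ldap->bind;    # anonymous bind for the sketch

      my $psearch = Net::LDAP::Control::PersistentSearch->new(
          changeTypes => 15,   # add + delete + modify + modDN
          changesOnly => 1,    # don't replay the existing entries first
          returnECs   => 1,    # include entry change notification controls
      );

      # Blocks, dispatching each change to the callback as it arrives.
      $ldap->search(
          base     => 'ou=container,dc=testorg,dc=net',
          filter   => '(objectClass=organizationalPerson)',
          control  => [ $psearch ],
          callback => sub {
              my ($mesg, $entry) = @_;
              return unless defined $entry;    # final call carries no entry
              # remap and flow $entry to the target here
          },
      );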

      An alternative approach (and the one I would favour for simplicity I think) would be to read LDIF as if it were an LDAP server - the code extensions to do that are fairly simple. That way your code could read the changelog file to pick up changes, perform the remapping, and write them direct to the target DS.

      The code changes for that are fairly straightforward, and would be very useful in any case - thanks for the suggestion.

      --------------------------------------------------------------

      "If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
      John Brunner, "The Shockwave Rider".


        Thank you very much for the quick reply. You wrote "An alternative approach (and the one I would favour for simplicity I think) would be to read LDIF as if it were an LDAP server".

        I will try it that way and post again, as soon as I have some useful results.

Re: RFC: Data::Sync
by g0n (Priest) on Sep 25, 2005 at 12:36 UTC
    Thanks for all your responses, especially dragonchild and diotalevi for taking the extra trouble to look at the source. The comments and suggestions have really helped to focus and prioritize further development.

    xdg: here's the DProf output for a sample run on my laptop - a 3-field database in SQLite with 20,000 records:

    Total Elapsed Time = 267.0170 Seconds
      User+System Time = 202.9570 Seconds
    Exclusive Times
    %Time ExclSec  CumulS  #Calls sec/call Csec/c Name
     92.8   188.4  188.40   39992   0.0047 0.0047 DBI::st::execute
     1.71   3.470  201.47   20002   0.0002 0.0101 Data::Sync::__ANON__
     1.45   2.950   2.950   39992   0.0001 0.0001 DBD::SQLite::st::_prepare
     1.02   2.070   2.070   39994   0.0001 0.0001 DBI::_setup_handle
     0.78   1.590   1.590  139990   0.0000 0.0000 Data::Sync::log
     0.66   1.340   3.670   39994   0.0000 0.0001 DBI::_new_handle
     0.29   0.590   4.260   39992   0.0000 0.0001 DBI::_new_sth
     0.29   0.580   7.790   39992   0.0000 0.0002 DBD::SQLite::db::prepare
     0.25   0.500   8.290   39992   0.0000 0.0002 DBI::db::prepare
     0.21   0.420   0.710       1   0.4200 0.7100 Data::Sync::remap
     0.14   0.280   0.280   79986   0.0000 0.0000 DBI::common::DESTROY
     0.13   0.260   0.260   39994   0.0000 0.0000 DBI::st::TIEHASH
     0.13   0.260   0.870       1   0.2600 0.8699 Data::Sync::sourceToAoH
     0.11   0.230   0.410   20000   0.0000 0.0000 Data::Sync::recursiveTransform
     0.11   0.220   0.630       1   0.2200 0.6300 Data::Sync::runTransform

    As you can see, it wasn't particularly quick, but the bulk of the time was spent in DBI::st::execute.

    --------------------------------------------------------------

    $perlquestion=~s/Can I/How do I/g;