Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Compare/Diff on nested data structures

by cosmicperl (Chaplain)
on Mar 21, 2009 at 22:35 UTC ( [id://752303]=perlquestion: print w/replies, xml ) Need Help??

cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I've been searching CPAN for a solution but with no luck. Data::Compare will compare the 2 but only give me a true/false as to if they match or not.
I need a new data structure returned that represents the differences between the two. Before I go ahead and write one (probably upload to CPAN as well) does anyone know of something that already exists?


Lyle

Update: I started to write a module, when coming up with a name I came across Data::Diff. Which looked like exactly what I needed, but unfortunately on testing it doesn't maintain the order or arrays. Such as it would class [1,2,3] and [3,2,1] the same.

Replies are listed 'Best First'.
Re: Compare/Diff on nested data structures
by ELISHEVA (Prior) on Mar 22, 2009 at 04:40 UTC

    I would not advise using Data::Diff - in addition to the problem that you have noticed, it doesn't handle circularities in data structures (always a risk in nested data structures)

    You may want to try Test::Deep. If you don't want to include the Perl testing framework (e.g. Test::More etc) it has a version called Test::Deep::NoTest. It should give you detailed output on the differences between two structures and plays nicely with hashes, arrays, objects, bags, and a host of other data structure variants.

    It uses a lot of exports and syntactic sugar. If that is not to your taste, you could always write some wrapper functions around it. Place those functions in their own module, and use Test::Deep (or Test::Deep::NoTest only in that module. It would be a lot easier than writing your own module. As has already been stated, writing general purpose data exploration algorithms is non-trivial.

    This is one of those problems that looks simple, but isn't. Aside from the points already mentioned by grandfather and graff, a general purpose data comparison algorithm has to (a) detect and gracefully handle circularities in a data graph (b) decide whether or not to treat object references as opaque (c) call a list of selected methods (possibly with parameters) to compare opaque objects.

    If you are using the comparison for business or scientific reports, a hand-rolled data comparison module would also have to take into account several questions about the interpretation of differences. Do all keys/array elements matter? They might not if some of the data is transient and you only care about the part that persists when you dump things out to YAML or XML. Does order of elements in an array matter? If the array is used to implement a "bag" it shouldn't. If hash keys are missing, does that matter? Should hash keys be compared if one of the hashes assigns undef? If you only want to compare "known" or "applicable" data, then probably not.

    If Test::Deep doesn't do what you want, perhaps you could describe the specific goals and data structures for which you want to do comparisons. CPAN has many, many data comparison modules for specific kinds of data structures. With a bit more focus to your question, we might be able to suggest one of them.

    Best, beth

      Just had a go with Test::Deep::NoTest, unfortunately cmp_deeply doesn't work with NoTest (got errors), only eq_deeply which isn't any different to Data::Compare...


      Lyle
Re: Compare/Diff on nested data structures
by GrandFather (Saint) on Mar 21, 2009 at 22:58 UTC

    Actually, what you are really want is diff rather than compare, but it is a non-trivial problem.

    I suspect that the reason you are having trouble finding anything is that the way differences are represented are likely to be highly application specific. While a pass/fail (compare) test is fairly unambiguous with a result that is easy to interpret, how are you going to represent the difference between two data structures? As a list of insertions/deletes to be performed on one that will result in the other? As a list of nodes that are present in one, but not the other? As two lists of missing nodes, one for each structure? Something else?


    True laziness is hard work
      Actually, what you are really want is diff rather than compare, but it is a non-trivial problem.
      That was implied in the title :)
Re: Compare/Diff on nested data structures
by graff (Chancellor) on Mar 22, 2009 at 01:59 UTC
    As GrandFather said, you are looking at non-trivial task. There are so many ways in which two "nested data structures" can differ, you would need to break the task down into stages:
    1. The "root" data types of the two structures have to be the same (can't compare an array to a hash)
    2. For the top-level unit, determine if the number of elements is same or different; for hashes, also determine which keys they have in common and which keys are unique to each hash
    3. Compare the values of the common elements:
      1. If the Nth elements in both arrays (or the values for key "X" in both hashes) are both scalars, do the two scalars match or not?
      2. If only one is a scalar and the other is a reference, how do you want to view that?
      3. If both are references, do they refer to the same type of thing?
        1. If so, you would probably check to see if they both actually point to the same memory address, and if they don't, recurse to compare the contents of each reference.
        2. If they point to different kinds of things, how do you want to view that?
    4. Figure out how you want to view the elements that are unique to one or the other structure, given that these things themselves could be arbitrarily complex.

    Or, maybe you just need to be more explicit about the kinds of things you really want to compare (e.g. only isomorphic structures, where the possible differences are limited to "leaf-node" scalar values, or some similar constraint), and what purpose(s) would be served by your enumeration of differences (because the form of results should follow/support the intended function).

Re: Compare/Diff on nested data structures
by NiJo (Friar) on Mar 22, 2009 at 17:01 UTC
    'Representing the differences' is not overly specific on the output side of things, espescially as two values must go where the original has only one. A viable solution depends on the consumer of your data.

    If it is for humans, you could be done quickly by 'diff'ing output of Data::Dumper.

    For automated processing you could walk the structures from bottom to top and prune matching data on both sides. Your application most likely already has a walking algorithm for the specific structure. The result is two smaller structures.

    Solving the general case without knowledge of the structure is a lot more difficult.

      Below is the code to produce difference -- with help of diff(1) -- in two scalars, to be understood by a human (well, at least the ones who can read diff(1) output).

      package DumpDiff; # Usage: # # use DumpDiff qw[ dump_diff ]; # print dump_diff( [ 0 .. 5 ] , [ 3 .. 12 ] , "-u -U 8"); use warnings; use strict; use Exporter 'import'; BEGIN { sub dump_diff; our @EXPORT = qw[ dump_diff ]; } use Carp qw[ croak ]; use Data::Dumper; use Fatal qw[ :void open close ]; use File::Temp; # Pass in two scalar to get the difference. Optional third parameter # is passed as is to diff(1), instead the default of "-uBb". # # diff(1) output is returned as either a list or a string based on # calling context. sub dump_diff { my ( $one , $two , $opt ) = @_; local $Data::Dumper::Sortkeys = 1; local $Data::Dumper::Indent = 1; local $Data::Dumper::Deepcopy = 1; my $oh = File::Temp->new; print $oh Dumper( $one ); close $oh; my $th = File::Temp->new; print $th Dumper( $two ); close $th; my $one_dump = $oh->filename; my $two_dump = $th->filename; $opt ||= '-uBb'; my @diff = qx{diff $opt $one_dump $two_dump}; if ( $? == 0 && !$! ) { return; } elsif ( $! ) { croak "cannot run 'diff $opt $one_dump $two_dump'': $!"; } return wantarray ? @diff : join '' , @diff ; } 1;
        It might be useful to use Data::Dumper as text output for each comparable data structure, and do an actual unix-diff on each. This has the limitations of 1) treating both objects and code references as opaque, 2) relying on a unix command.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://752303]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (2)
As of 2024-05-26 06:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found