Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Reconciling multiple lists (similar to "merge" in CVS?)

by japhy (Canon)
on Jan 05, 2007 at 22:33 UTC ( #593215=perlquestion: print w/replies, xml ) Need Help??
japhy has asked for the wisdom of the Perl Monks concerning the following question:

Let's say I have a three lists of strings:
my @sets = ( [qw( a c f g e h i )], # X [qw( a b c d e f g h i )], # Y [qw( a c d e f g h i )], # Z );
What I'd like to accomplish is to display these side-by-side, probably in the following manner:
X Y Z a a a b c c c d d e e f f f g g g e h h h i i i
To do this, I need to reconcile the discrepancies between the lists and come up with an array of arrays holding the indexes. In this example, the index array would be:
my @idx = ( [0, undef, 1, undef, undef, 2, 3, 4, 5, 6], # X [0, 1, 2, 3, 4, 5, 6, undef, 7, 8], # Y [0, undef, 1, 2, 3, 4, 5, undef, 6, 7], # Z );
I've chosen the order because sets Y and Z have the ordering of (d, e, f, g) in common, and set X is the odd one with (f, g, e).

So my question is, is this a job for a diff-like module, or some CVS-merge-like module, or what? The idea is to show the lists (which are very similar but probably differ in a couple places here and there) side-by-side and match them up as much as possible without altering their order. Or, put another way, I want to insert "null" elements into the lists as necessary to get them to line up such as I have them displayed.

Update: another problem I might run into is each set having a unique element:

my @sets = ( [qw( a b c f g )], # X [qw( a b d f g )], # Y [qw( a b e f g )], # Z );
resulting in:
X Y Z a a a b b b c d e f f f g g g

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Replies are listed 'Best First'.
Re: Reconciling multiple lists (similar to "merge" in CVS?)
by belg4mit (Prior) on Jan 05, 2007 at 23:24 UTC
    I vote diff like module. Also, c, d and e on separate lines is the Right Way IMHO (they aren't the same thing afterall).

    --
    In Bob We Trust, All Others Bring Data.

      It looks helpful, especially the tabular output format. But it only handles two lists at a time. I'm trying to determine how to take DIFF(X,Y), produce two new modified lists, and then do something with those and Z to balance all three.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
        Text::Diff3, though it's hard to tell from the POD how useful it is. Algorithm::Merge looks promising though.

        UPDATE: Also, you'll have to bug tye to add diff3 support to Algorithm::Diff, or the Text::Diff author to alternately use Algorithm::Merge.

        --
        In Bob We Trust, All Others Bring Data.

Re: Reconciling multiple lists (similar to "merge" in CVS?)
by jdporter (Canon) on Jan 05, 2007 at 23:45 UTC

    The hard part in what you're talking about is determining an appropriate display order. Your example illustrates that the non-trivial case can occur. Therefore, you probably want a topological sort (e.g. by Sort::Topological or the topological_sort method of Graph) to generate an optimal order. Once you've done that, the rest is a cinch.

    But just having topological sort doesn't solve all the problems. You first need to generate an appropriate DAG based on the input lists. You could build a graph, adding each input list, such that each edge in the input list increments the corresponding edge in the graph. When that's done, invert the weight (count) of each edge, and extract a minimum spanning tree (also provided by Graph) from the graph. Use that as the input to the topological sort.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re: Reconciling multiple lists (similar to "merge" in CVS?)
by rinceWind (Monsignor) on Jan 05, 2007 at 22:51 UTC

    Have a look at VCS::Lite. This might suffice, but I'm not sure exactly what you want.

    The module is normally used to work with files of lines of text, but you can just as easily construct VCS::Lite objects from an array.

    --

    Oh Lord, won’t you burn me a Knoppix CD ?
    My friends all rate Windows, I must disagree.
    Your powers of persuasion will set them all free,
    So oh Lord, won’t you burn me a Knoppix CD ?
    (Missquoting Janis Joplin)

Re: Reconciling multiple lists (similar to "merge" in CVS?)
by shigetsu (Hermit) on Jan 06, 2007 at 00:21 UTC
    To do this, I need to reconcile the discrepancies between the lists and come up with an array of arrays holding the indexes.
    Have a look at List::Compare. Following code does build a hash of arrays holding the indexes (you may need to replace the ' ' string (being used for undef values) with undef; I used it in favor of undef, because undef would garble the output). Basically, it should work for an infinite amount of sets.
    #!/usr/bin/perl use strict; use warnings; use List::Compare; my (%have, @seen, %is_unique); my @sets = ( [qw( a c f g e h i )], # X [qw( a b c d e f g h i )], # Y [qw( a c d e f g h i )], # Z ); my $lcm = List::Compare->new(@sets); foreach my $set (@sets) { my %seen = map { $_ => 1 } @$set; push @seen, \%seen; } foreach ($lcm->get_intersection) { $have{$_} = [ 0..$#sets ]; } my %index = map { $_ => undef } 0..$#sets; foreach my $i (0..$#sets) { foreach my $elem ($lcm->get_unique($i)) { my @list; foreach (keys %index) { $list[$_] = $index{$_} } $list[$i] = $i; $have{$elem} = [ @list ]; $is_unique{$elem}++; } } foreach my $elem ($lcm->get_nonintersection) { next if $is_unique{$elem}; my @list; foreach (keys %index) { $list[$_] = $index{$_} } foreach my $i (0..$#sets) { $list[$i] = $i if $seen[$i]{$elem} } $have{$elem} = [ @list ]; } foreach (sort keys %have) { foreach my $i (0..@{$have{$_}}) { print defined $have{$_}->[$i] ? "$_ " : ' '; } print "\n"; } __OUTPUT__ a a a b c c c d d e e e f f f g g g h h h i i i
      It looks close, but it's just telling me what lists have what elements, unless I'm mistaken. I don't want to re-arrange the elements in set X -- they're out of order and should stay that way.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Reconciling multiple lists (similar to "merge" in CVS?)
by Util (Priest) on Jan 05, 2007 at 23:20 UTC
      What do you mean by "sorted"? My example shows that one list might have elements in a different order.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

        Ack! I overlooked the unsorted elements in your data. By way of apology, please accept this working, tested code. It is somewhat under-documented and needs refactoring, but I am out of time tonight. For your two sample @sets, my code outputs:

        [ 0, undef, 1, undef, undef, 2, 3, 4, 5, 6 ], [ 0, 1, 2, 3, 4, 5, 6, undef, 7, 8 ], [ 0, undef, 1, 2, 3, 4, 5, undef, 6, 7 ],
        and:
        [ 0, 1, 2, undef, undef, 3, 4 ], [ 0, 1, undef, 2, undef, 3, 4 ], [ 0, 1, undef, undef, 2, 3, 4 ],
        .

        The code works by transforming all the lists into AoAs, with each element holding [$original_item, $original_index]. It merges the first two lists, via Algorithm::Diff (++belg4mit) to form elements [$original_item, $index_from_list1, $index_from_list2]. List 3 is then merged in, etc. I *think* it will work fine for *any* number of lists.

Re: Reconciling multiple lists (similar to "merge" in CVS?) (diffN)
by tye (Sage) on Jan 06, 2007 at 07:22 UTC

    This goes beyond what diff3 can do. diff3 relies on a "common ancestor" model in order to simplify what it does.

    There is no unique solution to this problem. You can go about finding the commonality in too many different ways that could result in different output (as your hinted at).

    But one way to approach this problem is to select one list and 'diff' each of the other lists against it. Then you merge (as in, use the merge algorithm) these N-1 'diff's (using position in the 'selected' list to control the merging), pushing out common elements when you merge a matched element and collecting new sublists when merging unmatched chunks. When you find a match M, before pushing it out, any collected sublists need to be diffed and merged and all but trailing unmatched elements pushed out, except that even trailing unmatched elements need to be pushed out for lists that participate in the match M.

    For example, start with these lists:

    To implement this, I'd always 'select' the last list so that $diff[0] is the diff between @{$seq[0]} and @{$seq[-1]}. And I'd make an object that lets me move down a diff one element (of the second list within the diff) at a time and knows how to collect the sublist. The top-level code might look something like this (to simplify the code I've transposed the output matrix from how you expected it):

    my @seq= GetSequences(); # ( \@seq0, \@seq1, \@seq2, ... ); my @out= DiffMerge( 1, @seq ); sub DiffMerge { my( $finish, @seq )= @_; my @diff= map { DiffToMerge->new( $_, $seq[-1] ) } @seq[ 0 .. $#seq-1 ]; my @out; for my $i ( 0 .. $#{ $seq[-1] } ) { my @row; my $same= 0; my %sublists; my @flush; for my $d ( 0 .. $#diff ) { for( $diff[$d]->SubList() ) { $sublists{$d}= $_ if $_; } if( $diff[$d]->Same($i) ) { $same++; push @flush, $d if $sublists{$d}; push @row, $diff[$d]->Shift(); } else { push @row, undef; } } if( @flush ) { for my $row ( DiffMerge( 0, values %sublists ) ) { my @subrow; for my $d ( keys %sublist ) { $subrow[$d]= $diff[$d]->SublistOffset() + shift @$ +row; } push @out, \@subrow; } FlushSublist( \@out, $_, $diff[$_] ) for @flush; } if( $same ) { push @row, $i; push @out, \@row; } else { my @r; $r[@diff]= $i; push @out, \@r; } } if( $finish ) { FlushSublist( \@out, $_, $diff[$_] ) for 0 .. $#diff; } return @out; }

    (Above code updated.)

    - tye        

      And here is rather ugly but working code:

      #!/usr/bin/perl -w use strict; use Algorithm::Diff 1.19 (); Main( @ARGV ); exit( 0 ); { package DiffToMerge; sub new { my $class= shift @_; my $diff= Algorithm::Diff->new( @_ ); $diff->Next(); # Start out at first chunk my $me= bless { diff => $diff, offset => 0, }, $class; if( $diff->Diff() ) { $me->{sublist}= [ $diff->Items( 1 ) ]; $me->{suboff}= $diff->Min( 1 ); } return $me; } sub Same { my( $me, $off )= @_; for( $me->{offset} ) { die if $off < $_; die if $_+1 < $off; $_= $off; } my $diff= $me->{diff}; if( $diff->Max(2) < $off ) { die if ! $diff->Next(); if( $diff->Diff() ) { $me->{sublist}= [ $diff->Items( 1 ) ]; $me->{suboff}= $diff->Min( 1 ); } if( ! $diff->Range(2) ) { die if $off <= $diff->Max(2); die if ! $diff->Next(); } die if $off < $diff->Min(2); die if $diff->Max(2) < $off; } return $diff->Same(); } sub Shift { my( $me )= @_; my $diff= $me->{diff}; die if ! $diff->Same(); return $diff->Min(1) + $me->{offset}++ - $diff->Min(2); } sub Sublist { my( $me )= @_; my $sublist= $me->{sublist}; return if ! $sublist || ! @$sublist; return $sublist; } sub SublistOffset { my( $me, $offset )= @_; return undef if ! defined $offset; shift @{ $me->{sublist} }; return $me->{suboff}++; } sub SublistIdxs { my( $me )= @_; my $beg= $me->{suboff}; $me->{suboff} += @{ $me->{sublist} || [] }; $me->{sublist}= []; return $beg .. $me->{suboff}-1; } } sub FlushSublist { my( $avOut, $d, $diff )= @_; for( $diff->SublistIdxs() ) { my @row; $row[$d]= $_; push @$avOut, \@row; } } sub DiffMerge { my( $finish, @seq )= @_; my @diff= map { DiffToMerge->new( $_, $seq[-1] ) } @seq[ 0 .. $#seq-1 ]; my @out; for my $i ( 0 .. $#{ $seq[-1] } ) { my @row; my %sublists; my @flush; for my $d ( 0 .. $#diff ) { my $same= $diff[$d]->Same($i); for( $diff[$d]->Sublist() ) { $sublists{$d}= $_ if $_; } if( $same ) { push @flush, $d if $sublists{$d}; $row[$d]= $diff[$d]->Shift(); } } if( @flush ) { for my $row ( DiffMerge( 0, values %sublists ) ) { my @subrow; for my $d ( keys %sublists ) { $subrow[$d]= $diff[$d]->SublistOffset( shift @$row + ); } push @out, \@subrow; } FlushSublist( \@out, $_, $diff[$_] ) for @flush; } $row[@diff]= $i; push @out, \@row; } if( $finish ) { for( 0 .. $#diff ) { FlushSublist( \@out, $_, $diff[$_] ) if $diff[$_]->Sublist(); } } return @out; } sub Main { my @seq= @_; @seq= qw( abdefhi azcdfghi abcjkgnhi ) if ! @seq; @seq= map [ /./gs ], @seq; my @out= DiffMerge( 1, @seq ); for( @seq ) { push @$_, '-'; } for( @out ) { for( @$_ ) { $_= '-' if ! defined $_; } print "( @$_ )\n"; } print $/; for( @seq ) { my @l= map shift @$_, @out; for( @l ) { $_= -1 if ! defined $_ || '-' eq $_; } print "( @$_[@l] )\n"; } }

      can be used as:

      % perl diffN.pl ( 0 0 0 ) ( 1 - 1 ) ( 2 - ) ( 3 - ) ( 4 - ) ( - 1 ) ( - 2 2 ) ( - - 3 ) ( - - 4 ) ( - 3 ) ( - 4 ) ( - 5 5 ) ( - - 6 ) ( 5 6 7 ) ( 6 7 8 ) ( a b d e f - - - - - - - - h i ) ( a - - - - z c - - d f g - h i ) ( a b - - - - c j k - - g n h i ) % perl diffN.pl tye says japhy asked ( - - 0 ) ( - 0 ) ( - 1 1 0 ) ( - - 2 ) ( - - 3 ) ( 0 - - ) ( 1 2 4 ) ( - 3 - 1 ) ( - - - 2 ) ( 2 - - 3 ) ( - - - 4 ) ( - - - - - t y - - e - ) ( - s a - - - y s - - - ) ( j - a p h - y - - - - ) ( - - a - - - - s k e d )

      Update: And here is a case with a bug to be fixed:

      % perl diffN.pl encyclopedia cyclops enclosure wikipedia cyclone wick +ed lonely ( - - - 0 - 0 ) ( - - - 1 - 1 ) ( 0 - 0 - - - ) ( 1 - 1 - - - ) ( 2 0 2 - 0 2 ) ( - - - 2 - 3 ) ( 3 ) ( 4 ) ( - 1 ) ( - 2 ) ( - - - - 1 ) ( - - - - 2 ) ( 5 3 3 - 3 - 0 ) ( 6 4 4 - 4 - 1 ) ( - - - - 5 - 2 ) ( - - - 3 ) ( 7 5 - 4 ) ( - 6 5 - ) ( - - 6 - ) ( - - 7 - ) ( 8 - 8 5 6 4 3 ) ( - - - - - - 4 ) ( - - - - - - 5 ) ( 9 ) ( 10 ) ( 11 ) ( - - - 6 ) ( - - - 7 ) ( - - - 8 ) ( - - - - - 5 ) ( - - e n c - y c - - - - l o - - p - - - e - - d i a - - - - ) ( - - - - c - - - y c - - l o - - p s - - - - - - - - - - - - ) ( - - e n c - - - - - - - l o - - - s u r e - - - - - - - - - ) ( w i - - - k - - - - - - - - - i p - - - e - - - - - d i a - ) ( - - - - c - - - - - y c l o n - - - - - e - - - - - - - - - ) ( w i - - c k - - - - - - - - - - - - - - e - - - - - - - - d ) ( - - - - - - - - - - - - l o n - - - - - e l y - - - - - - - ) # ^ ^ ^ ^ ^ ^ < should merge > ^ ^ ^ ^ ^ ^ ^

      - tye        

        Playing around with some code for this, there are often multiple ways in which differences could be merged. For example, from your test above:

        encyclo pedia enc losur e #or encyclop edia enc lo sure

        To me, these are both equally legitimate, but is there any criteria upon which the one could be determined to be preferrable to the other?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reconciling multiple lists (similar to "merge" in CVS?)
by Popcorn Dave (Abbot) on Jan 06, 2007 at 01:56 UTC
    This may be a long winded way of doing it, but what about this?

    1.Dump all your arrays in to a hash to get frequency of the letters

    2. Rebuild array comparing original array element and hash key, and checking if value is < 1 then that element is undef.

    Unless I've misunderstood what you want, doesn't that do what you're after?

    Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

    If quizzes are quizzical, what are tests?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://593215]
Approved by planetscape
Front-paged by tye
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (7)
As of 2018-06-21 16:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?



    Results (118 votes). Check out past polls.

    Notices?