http://www.perlmonks.org?node_id=191668

I've been using Algorithm::Diff a bit lately. I'm really glad we have this module and I like it in a lot of ways. But I've decided that it is okay for me to hate the interfaces provided.

Just some quick background. Algorithm::Diff takes two lists and returns what /usr/bin/diff would return if those were two lists of lines (often called "two text files"). You can get this 'difference' information in three different formats.

  1. a list containing the contents of the lines that are the same
  2. a call to one of three subroutines for each line
  3. a list of lists of lists of information about each line that is different

Format (1) isn't what I want (most of the time) because it contains copies of the contents of the lines and not the line numbers.

Format (2) is a problem mainly because it is not defined in what order the calls will be made when you have several lines in each file between two blocks of "same" lines. If you have qw( a b c 1 2 3 4 x y z ) vs qw( a b c 7 8 9 x y z ), you know you get 'a' first, then 'b', then 'c'. But then you don't know if you'll get 1 2 3 4 7 8 9 or 1 7 2 8 3 9 4 or 7 8 1 2 9 3 4 or ...

Another problem is that you don't get qw( a b c ) together. Depending on what you are doing, you'll likely have to save up 'a', then 'b', then 'c' (luckily this isn't too bad since these are all calls to the same subroutine). Then, you have to have the other two subroutines note, "Oh, we hit the end of that sequence, we can deal with those lines you saved up in that other subroutine now". I've written some "slick" code that does this (more than once), but I wouldn't call it easy to understand and simple changes in how you want to process things can require major code rework so it certainly isn't easy to maintain.

Format (3) is better in several ways. But it has the potential to make huge numbers of tiny lists of tiny lists (that consume a lot of memory). And these lists contain copies of the line contents (which consume more memory). And the data is split up too much and so is hard to deal with in some ways.

So what I want is a list of line number ranges that coorespond to the runs of 'different' or 'same' lines between the files. So I extended it to provide exactly that and I'm posting it for public review before I finish the patch and forward it to the module owner.

First, here is some rough documentation on what you get back. Since we already have an interface called diff() that I find difficult to use, I've (temporarily) called my interface easy() (yes, I have to think of a better name). If you take this example invocation (mostly taken straight from the module documentation):

my @a= qw( a b c e h j l m n p ); my @b= qw( b c d e f j k l m r s t ); my @diff= easy( \@a, \@b );

@diff will contain 5 elements for each range of either 'different' or 'same' lines between the two lists. The 5 elements are $same, $aMin, $aMax, $bMin, and $bMax. $same is 1 if the ranges represent lines that the files have in common and 0 if the ranges represent differences. The other 4 values are line numbers (well, they are array indices and so are zero-based while line numbers would usually be one-based). @a[$aMin..$aMax] will give the lines from @a that fall into that range while @b[$bMin..$bMax] gives the lines from @b.

Note that $aMin..$aMax is the empty set when $aMax < $aMin (specifically, we'll always use $aMax == $aMin-1) and so @a[$aMin..$aMax] will also be the empty set.

So, for our example, @diff contains (I've used "=>" in place of "," where I want you to think ".."; so the code below is valid Perl that produces the same list as @diff):

# @a @b @a @b # same? range range values values ( 0, 0=>0, 0=>-1, # a ! 1, 1=>2, 0=> 1, # b c = b c 0, 3=>2, 2=> 2, # ! d 1, 3=>3, 3=> 3, # e = e 0, 4=>4, 4=> 4, # f ! h 1, 5=>5, 5=> 5, # j = j 0, 6=>5, 6=> 6, # ! k 1, 6=>7, 7=> 8, # l m = l m 0, 8=>9, 9=>11, # n p ! r s t );

Now the hope is that this will be easy to use. So let's try to use it. First, an HTML difference. (I've switched from @diff to $diff for efficiency and to show that it doesn't complicate the code much.)

my @a= qw( a b c e h j l m n p ); my @b= qw( b c d e f j k l m r s t ); my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; my $a= join "", @a[$aMin..$aMax]; if( $same ) { print $a; } else { print "<s>$a</s>" if "" ne $a; my $b= join "", @b[$bMin..$bMax]; print "<b>$b</b>" if "" ne $b; } }

which produces

<s>a</s>bc<b>d</b>e<s>h</s><b>f</b>j<b>k</b>lm<s>np</s><b>rst</b>

or "a b c d e h f j k l m n p r s t" (with extra spaces and emphasis added for clarity).

Second, "unified" output from 'diff':

$_ .= $/ for @a, @b; my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; if( $same ) { for( @a[$aMin..$aMax] ) { print " $_"; } } else { for( @a[$aMin..$aMax] ) { print "-$_"; } for( @b[$bMin..$bMax] ) { print "+$_"; } } }

which produces

-a b c +d e -h +f j +k l m -n -p +r +s +t

Thirdly, the most complex, "traditional" output from 'diff':

my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; $_++ for $aMin, $aMax, $bMin, $bMax; next if $same; my $sep= ''; if( $bMax < $bMin ) { print "$aMin,$aMax","d$bMax\n"; } elsif( $aMax < $aMin ) { print "$aMax","a$bMin,$bMax\n"; } else { $sep= "---\n"; print "$aMin,$aMax","c$bMin,$bMax\n"; } for( @a[$aMin-1..$aMax-1] ) { print "< $_"; } print $sep; for( @b[$bMin-1..$bMax-1] ) { print "> $_"; } }

which produces

1,1d0 < a 3a3,3 > d 5,5c5,5 < h --- > f 6a7,7 > k 9,10c10,12 < n < p --- > r > s > t

Well, I think those blocks of code are pretty simple, so I hope that means I've hit on a useful format for the information.

Here is the subroutine that returns the new format:

sub easy { my $a= shift; # array ref my $b= shift; # array ref my @diff= ( 1, 0, -1, 0, -1 ); traverse_sequences( $a, $b, { MATCH => sub { my( $aEnd, $bEnd )= @_; $_++ for my( $aTop, $bTop )= @diff[ 2-5, 4-5 ]; if( $aTop < $aEnd || $bTop < $bEnd ) { push @diff, 0, $aTop, $aEnd-1, $bTop, $bEnd-1, 1, $aEnd, $aEnd, $bEnd, $bEnd; } else { @diff[ 2-5, 4-5 ]= ( $aEnd, $bEnd ); } }, }, @_ ); my( $aEnd, $bEnd )= ( 0+@$a, 0+@$b ); $_++ for my( $aTop, $bTop )= @diff[ 2-5, 4-5 ]; if( $aTop < $aEnd || $bTop < $bEnd ) { push @diff, 0, $aTop, $aEnd-1, $bTop, $bEnd-1; } splice @diff, 0, 5 if $diff[2] < 0 && $diff[4] < 0; return wantarray ? @diff : \@diff; }

Thanks for reviewing this and thanks in advance for any suggestions.

        - tye (but my friends call me "Tye")

Replies are listed 'Best First'.
Re: Algorithm::Diff icult to use
by RMGir (Prior) on Aug 21, 2002 at 11:21 UTC
    Nice!

    Why one list, though? Wouldn't a LoL be simpler to use AND easily let the user see how many ranges there are? Well, I guess they can do scalar(@diff)/5, but that doesn't seem as intuitive.

    I don't know if the sublists would cause too much memory overhead, but it shouldn't be anywhere near as bad as the problem you cite with A::Diff.

    That's just a nit; it's nice and clean, and quite useable as is.
    --
    Mike

    Edit: Hmmm, nevermind. I guess if I want an LoL I can just do

    my @diff2; while(@diff) { push @diff2,[splice @diff,0,5]; }
      Hmmm, nevermind.

      Actually, I agree with your LoL remark. Returning a flat list seems too low level. Plus, if for any reason a new field needs to be added, so the number is no longer 5, it would break all code using it.

      Modified, the code could look like:

      my $diff= easy( \@a, \@b ); foreach ( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= @$_; ... # rest of code snipped because unaltered }

      or

      my $diff= easy( \@a, \@b ); while (my $row = shift @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= @$row; ... # rest of code snipped because unaltered }

      Never mind the extra overhead (if any)... No?

        Makes sense. You're right that "5" hardwired everywhere could cause trouble...

        Oh, and I like your foreach better than the shift approach. It's probably faster, and it's what foreach is for. :)
        --
        Mike

Re: Algorithm::Diff icult to use
by Aristotle (Chancellor) on Aug 21, 2002 at 19:18 UTC
    I strongly second the LoL notion. Also, I think it would be nice if each subarray contained $same, $aMin, $aMax, $aLen, $bMin, $bMax, $bLen seeing as $aLen = $aMax - $aMin + 1, which is clumsy if you need it often. Also, it might useful if $same was rather $diff, which would be 0 if no differences, 1 if $bLen == 0, 2 if $aLen == 0 and 3 if we have two different line runs. A (\@\@) prototype on the function might be nice; I'm not sure how often one will want to pass anonymous arrays constructed on the fly into it.

    Makeshifts last the longest.

      Thanks again to everyone for the feedback.

      First, the easy one. I'm not adding a prototype. The existing routines don't have prototypes and I only use prototypes in very specific situations and this isn't one of those.

      While a LoL isn't very deeply nested, it does open up the possiblity for creating a very large number of very small anonymous arrays. Each array adds a chunk of overhead. For smalls arrays, this overhead becomes a significant percentage. So I like splitting up the return value some but would make it optional.

      As for adding more fields for convenience, I'm reluctant for two reasons. The first is the same as above; it reduces the size of problem you can address with the same resources. The second is that it makes it harder to remember which index gives you back which bit of information.

      In fact, this has made me realize that much of what I was returning is redundant. I could reduce it from 5 down to just 2 like so:

      # @a @b @a @b # same? range range values values ( 0, 0=>0, 0=>-1, # a ! 1, 1=>2, 0=> 1, # b c = b c 0, 3=>2, 2=> 2, # ! d 1, 3=>3, 3=> 3, # e = e 0, 4=>4, 4=> 4, # f ! h 1, 5=>5, 5=> 5, # j = j 0, 6=>5, 6=> 6, # ! k 1, 6=>7, 7=> 8, # l m = l m 0, 8=>9, 9=>11, # n p ! r s t ); # @a @b @a @b # start start values values ( 0, 0, # = 0, 0, # a ! 1, 0, # b c = b c 3, 2, # ! d 3, 3, # e = e 4, 4, # f ! h 5, 5, # j = j 6, 6, # ! k 6, 7, # l m = l m 8, 9, # n p ! r s t 10, 12, # );

      by just making the loop logic a bit more complex like so:

      # old loop code ######################################## my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; # ... } # new loop code ######################################## my $diff= easy( \@a, \@b ); my $same= 0; while( @$diff ) { my( $aMin, $bMin, $aMax, $bMax )= @$diff; --$aMax; --$bMax; $same= 1 - $same; splice @$diff, 0, 2; # ... }

      which gives me a neat idea!

      I'll store the data very efficiently (a single list with 2 entries per 'hunk') but provide routines for traversing the list in a convenient manner (which also solves my 'naming' problem).

      my $diff= Algorithm::Diff->new( \@a, \@b ); $diff->base( 1 ); # Return line numbers, not indices while( $diff->next() ) { next if $diff->same(); my $sep= ''; if( ! $diff->bRange() ) { sprintf "%d,%dd%d\n", $diff->get(qw( aMin aMax bMax )); } elsif( ! $diff->aRange() ) { sprint "%da%d,%d\n", $diff->get(qw( aMax bMin bMax )); } else { $sep= "---\n"; sprintf "%d,%dc%d%d\n", $diff->get(qw( aMin aMax bMin bMax )); } print "< $_" for $diff->aRange(); print $sep; print "> $_" for $diff->bRange(); }

      Note how $diff->aRange() returns the lines falling into that range but, in a scalar context, returns the number of lines in that range (the $aLen you wanted) which also happens to be "false" only if there are no lines in that range (like the bits of $diff you asked for).

      Other methods would include

      • $diff->prev() to move along the chunks in the other direction
      • $diff->end() to reset the iterator prematurely so subsequently next() would go to the top while prev() would got to the bottom.
      • $diff->aMin() would be the same as $diff->get("aMin"), etc.
      Takes about 60% less RAM but is much more flexible. Thanks!

      Note, in case you plan to complain about me optimizing for RAM size by resorting to OO methods, which run a bit slower than regular subroutines (or direct manipulation of a data structure). It is a good trade-off, IMO, because if your problem requires 30% more RAM than what you have available, then either it just fails or it runs really, really slowly. While, it running 10% slower is a minor problem in comparison.

              - tye (but my friends call me "Tye")
        # old loop code ######################################## my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; # ... } # new loop code ######################################## my $diff= easy( \@a, \@b ); my $same= 0; while( @$diff ) { my( $aMin, $bMin, $aMax, $bMax )= @$diff; # ... splice @$diff, 0, 2; # ... }

        Ouch! Precisely what I wanted to avoid: a modification in the number of items. Well, you could export a constant for this number, whether it be 2 or 5.

        BTW if memory efficiency is what you're worried about, why don't you just use 1 scalar for each item? After all, these are all just integers, so messing with pack()/unpack() or vec() would allow you to combine them, in one scalar. Only an 8 (or 20) byte string per item! The joy! The savings! (The pain!)

        You are right that the changes in the structure would have cost more overhead, but I was proposing them for the convenience and readability they'd provide. I was going to interject some points in that vein as I was reading; until you presented the OO interface. That's perfect as far as I'm concerned.

        Makeshifts last the longest.

Re: Algorithm::Diff icult to use
by VSarkiss (Monsignor) on Aug 21, 2002 at 21:13 UTC

    My vote is to keep the flat list format you already have. In my past experience with Algorithm::Diff, I've found the output from diff a little too deeply nested to be, er, easy to use. If anyone prefers that format, they can use the existing interface.

    I keep thinking it would be nice to show empty ranges with (undef, undef), but I can't think of a situation, other than eyeballing the data, where $aMin < $aMax wouldn't also work.

    It took me a few seconds to realize what you were doing with @diff[2-5, 4-5]. Would it be easier to build the list backwards (with unshift), then reverse it before returning? (OT: Actually, I liked the 2-5 trick. I think I'll start using it myself....)

    These are just minor nitpicks. Overall, this is a very nice piece of work. I'll probably start using it even before the patch hits CPAN. As for a name, I suggest context, because it reminds me of diff -c.

      I've found the output from diff a little too deeply nested
      You can hardly call a LoL deeply nested though. The flat list format also requires monkeying with splice, which is definitely not preferrable IMHO.

      Makeshifts last the longest.