Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I've been using Algorithm::Diff a bit lately. I'm really glad we have this module and I like it in a lot of ways. But I've decided that it is okay for me to hate the interfaces provided.

Just some quick background. Algorithm::Diff takes two lists and returns what /usr/bin/diff would return if those were two lists of lines (often called "two text files"). You can get this 'difference' information in three different formats.

  1. a list containing the contents of the lines that are the same
  2. a call to one of three subroutines for each line
  3. a list of lists of lists of information about each line that is different

Format (1) isn't what I want (most of the time) because it contains copies of the contents of the lines and not the line numbers.

Format (2) is a problem mainly because it is not defined in what order the calls will be made when you have several lines in each file between two blocks of "same" lines. If you have qw( a b c 1 2 3 4 x y z ) vs qw( a b c 7 8 9 x y z ), you know you get 'a' first, then 'b', then 'c'. But then you don't know if you'll get 1 2 3 4 7 8 9 or 1 7 2 8 3 9 4 or 7 8 1 2 9 3 4 or ...

Another problem is that you don't get qw( a b c ) together. Depending on what you are doing, you'll likely have to save up 'a', then 'b', then 'c' (luckily this isn't too bad since these are all calls to the same subroutine). Then, you have to have the other two subroutines note, "Oh, we hit the end of that sequence, we can deal with those lines you saved up in that other subroutine now". I've written some "slick" code that does this (more than once), but I wouldn't call it easy to understand and simple changes in how you want to process things can require major code rework so it certainly isn't easy to maintain.

Format (3) is better in several ways. But it has the potential to make huge numbers of tiny lists of tiny lists (that consume a lot of memory). And these lists contain copies of the line contents (which consume more memory). And the data is split up too much and so is hard to deal with in some ways.

So what I want is a list of line number ranges that coorespond to the runs of 'different' or 'same' lines between the files. So I extended it to provide exactly that and I'm posting it for public review before I finish the patch and forward it to the module owner.

First, here is some rough documentation on what you get back. Since we already have an interface called diff() that I find difficult to use, I've (temporarily) called my interface easy() (yes, I have to think of a better name). If you take this example invocation (mostly taken straight from the module documentation):

my @a= qw( a b c e h j l m n p ); my @b= qw( b c d e f j k l m r s t ); my @diff= easy( \@a, \@b );

@diff will contain 5 elements for each range of either 'different' or 'same' lines between the two lists. The 5 elements are $same, $aMin, $aMax, $bMin, and $bMax. $same is 1 if the ranges represent lines that the files have in common and 0 if the ranges represent differences. The other 4 values are line numbers (well, they are array indices and so are zero-based while line numbers would usually be one-based). @a[$aMin..$aMax] will give the lines from @a that fall into that range while @b[$bMin..$bMax] gives the lines from @b.

Note that $aMin..$aMax is the empty set when $aMax < $aMin (specifically, we'll always use $aMax == $aMin-1) and so @a[$aMin..$aMax] will also be the empty set.

So, for our example, @diff contains (I've used "=>" in place of "," where I want you to think ".."; so the code below is valid Perl that produces the same list as @diff):

# @a @b @a @b # same? range range values values ( 0, 0=>0, 0=>-1, # a ! 1, 1=>2, 0=> 1, # b c = b c 0, 3=>2, 2=> 2, # ! d 1, 3=>3, 3=> 3, # e = e 0, 4=>4, 4=> 4, # f ! h 1, 5=>5, 5=> 5, # j = j 0, 6=>5, 6=> 6, # ! k 1, 6=>7, 7=> 8, # l m = l m 0, 8=>9, 9=>11, # n p ! r s t );

Now the hope is that this will be easy to use. So let's try to use it. First, an HTML difference. (I've switched from @diff to $diff for efficiency and to show that it doesn't complicate the code much.)

my @a= qw( a b c e h j l m n p ); my @b= qw( b c d e f j k l m r s t ); my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; my $a= join "", @a[$aMin..$aMax]; if( $same ) { print $a; } else { print "<s>$a</s>" if "" ne $a; my $b= join "", @b[$bMin..$bMax]; print "<b>$b</b>" if "" ne $b; } }

which produces

<s>a</s>bc<b>d</b>e<s>h</s><b>f</b>j<b>k</b>lm<s>np</s><b>rst</b>

or "a b c d e h f j k l m n p r s t" (with extra spaces and emphasis added for clarity).

Second, "unified" output from 'diff':

$_ .= $/ for @a, @b; my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; if( $same ) { for( @a[$aMin..$aMax] ) { print " $_"; } } else { for( @a[$aMin..$aMax] ) { print "-$_"; } for( @b[$bMin..$bMax] ) { print "+$_"; } } }

which produces

-a b c +d e -h +f j +k l m -n -p +r +s +t

Thirdly, the most complex, "traditional" output from 'diff':

my $diff= easy( \@a, \@b ); while( @$diff ) { my( $same, $aMin, $aMax, $bMin, $bMax )= splice @$diff, 0, 5; $_++ for $aMin, $aMax, $bMin, $bMax; next if $same; my $sep= ''; if( $bMax < $bMin ) { print "$aMin,$aMax","d$bMax\n"; } elsif( $aMax < $aMin ) { print "$aMax","a$bMin,$bMax\n"; } else { $sep= "---\n"; print "$aMin,$aMax","c$bMin,$bMax\n"; } for( @a[$aMin-1..$aMax-1] ) { print "< $_"; } print $sep; for( @b[$bMin-1..$bMax-1] ) { print "> $_"; } }

which produces

1,1d0 < a 3a3,3 > d 5,5c5,5 < h --- > f 6a7,7 > k 9,10c10,12 < n < p --- > r > s > t

Well, I think those blocks of code are pretty simple, so I hope that means I've hit on a useful format for the information.

Here is the subroutine that returns the new format:

sub easy { my $a= shift; # array ref my $b= shift; # array ref my @diff= ( 1, 0, -1, 0, -1 ); traverse_sequences( $a, $b, { MATCH => sub { my( $aEnd, $bEnd )= @_; $_++ for my( $aTop, $bTop )= @diff[ 2-5, 4-5 ]; if( $aTop < $aEnd || $bTop < $bEnd ) { push @diff, 0, $aTop, $aEnd-1, $bTop, $bEnd-1, 1, $aEnd, $aEnd, $bEnd, $bEnd; } else { @diff[ 2-5, 4-5 ]= ( $aEnd, $bEnd ); } }, }, @_ ); my( $aEnd, $bEnd )= ( 0+@$a, 0+@$b ); $_++ for my( $aTop, $bTop )= @diff[ 2-5, 4-5 ]; if( $aTop < $aEnd || $bTop < $bEnd ) { push @diff, 0, $aTop, $aEnd-1, $bTop, $bEnd-1; } splice @diff, 0, 5 if $diff[2] < 0 && $diff[4] < 0; return wantarray ? @diff : \@diff; }

Thanks for reviewing this and thanks in advance for any suggestions.

        - tye (but my friends call me "Tye")

In reply to Algorithm::Diff icult to use by tye

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (3)
As of 2024-03-19 05:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found