string diff and compare

So I admit I've got some playing to do, as I've never used Algorithm::Diff, and String::Diff, but here's my current implementation an internal line diff for a text to html diff compare tool I've written.

I wanted to show inline differences. I have two strings that I *know* have some difference in them (thanks to a call to diff).

I put <span> </span> tags around the differences in a string pair. Both in the old string and the new string. (to be used in a 2 column html table showing diffs side by side).

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
  my $cmpline = <DATA>; # I always have a pair to compare
  my ($spanold, $spannew) = spandiffstr ($_, $cmpline);
  print "$spanold";
  print "$spannew";
}

sub spandiffstr {
  my ($old, $new) = @_;
  my $span = '<span>'; #eventually I'll pass this in as arg
  my $espan = '</span>'; #ditto
  my @cold = split //, $old;
  my @cnew = split //, $new;
  my $longer = (@cold > @new) ? $#cold : $#cnew;
  my $subscr;
  #find first diff in string from front.
  for (0..$longer) {
     if ($cold[$_] ne $cnew[$_]) {
        $subscr = $_;
        last;
     }
  }
  my $losubscr; #two subscripts because differ len of new and old
  my $lnsubscr; 
  @cold = reverse @cold;
  @cnew = reverse @cnew; #we're gonna search these backwards
  for (0..$longer) { 
    if ($cold[$_] ne $cnew[$_]) {
        $losubscr = @cold - $_;
        $lnsubscr = @cnew - $_;
        last;
    }
  }
  #got pertinent subscripts, now insert span tags
  substr ($old, $losubscr, 0) = $espan; 
  substr ($new, $lnsubscr, 0) = $espan;
  substr ($old, $subscr, 0) = $span;
  substr ($new, $subscr, 0) = $span;
  return ($old, $new); 
}
__DATA__ 
I love Perl.
I love Your Mom.
I love to eat spaghetti.
I love to eat spaghetti and meatballs.
I love to smell the flowers.
I hate to smell the flowers.
Stop! That's enough.
Not really enough.
Blood and Tears are expected.
Love and Marriage are exempted.
[download]

The output from this script looks like the follows.

I love <span>Perl</span>.
I love <span>Your Mom</span>.
I love to eat spaghetti<span></span>.
I love to eat spaghetti<span> and meatballs</span>.
I <span>lov</span>e to smell flowers.
I <span>hat</span>e to smell flowers.
<span>Stop! That's</span> enough.
<span>Not really</span> enough.
<span>Blood and Tears are expec</span>ted.
<span>Love and Marriage are exemp</span>ted.
[download]

I guess I'm curious if there's a better way to create spans on a string, and wanted to share this version that was kinda fun, and hear if folks would implement it more idiomatically.

I've been looking at the two modules cited above, it looks might there might be a better way to create multiple spans depending on say multiple changes in the lines, as opposed to the way I do it where I span only from the first and last differences, not taking into account internal matching withing the single string. So that for exampled the last two lines might instead produce multiple spans in a single string: LIke so...

<span>Blood</span> and <span>Te</span>ar<span>s</span> are ex<span>pec
+</span>ted.
<span>Love</span> and <span>M</span>ar<span>riage</span> are ex<span>e
+mp</span>ted.
[download]

Of course such an implementation may be a pain... so Then again, my current approach might be good enough. :)

As always I find all you experts and your input invaluable in helping me to think of algorithms and the language differently. Thanks ahead of time for any comments.

So how would you do it? :)

Comment on string diff and compare Select or Download Code

Replies are listed 'Best First'.
Re: string diff and compare by jwkrahn (Abbot) on May 04, 2012 at 08:47 UTC
you have an error at: `my @new = split //, $new; my $longer = (@cold > @new) ? $#cold : $#new;` [download] Which should be: `my @cnew = split //, $new; my $longer = (@cold > @cnew) ? $#cold : $#cnew;` [download] And at: `substr ($old, subscr, 0) = $span; substr ($new, subscr, 0) = $span;` [download] Which should be: `substr ($old, $subscr, 0) = $span; substr ($new, $subscr, 0) = $span;` [download] Also this: `my $losubscr; #two subscripts because differ len of new and old my $lnsubscr; @cold = reverse @cold; @cnew = reverse @cnew; #we're gonna search these backwards for (0..$longer) { if ($cold[$_] ne $cnew[$_]) { $losubscr = @cold - $_; $lnsubscr = @cnew - $_; last; } } #got pertinent subscripts, now insert span tags substr ($old, $losubscr, 0) = $espan; substr ($new, $lnsubscr, 0) = $espan; substr ($old, subscr, 0) = $span; substr ($new, subscr, 0) = $span;` [download] Could be shortened to: `my $lsubscr; @cold = reverse @cold; @cnew = reverse @cnew; #we're gonna search these backwards for (0..$longer) { if ($cold[$_] ne $cnew[$_]) { $lsubscr = -$_; last; } } #got pertinent subscripts, now insert span tags substr $old, $lsubscr, 0, $espan; substr $new, $lsubscr, 0, $espan; substr $old, $subscr, 0, $span; substr $new, $subscr, 0, $span;` [download] Another way to write that would be: while ( defined( my $first = <DATA> ) && defined( my $second = <DATA> +) ) { print spandiffstr( $first, $second ); } sub spandiffstr { my ( $old, $new ) = @_; my $span = '<span>'; my $espan = '</span>'; ( $old ^ $new ) =~ /^\0/ and my $subscr = $+[0]; my ( $rold, $rnew ) = map scalar reverse, $old, $new; ( $rold ^ $rnew ) =~ /^\0/ and my $lsubscr = -$+[0]; # got pertinent subscripts, now insert span tags substr $old, $lsubscr, 0, $espan; substr $new, $lsubscr, 0, $espan; substr $old, $subscr, 0, $span; substr $new, $subscr, 0, $span; return $old, $new; } [download]	[reply] [d/l] [select]
Re^2: string diff and compare by raybies (Chaplain) on May 04, 2012 at 15:36 UTC
So I've gone through your example now. I understand everything in it, except... `my ( $rold, $rnew ) = map scalar reverse, $old, $new;` What's up with the map scalar reverse syntax? What does the scalar do for example? I mean, I get that you're reversing two strings and putting them into two new strings, but I am really having difficulty parsing how Perl reads this to do what you've done here. Anyone care to explain it, decompress it (for us mortals)?	[reply] [d/l]
Re^3: string diff and compare by Anonymous Monk on May 04, 2012 at 15:54 UTC
From perlod ... reverse LIST In list context, returns a list value consisting of the elements of LIST in the opposite order. In scalar context, concatenates the elements of LIST and returns a string value with all characters in the opposite order. ... And, read map chain as: `map { scalar( reverse $_ ) } ( $old , $new )`.	[reply] [d/l]
Re^4: string diff and compare by Anonymous Monk on May 04, 2012 at 16:13 UTC
Re^3: string diff and compare by jwkrahn (Abbot) on May 04, 2012 at 21:27 UTC
`my ( $rold, $rnew ) = map scalar reverse, $old, $new;` [download] Is just a different way of saying: `my $rold = reverse $old; my $rnew = reverse $new;` [download] You need to use scalar because reverse is in list context in the map expression.	[reply] [d/l] [select]
Re^4: string diff and compare by tobyink (Canon) on May 05, 2012 at 07:17 UTC
Re^2: string diff and compare by raybies (Chaplain) on May 04, 2012 at 12:03 UTC
Thanks jw! I fixed the errors (my internet machine, and my Perl are on two different systems, and I copied it by hand...). The optimizations are really interesting (still pondering it in awe...), I didn't know about @+ til this example. That makes matching strings so much better than having to pull them appart into seperate character arrays. The trick of using negative subscripts is really cool too. Thanks for taking the time to respond, it's little things like this that makes Perl a pure expression of algorithmic joy... :-D	[reply]
Re^3: string diff and compare by jwkrahn (Abbot) on May 04, 2012 at 21:22 UTC
`I fixed the errors` [download] Except for this one: `my $longer = (@cold > @new) ? $#cold : $#new;` [download] Which should be: `my $longer = (@cold > @cnew) ? $#cold : $#cnew;` [download]	[reply] [d/l] [select]
Re^2: string diff and compare by raybies (Chaplain) on May 07, 2012 at 19:55 UTC
So, fwiw, I had to do the following to your code to get your method to work the way I wanted and cover all corner cases I've tested. (If the change spans to the very last character for example, then your first condition fails and you never assign $+[0] to $lsubscr ... So I had to do the following. `. . . ($rold ^ $rnew) =~ /^\0*/ and my $lsubscr = (-$+[0] + 1 && $+[0]) ? +-$+[0] + 1 : length $new; my $losubscr = ($lsubscr == length $new) ? length $old : $lsubscr; substr $old, $losubscr, 0, '</span>'; . . .` [download] This was to fix if the difference was the very last string char and one character is missing from another. Like in the case of data like: `All sentences end with what All sentences end with what?` [download] In this case the difference is -1, but because I had to add one to it, it was set to 0. But 0 is the beginning of the string, not the character just past the end of the string.	[reply] [d/l] [select]
Re^3: string diff and compare by raybies (Chaplain) on May 08, 2012 at 11:45 UTC
So for the sake of completeness, and posterity... and not because anyone cares, here's the final version that I decided fixed things up the way I wanted. Turns out by reordering the substr I was able to remove the space ambiguity I was seeing in one of my example data failures where I was getting nested span tags. `sub spandiffstr { my ( $old, $new ) = @_; my $span = '<span>'; my $espan = '</span>'; ( $old ^ $new ) =~ /^\0/ and my $subscr = $+[0]; substr $old, $subscr, 0, $span; substr $new, $subscr, 0, $span; my ( $rold, $rnew ) = map scalar reverse, $old, $new; ( $rold ^ $rnew ) =~ /^\0/ and my $lsubscr = -$+[0]; my $losubscr = $lsubscr; unless ($lsubscr) { $losubscr = length $old; $lsubscr = length $new; } substr $old, $losubscr, 0, $espan; substr $new, $lsubscr, 0, $espan; return $old, $new; }` [download] Thanks for all the fish!	[reply] [d/l]
Re: string diff and compare by raybies (Chaplain) on May 07, 2012 at 13:48 UTC
So I found an interesting failure of this approach when I was trying out this on more data. It occurs when I have the following sort of data. `__DATA__ I am me. I am not me.` [download] Essentially, I end up creating the following output. `I am<<span>/span> me. I am <span>not</span> me.` [download] The problem is due to the fact that the space before the not (in "am not me"), and the space after the not are not the same space, but the matching can't tell that and because you search from the back of the string first, and then the front of the string, yet both are fine with the space between "am me." so the offsets are off by one. fixing now... a quick solution is to arbitrarily add one to the negative index, but it's clearly imperfect, because when this isn't the case, it'll highlight (span) one extra matching character.	[reply] [d/l] [select]


Clear questions and runnable code get the best and fastest answer
	PerlMonks