Beefy Boxes and Bandwidth Generously Provided by pair Networks vroom
go ahead... be a heretic
 
PerlMonks  

string diff and compare

by raybies (Chaplain)
on May 03, 2012 at 20:32 UTC ( #968807=perlmeditation: print w/ replies, xml ) Need Help??

So I admit I've got some playing to do, as I've never used Algorithm::Diff, and String::Diff, but here's my current implementation an internal line diff for a text to html diff compare tool I've written.

I wanted to show inline differences. I have two strings that I *know* have some difference in them (thanks to a call to diff).

I put  <span> </span> tags around the differences in a string pair. Both in the old string and the new string. (to be used in a 2 column html table showing diffs side by side).

#!/usr/bin/perl use strict; use warnings; while (<DATA>) { my $cmpline = <DATA>; # I always have a pair to compare my ($spanold, $spannew) = spandiffstr ($_, $cmpline); print "$spanold"; print "$spannew"; } sub spandiffstr { my ($old, $new) = @_; my $span = '<span>'; #eventually I'll pass this in as arg my $espan = '</span>'; #ditto my @cold = split //, $old; my @cnew = split //, $new; my $longer = (@cold > @new) ? $#cold : $#cnew; my $subscr; #find first diff in string from front. for (0..$longer) { if ($cold[$_] ne $cnew[$_]) { $subscr = $_; last; } } my $losubscr; #two subscripts because differ len of new and old my $lnsubscr; @cold = reverse @cold; @cnew = reverse @cnew; #we're gonna search these backwards for (0..$longer) { if ($cold[$_] ne $cnew[$_]) { $losubscr = @cold - $_; $lnsubscr = @cnew - $_; last; } } #got pertinent subscripts, now insert span tags substr ($old, $losubscr, 0) = $espan; substr ($new, $lnsubscr, 0) = $espan; substr ($old, $subscr, 0) = $span; substr ($new, $subscr, 0) = $span; return ($old, $new); } __DATA__ I love Perl. I love Your Mom. I love to eat spaghetti. I love to eat spaghetti and meatballs. I love to smell the flowers. I hate to smell the flowers. Stop! That's enough. Not really enough. Blood and Tears are expected. Love and Marriage are exempted.
The output from this script looks like the follows.
I love <span>Perl</span>. I love <span>Your Mom</span>. I love to eat spaghetti<span></span>. I love to eat spaghetti<span> and meatballs</span>. I <span>lov</span>e to smell flowers. I <span>hat</span>e to smell flowers. <span>Stop! That's</span> enough. <span>Not really</span> enough. <span>Blood and Tears are expec</span>ted. <span>Love and Marriage are exemp</span>ted.

I guess I'm curious if there's a better way to create spans on a string, and wanted to share this version that was kinda fun, and hear if folks would implement it more idiomatically.

I've been looking at the two modules cited above, it looks might there might be a better way to create multiple spans depending on say multiple changes in the lines, as opposed to the way I do it where I span only from the first and last differences, not taking into account internal matching withing the single string. So that for exampled the last two lines might instead produce multiple spans in a single string: LIke so...

<span>Blood</span> and <span>Te</span>ar<span>s</span> are ex<span>pec +</span>ted. <span>Love</span> and <span>M</span>ar<span>riage</span> are ex<span>e +mp</span>ted.

Of course such an implementation may be a pain... so Then again, my current approach might be good enough. :)

As always I find all you experts and your input invaluable in helping me to think of algorithms and the language differently. Thanks ahead of time for any comments.

So how would you do it? :)

Comment on string diff and compare
Select or Download Code
Re: string diff and compare
by jwkrahn (Monsignor) on May 04, 2012 at 08:47 UTC

    you have an error at:

    my @new = split //, $new; my $longer = (@cold > @new) ? $#cold : $#new;

    Which should be:

    my @cnew = split //, $new; my $longer = (@cold > @cnew) ? $#cold : $#cnew;

    And at:

    substr ($old, subscr, 0) = $span; substr ($new, subscr, 0) = $span;

    Which should be:

    substr ($old, $subscr, 0) = $span; substr ($new, $subscr, 0) = $span;


    Also this:

    my $losubscr; #two subscripts because differ len of new and old my $lnsubscr; @cold = reverse @cold; @cnew = reverse @cnew; #we're gonna search these backwards for (0..$longer) { if ($cold[$_] ne $cnew[$_]) { $losubscr = @cold - $_; $lnsubscr = @cnew - $_; last; } } #got pertinent subscripts, now insert span tags substr ($old, $losubscr, 0) = $espan; substr ($new, $lnsubscr, 0) = $espan; substr ($old, subscr, 0) = $span; substr ($new, subscr, 0) = $span;

    Could be shortened to:

    my $lsubscr; @cold = reverse @cold; @cnew = reverse @cnew; #we're gonna search these backwards for (0..$longer) { if ($cold[$_] ne $cnew[$_]) { $lsubscr = -$_; last; } } #got pertinent subscripts, now insert span tags substr $old, $lsubscr, 0, $espan; substr $new, $lsubscr, 0, $espan; substr $old, $subscr, 0, $span; substr $new, $subscr, 0, $span;


    Another way to write that would be:

    while ( defined( my $first = <DATA> ) && defined( my $second = <DATA> +) ) { print spandiffstr( $first, $second ); } sub spandiffstr { my ( $old, $new ) = @_; my $span = '<span>'; my $espan = '</span>'; ( $old ^ $new ) =~ /^\0*/ and my $subscr = $+[0]; my ( $rold, $rnew ) = map scalar reverse, $old, $new; ( $rold ^ $rnew ) =~ /^\0*/ and my $lsubscr = -$+[0]; # got pertinent subscripts, now insert span tags substr $old, $lsubscr, 0, $espan; substr $new, $lsubscr, 0, $espan; substr $old, $subscr, 0, $span; substr $new, $subscr, 0, $span; return $old, $new; }

      Thanks jw!

      I fixed the errors (my internet machine, and my Perl are on two different systems, and I copied it by hand...).

      The optimizations are really interesting (still pondering it in awe...), I didn't know about @+ til this example. That makes matching strings so much better than having to pull them appart into seperate character arrays. The trick of using negative subscripts is really cool too. Thanks for taking the time to respond, it's little things like this that makes Perl a pure expression of algorithmic joy... :-D

        I fixed the errors

        Except for this one:

        my $longer = (@cold > @new) ? $#cold : $#new;

        Which should be:

        my $longer = (@cold > @cnew) ? $#cold : $#cnew;

      So I've gone through your example now. I understand everything in it, except...

       my ( $rold, $rnew ) = map scalar reverse, $old, $new;

      What's up with the map scalar reverse syntax? What does the scalar do for example? I mean, I get that you're reversing two strings and putting them into two new strings, but I am really having difficulty parsing how Perl reads this to do what you've done here.

      Anyone care to explain it, decompress it (for us mortals)?

        From perlod ...

        reverse LIST
        In list context, returns a list value consisting of the elements of LIST in the opposite order. In scalar context, concatenates the elements of LIST and returns a string value with all characters in the opposite order.
        ...

        And, read map chain as: map { scalar( reverse $_ ) } ( $old , $new ).

        my ( $rold, $rnew ) = map scalar reverse, $old, $new;

        Is just a different way of saying:

        my $rold = reverse $old; my $rnew = reverse $new;

        You need to use scalar because reverse is in list context in the map expression.

      So, fwiw, I had to do the following to your code to get your method to work the way I wanted and cover all corner cases I've tested. (If the change spans to the very last character for example, then your first condition fails and you never assign $+[0] to $lsubscr ... So I had to do the following.
      . . . ($rold ^ $rnew) =~ /^\0*/ and my $lsubscr = (-$+[0] + 1 && $+[0]) ? +-$+[0] + 1 : length $new; my $losubscr = ($lsubscr == length $new) ? length $old : $lsubscr; substr $old, $losubscr, 0, '</span>'; . . .

      This was to fix if the difference was the very last string char and one character is missing from another. Like in the case of data like:

      All sentences end with what All sentences end with what?

      In this case the difference is -1, but because I had to add one to it, it was set to 0. But 0 is the beginning of the string, not the character just past the end of the string.

        So for the sake of completeness, and posterity... and not because anyone cares, here's the final version that I decided fixed things up the way I wanted. Turns out by reordering the substr I was able to remove the space ambiguity I was seeing in one of my example data failures where I was getting nested span tags.
        sub spandiffstr { my ( $old, $new ) = @_; my $span = '<span>'; my $espan = '</span>'; ( $old ^ $new ) =~ /^\0*/ and my $subscr = $+[0]; substr $old, $subscr, 0, $span; substr $new, $subscr, 0, $span; my ( $rold, $rnew ) = map scalar reverse, $old, $new; ( $rold ^ $rnew ) =~ /^\0*/ and my $lsubscr = -$+[0]; my $losubscr = $lsubscr; unless ($lsubscr) { $losubscr = length $old; $lsubscr = length $new; } substr $old, $losubscr, 0, $espan; substr $new, $lsubscr, 0, $espan; return $old, $new; }
        Thanks for all the fish!
Re: string diff and compare
by raybies (Chaplain) on May 07, 2012 at 13:48 UTC

    So I found an interesting failure of this approach when I was trying out this on more data. It occurs when I have the following sort of data.

    __DATA__ I am me. I am not me.

    Essentially, I end up creating the following output.

    I am<<span>/span> me. I am <span>not</span> me.

    The problem is due to the fact that the space before the not (in "am not me"), and the space after the not are not the same space, but the matching can't tell that and because you search from the back of the string first, and then the front of the string, yet both are fine with the space between "am me." so the offsets are off by one.

    fixing now... a quick solution is to arbitrarily add one to the negative index, but it's clearly imperfect, because when this isn't the case, it'll highlight (span) one extra matching character.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://968807]
Approved by ww
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (11)
As of 2014-04-20 17:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (485 votes), past polls