Your skill will accomplishwhat the force of many cannot PerlMonks

### Optimizing Algorithm::Diff

by tomazos (Deacon)
 on Aug 14, 2005 at 17:01 UTC Need Help??

tomazos has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use Algorithm::Diff to markup the differances between two plain text sequences (one old, one new) by striking through the deleted bits and highlighting the new bits.

Here is the code:

```[snip]
%        my @old_body = split(/\b/, \$old_body);
%        my @new_body = split(/\b/, \$new_body);
%        my \$diff = Algorithm::Diff->new(\@old_body, \@new_body);
%        while (\$diff->Next()) {
%            if (\$diff->Diff()) {
<span class="old_text">
%                foreach (\$diff->Items(1)) {
<% display_pre(\$_) %>
%                }
</span>
<span class="new_text">
%                foreach (\$diff->Items(2)) {
<% display_pre(\$_) %>
%                }
</span>
%            } else {
%                foreach (\$diff->Same()) {
<% display_pre(\$_) %>
%                }
%            }
%        }
[snip]

It seems to work really quickly sometimes and really slowly other times, which is odd.

I Devel::Dprofed it and got the following:

```%Time ExclSec CumulS #Calls sec/call Csec/c  Name
47.5   46.88 46.888 123881   0.0000 0.0000  Algorithm::Diff::_replace
+NextLarge
rWith
17.3   17.05 64.094      1   17.056 64.093  Algorithm::Diff::_longest
+CommonSub
5  sequence
0.31   0.304  0.304  16299   0.0000 0.0000  HTML::Mason::Request::pri
+nt
0.16   0.153 64.717      1   0.1528 64.716  HTML::Mason::Request::cal
+l_next
0.11   0.106  0.154   5418   0.0000 0.0000  HTML::Mason::Commands::BE
+GIN
0.09   0.086  0.101      1   0.0855 0.1010  Algorithm::Diff::_withPos
+itionsOfI
nInterval
0.07   0.070  0.070      3   0.0233 0.0233  HTML::Mason::Interp::appl
+y_escapes
0.06   0.063  0.063   8911   0.0000 0.0000  Algorithm::Diff::__ANON__
0.05   0.050  0.060      7   0.0072 0.0086  HTML::Mason::Request::com
+p
0.04   0.038  0.038   5394   0.0000 0.0000  HTML::Entities::encode_en
+tities
0.03   0.030 64.947      1   0.0301 64.947  HTML::Mason::ApacheHandle
+r::handle
1  r
0.03   0.030 64.124      1   0.0300 64.123  Algorithm::Diff::LCSidx
0.02   0.020  0.020      1   0.0200 0.0200  DBI::connect
0.01   0.010  0.010      9   0.0011 0.0011  vars::import
0.01   0.010  0.010      1   0.0100 0.0100  Apache::Session::DESTROY

Any ideas as to what is causing the lag?

-Andrew.

Andrew Tomazos  |  andrew@tomazos.com  |  www.tomazos.com

Replies are listed 'Best First'.
Re: Optimizing Algorithm::Diff
by TedPride (Priest) on Aug 14, 2005 at 21:24 UTC
I would guess that the lag is the difference between the best case scenario (1-1 match across both texts) and the worst case scenario (texts have repeating sequences that aren't exactly the same). Do you get the same processing time every time for the same two sections of text? If so, you'll just have to live with the lag - there really isn't a good way to make every Diff computation run quickly and accurately at the same time.
Re: Optimizing Algorithm::Diff (fewer, better)
by tye (Sage) on Aug 15, 2005 at 16:20 UTC

You don't show an example of your "plain text", but splitting on /\b/ is probably a very bad idea when using "diff". For one thing, it gives you a potentially huge number of items and the worst-case performance for "diff" is probably somewhere between O(N^2) and O(N^4) so a huge number of items means a really huge worst-case run time.

But, more fundamentally, splitting on /\b/ is going to suck most with "diff" because you end up spending a lot of CPU time trying to calculate the optimal way to line up occurrences of " " (a single space) with each other. "diff" will prefer to match up spaces twice rather than match "electroencephalographer" with "electroencephalographer". And, if your text is anything like most text, single space will be the most common element by far and so your "diff" will often end up being ruled by how the single spaces are arranged, thus losing any useful comparison value.

So, the easiest approach is to instead split on " " or /\s+/ and not care about changes in whitespace. That still may give you a large number of items and cause the comparison to take too long for large texts that are heavily modified.

But, if you can't get away with ignoring changes in whitespace or things are still too slow sometimes, then you can go with a tiered approach. That is, split the texts into fairly large chunks and "diff" the results. Then re-diff runs of "changed" chunks but split into smaller pieces this time (not having to re-diff the "unchanged", "inserted", nor "deleted" chunks). And refuse to "diff" too large of a chunk at too small of a granularity -- such "diff"s are usually of little value, since they mostly just point out unimportant similarities at the expense of being hugely more complex than just "replace this one huge chunk with this other huge chunk".

And even if you go with a tiered approach, I'd not "diff" even small chunks of text by splitting on /\b/. If changes in whitespace matter, I'd still do the equivalent of splitting on /\s+/ and "diff"ing and then just compare the whitespace separators for equality (when they are between lined-up identical words). For example, assuming you have lots of punctuation in your texts:

```#!/usr/bin/perl -w
use strict;
require Algorithm::Diff;

my @oldChunks= \$oldText =~ /(\s+|\w+|[^\w\s]+)/g;
my @oldIdxs= grep \$oldChunks[\$_] =~ /\S/, 0..\$#oldChunks;
my @oldWords= @oldChunks[@oldIdxs];
push @oldIdxs, 0+@oldChunks;

my \$newText= "this is a test";
my @newChunks= \$newText =~ /(\s+|\w+|[^\w\s]+)/g;
my @newIdxs= grep \$newChunks[\$_] =~ /\S/, 0..\$#newChunks;
my @newWords= @newChunks[@newIdxs];
push @newIdxs, 0+@newChunks;

{
my @old= @oldChunks[ 0 .. \$oldIdxs[0]-1 ];
my @new= @newChunks[ 0 .. \$newIdxs[0]-1 ];
Show( \@old, \@new );
}
my \$diff= Algorithm::Diff->new( \@oldWords, \@newWords );
while(  \$diff->Next()  ) {
if(  ! \$diff->Same()  ) {
my( @old, @new );
@old= @oldChunks[
\$oldIdxs[\$diff->Min(1)] .. \$oldIdxs[\$diff->Max(1)+1]-1
+ ]
if   \$diff->Range(1);
@new= @newChunks[
\$newIdxs[\$diff->Min(2)] .. \$newIdxs[\$diff->Max(2)+1]-1
+ ]
if   \$diff->Range(2);
Show( \@old, \@new );
} else {
# This is the complex case, because the non-whitespace
# parts are the same but the whitespace may differ:
my \$oldIdx= \$diff->Min(1);
for my \$newIdx (  \$diff->Range(2)  ) {
Show( [], [ \$newChunks[\$newIdxs[\$newIdx]] ] );
my @old= @oldChunks[
\$oldIdxs[\$oldIdx]+1 .. \$oldIdxs[\$oldIdx+1]-1 ]
+;
my @new= @newChunks[
\$newIdxs[\$newIdx]+1 .. \$newIdxs[\$newIdx+1]-1 ]
+;
Show( \@old, \@new );
\$oldIdx++;
}
}
}

sub Show
{
my \$oldText= join '', @{shift @_};
my \$newText= join '', @{shift @_};
print "<strike>\$oldText</strike>"
if  '' ne \$oldText  &&  \$oldText ne \$newText;
print \$newText;
}

- tye

Re: Optimizing Algorithm::Diff (XS)
by tye (Sage) on Aug 15, 2005 at 18:40 UTC

Note that you can use Algorithm::LCS to have the meat of the work done in XS (which should be quite a bit faster). I haven't used it yet and I probably should integrate it with Algorithm::Diff, so switching to it probably won't be a simple as it eventually will be.

- tye

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://483708]
Approved by Corion
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (5)
As of 2024-04-17 08:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found