Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

You don't show an example of your "plain text", but splitting on /\b/ is probably a very bad idea when using "diff". For one thing, it gives you a potentially huge number of items and the worst-case performance for "diff" is probably somewhere between O(N^2) and O(N^4) so a huge number of items means a really huge worst-case run time.

But, more fundamentally, splitting on /\b/ is going to suck most with "diff" because you end up spending a lot of CPU time trying to calculate the optimal way to line up occurrences of " " (a single space) with each other. "diff" will prefer to match up spaces twice rather than match "electroencephalographer" with "electroencephalographer". And, if your text is anything like most text, single space will be the most common element by far and so your "diff" will often end up being ruled by how the single spaces are arranged, thus losing any useful comparison value.

So, the easiest approach is to instead split on " " or /\s+/ and not care about changes in whitespace. That still may give you a large number of items and cause the comparison to take too long for large texts that are heavily modified.

But, if you can't get away with ignoring changes in whitespace or things are still too slow sometimes, then you can go with a tiered approach. That is, split the texts into fairly large chunks and "diff" the results. Then re-diff runs of "changed" chunks but split into smaller pieces this time (not having to re-diff the "unchanged", "inserted", nor "deleted" chunks). And refuse to "diff" too large of a chunk at too small of a granularity -- such "diff"s are usually of little value, since they mostly just point out unimportant similarities at the expense of being hugely more complex than just "replace this one huge chunk with this other huge chunk".

And even if you go with a tiered approach, I'd not "diff" even small chunks of text by splitting on /\b/. If changes in whitespace matter, I'd still do the equivalent of splitting on /\s+/ and "diff"ing and then just compare the whitespace separators for equality (when they are between lined-up identical words). For example, assuming you have lots of punctuation in your texts:

#!/usr/bin/perl -w use strict; require Algorithm::Diff; my $oldText= "was this a test?"; my @oldChunks= $oldText =~ /(\s+|\w+|[^\w\s]+)/g; my @oldIdxs= grep $oldChunks[$_] =~ /\S/, 0..$#oldChunks; my @oldWords= @oldChunks[@oldIdxs]; push @oldIdxs, 0+@oldChunks; my $newText= "this is a test"; my @newChunks= $newText =~ /(\s+|\w+|[^\w\s]+)/g; my @newIdxs= grep $newChunks[$_] =~ /\S/, 0..$#newChunks; my @newWords= @newChunks[@newIdxs]; push @newIdxs, 0+@newChunks; { my @old= @oldChunks[ 0 .. $oldIdxs[0]-1 ]; my @new= @newChunks[ 0 .. $newIdxs[0]-1 ]; Show( \@old, \@new ); } my $diff= Algorithm::Diff->new( \@oldWords, \@newWords ); while( $diff->Next() ) { if( ! $diff->Same() ) { my( @old, @new ); @old= @oldChunks[ $oldIdxs[$diff->Min(1)] .. $oldIdxs[$diff->Max(1)+1]-1 + ] if $diff->Range(1); @new= @newChunks[ $newIdxs[$diff->Min(2)] .. $newIdxs[$diff->Max(2)+1]-1 + ] if $diff->Range(2); Show( \@old, \@new ); } else { # This is the complex case, because the non-whitespace # parts are the same but the whitespace may differ: my $oldIdx= $diff->Min(1); for my $newIdx ( $diff->Range(2) ) { Show( [], [ $newChunks[$newIdxs[$newIdx]] ] ); my @old= @oldChunks[ $oldIdxs[$oldIdx]+1 .. $oldIdxs[$oldIdx+1]-1 ] +; my @new= @newChunks[ $newIdxs[$newIdx]+1 .. $newIdxs[$newIdx+1]-1 ] +; Show( \@old, \@new ); $oldIdx++; } } } sub Show { my $oldText= join '', @{shift @_}; my $newText= join '', @{shift @_}; print "<strike>$oldText</strike>" if '' ne $oldText && $oldText ne $newText; print $newText; }

- tye        

In reply to Re: Optimizing Algorithm::Diff (fewer, better) by tye
in thread Optimizing Algorithm::Diff by tomazos

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    Domain Nodelet?
    and the web crawler heard nothing...

    How do I use this?Last hourOther CB clients
    Other Users?
    Others cooling their heels in the Monastery: (5)
    As of 2024-07-19 09:09 GMT
    Find Nodes?
      Voting Booth?

      No recent polls found

      erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.