good chemistry is complicated,
and a little bit messy -LW
You've probably used the Unix diff program, or one of its Win32 descendants, or some program that depends directly on it, such as patch, or the CVS source code maintenance system. All of these require calculating the smallest set of differences between two sequences (or equivalently, figuring out the longest subsequences that are the same). It's an intuitively simple problem, but surprisingly difficult to solve and implement correctly. That's what this module does, and it does it well.
The algorithm implemented in the module was originally published in a May 1977 Communications of the ACM article entitled "A Fast Algorithm for Computing Longest Common Subsequences" by J. W. Hunt and T. G. Szymanski. The Unix diff program was originally written by D. McIlroy and J. W. Hunt, who described their work in Bell Laboratories Technical Report 41 in 1976. This module was originally written by Mark-Jason Dominus, but the existing version is by Ned Konz. It's available on CPAN (alternate).
The difficulty of the algorithm precludes sitting down and just "grinding it out"; when you need it, this module's a life-saver. It provides three separate ways to get at the algorithm -- in other words, three interfaces to the core functionality. The documentation mentions these in order of increasing generality, but they're also slightly different in what they do. They are:
When would you use each one? You might consider using LCS if all you need is the longest match between the arrays. You could use diff to print diff-like output. For example, here's an extremely simple diff program (more for illustration than usefulness):
Finally, traverse_sequences is very handy when you already know what you need to do in response to a difference (say, delete an item).
All of these routines accept a "key generation function" as an optional final argument. This is needed because internally the routine uses numeric comparison (the <=> operator, in principle). But what if the values you have are not numeric? Or if you have a more complicated data structure, such as an array of hashes (such as object instances)? The key generation function should, given a reference to an object in your array, return a numeric value that can be used to meaningfully compare it to another element. That is, it should return some sort of hash or key that represents the element. If the elements are logically the same, it should return the same value (note that the ref operator would almost never be a good choice.)
To me, the key generation functions have always seemed unnatural. I'm more used to being able to supply the actual comparison function, in the style of sort or grep. But that may cause a significant performance hit -- I don't know.
Another seeming inflexibility is that the input arrays have to be in memory. It would be nice to be able to process files without having to slurp them, by providing routines which returned the next element for the given array. (But my guess is that in the worst case -- having no elements in common -- that would still end up reading both files completely into memory.)
The above two are minor nit-picks; the only real problem I have with this module is that the documentation is hard to understand. The first time I downloaded it from CPAN, it took a good deal of trial-and-error to figure out what it was really doing. But once I got the hang of it, it's been extremely useful. Hopefully this node will help in that direction.
Algorithm::Diff is fast and relatively flexible. It implements a difficult algorithm in a nice package. Take the time to read and understand the documentation, and it will serve you well.
Brother tye has written an instructive node discussing the interfaces for returning data in Algorithm::Diff icult to use. There he also shows code for a more simplified interface that may be incorporated into the module in the future. If you're trying to figure out how to use the module, that code may be of great help.