Re: Longest Common Subsequence
by Limbic~Region (Chancellor)
on May 16, 2006 at 19:35 UTC
I am providing a more comprehensive explanation of how my algorithm works, in case it is of interest to anyone who doesn't want to reverse engineer the code. I have changed some of the variable names and added more comments. I didn't start out with a clear understanding of what I was doing, so as the code changed over time, variables took on different meanings but their names didn't always follow. I hope this makes things much clearer.
The first thing I realized was that I needed to translate characters to positions, so that I could tell whether one character appears after another. I used the index to correlate each string with its character positions. This would turn a string like 'MOM' into a map from each character to the position(s) at which it appears.
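A minimal sketch of that translation, assuming 0-based positions. The helper name `char_positions` is hypothetical and is not taken from the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: map each character of a string to the list of
# (0-based) positions at which it occurs.
sub char_positions {
    my ($str) = @_;
    my %pos;
    my $i = 0;
    push @{ $pos{$_} }, $i++ for split //, $str;
    return \%pos;
}

my $pos = char_positions('MOM');
print "$_ => [@{ $pos->{$_} }]\n" for sort keys %$pos;
# M => [0 2]
# O => [1]
```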
It doesn't do any good to know that 'O' follows 'M' unless it does so in every string. The next step, then, is to create a map of each character to its positions across all the strings. As an optimization, I can take advantage of the fact that the LCS can't be longer than the shortest string, so we will use the shortest string as our reference string.
We can also take advantage of the fact that if a character does not appear in every string, it can't possibly be part of the LCS. Ignoring those cases, we simply take each character in the shortest string in turn and ask at which position(s) that character appears in each of the strings. As long as the strings are always looked at in the same order, repeating this process lets us determine which characters appear after which other characters in all strings. For instance, 'O' in 'FLOP', 'MOM', and 'CODE' would look like:
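A rough sketch of that lookup, assuming 0-based positions and taking only the first occurrence in each string (repeated characters are a separate problem). `first_positions` is a hypothetical helper, not a name from the original code:

```perl
use strict;
use warnings;

# Strings are always visited in this same order (the order itself is arbitrary).
my @strings = ('FLOP', 'MOM', 'CODE');

# Hypothetical helper: first position of $char in every string, or
# nothing if any string lacks it (it then cannot be part of the LCS).
sub first_positions {
    my ($char, @strs) = @_;
    my @where;
    for my $str (@strs) {
        my $at = index $str, $char;
        return if $at < 0;
        push @where, $at;
    }
    return \@where;
}

my $o = first_positions('O', @strings);
print "O => [@$o]\n";                 # O => [2 1 1]

my $m = first_positions('M', @strings);
print "M is dropped\n" unless $m;     # 'M' is missing from 'FLOP' and 'CODE'
```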
Unfortunately, a character that appears more than once within a string needs special handling. If I take the first character of the shortest string, 'M', and ask at which position it appears in 'MOM', I get two answers. Algorithm::Loops by tye solves the problem for us quite nicely. We simply allow each string to hold a variable number of positions for each character and then generate all the different combinations. So for 'O' in 'MOM', 'COOL', 'FLOP', we end up with:
Which expands to
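As an illustration only, here is the expansion written as a hand-rolled cross product, so the sketch runs without any CPAN dependency. It stands in for the Algorithm::Loops call in the real code; the 0-based positions and the name `expand` are assumptions:

```perl
use strict;
use warnings;

# Positions of 'O' in 'MOM', 'COOL' and 'FLOP' (0-based, an assumption).
my @choices = ([1], [1, 2], [2]);

# Hypothetical helper: every way of picking one position per string.
sub expand {
    my @sets   = @_;
    my @combos = ([]);
    for my $set (@sets) {
        @combos = map { my $head = $_; map { [@$head, $_] } @$set } @combos;
    }
    return @combos;
}

my @mappings = expand(@choices);
print join(',', @$_), "\n" for @mappings;
# 1,1,2
# 1,2,2
```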
Since the result of this process will be a list of array refs that represent characters, we need a way to convert them back to characters once we have found our solution. I stringified the array reference and made it a hash key with the character as a value, but any number of solutions are possible.
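A small sketch of such a lookup table, using made-up mappings for 'O'; the names `%char_for` and `@mappings` are hypothetical:

```perl
use strict;
use warnings;

my %char_for;    # "ARRAY(0x...)" => character
my @mappings;

# Hypothetical data: two mappings for 'O', as array refs of positions.
for my $map ([1, 1, 2], [1, 2, 2]) {
    push @mappings, $map;
    $char_for{$map} = 'O';    # the reference stringifies when used as a key
}

print $char_for{ $mappings[1] }, "\n";    # O
```

This only works while the array refs themselves stay alive (here they are held in `@mappings`), since a stringified address is unique only among live references.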
At this point we have, for each character of the shortest string, a mapping of its position across all strings, plus a lookup table to convert a mapping back to a character. We need to divide this list of mappings into piles. The first item in each pile is the anchor, and the remaining items are all the mappings that are greater than the anchor. Here, greater means that each position in the mapping is greater than the corresponding position in the anchor mapping. For instance:
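To make "greater" concrete, here is a sketch with made-up mappings over three strings. The helper `all_greater` and the data are hypothetical:

```perl
use strict;
use warnings;

# True if every position in $x is strictly greater than the
# corresponding position in $y.
sub all_greater {
    my ($x, $y) = @_;
    for my $i (0 .. $#$x) {
        return 0 unless $x->[$i] > $y->[$i];
    }
    return 1;
}

my $anchor = [0, 1, 0];
my @others = ([1, 2, 2], [2, 0, 3], [3, 3, 1]);

# The pile for this anchor: everything that is greater than it.
my @pile = grep { all_greater($_, $anchor) } @others;
print "[@$_] is greater than the anchor [@$anchor]\n" for @pile;
# [1 2 2] is greater than the anchor [0 1 0]
# [3 3 1] is greater than the anchor [0 1 0]
```

Here [2, 0, 3] stays out of the pile because its second position (0) is not greater than the anchor's (1).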
It is important to note that the pile is not ordered. All you can conclude from the piles is that each mapping not in the anchor position is greater than the anchor position.
A naive approach (N^2 comparisons) would be to loop through every mapping and then loop through every mapping again, looking for mappings that are greater. A smarter approach ((N^2 - N) / 2 comparisons) is to loop through the mappings but only check each one against the mappings that come after it. To make this work, we need to allow for the possibility that the first mapping is not greater than the second but is instead less than it (every value is less than the corresponding value). This lets us make just one comparison per pair and still know which is greater than which.
You will note in the implementation that the anchor is a hash key and the value is the rest of the pile. I will ignore for a second the fact that I am keeping track of how many items are greater than the anchor but it will come into play later.
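A sketch of how the piles might be built with one comparison per pair, using made-up mappings. The names `compare_maps` and `%pile` are hypothetical, and in this sketch the count of items greater than an anchor is simply the length of its pile rather than a separately stored number:

```perl
use strict;
use warnings;

# Returns 1 if $x is greater than $y at every position, -1 if it is
# less at every position, and 0 otherwise (the pair is unordered).
sub compare_maps {
    my ($x, $y) = @_;
    my ($gt, $lt) = (0, 0);
    for my $i (0 .. $#$x) {
        $gt++ if $x->[$i] > $y->[$i];
        $lt++ if $x->[$i] < $y->[$i];
    }
    return  1 if $gt == @$x;
    return -1 if $lt == @$x;
    return  0;
}

my @maps = ([0, 1, 0], [1, 2, 2], [2, 0, 3], [3, 3, 1]);

my %pile;    # anchor (stringified ref) => [ mappings greater than it ]
$pile{$_} = [] for @maps;

# Each pair is compared exactly once: (N^2 - N) / 2 comparisons.
for my $i (0 .. $#maps - 1) {
    for my $j ($i + 1 .. $#maps) {
        my $cmp = compare_maps($maps[$i], $maps[$j]);
        push @{ $pile{ $maps[$j] } }, $maps[$i] if $cmp > 0;
        push @{ $pile{ $maps[$i] } }, $maps[$j] if $cmp < 0;
    }
}

for my $anchor (@maps) {
    my $count = @{ $pile{$anchor} };
    print "[@$anchor] has $count greater mapping(s)\n";
}
```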
Picking a pile at random, the key (anchor or root) represents a starting character and the remainder of the pile (the leaves) represents characters that appear after that character in every string. If we treat each leaf as its own anchor (or root), we can find which branches can be made. In other words, it is possible to convert our piles into trees, where the longest path from a root to the leaf furthest away represents the longest common subsequence.
It turns out we don't have to build the tree itself. If we keep track of the path, depth, and last leaf as we go, we can convert our tree into a work queue. If we also keep track of the maximum depth (or distance) reached so far, we are done once the work queue is empty.
As we take an item off the queue, we can easily determine whether we have reached a previously unattained depth and, if so, record it. We can now make use of the number of items greater than an anchor. By adding the current depth to the number of items greater than the current leaf, we get an upper bound on how deep this branch can go, which tells us whether following it could ever lead to a new depth. If it could, we treat each of the leaves below this point as new items and put them on our work queue.
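The queue processing might be sketched as follows. The piles are made up and keyed by short names instead of stringified refs, and the first-in, first-out order is an assumption; the original code may take items off the queue differently:

```perl
use strict;
use warnings;

# Hypothetical piles: each anchor lists the mappings greater than it.
my %pile = (
    a => [qw(b c d)],
    b => [qw(d)],
    c => [],
    d => [],
);

my ($max_depth, $best_path) = (0, '');

# Each work item is [path, depth, last leaf]; every anchor starts a path.
my @queue = map { [$_, 1, $_] } sort keys %pile;

while (@queue) {
    my ($path, $depth, $leaf) = @{ shift @queue };

    # Record a previously unattained depth.
    ($max_depth, $best_path) = ($depth, $path) if $depth > $max_depth;

    # Upper bound: this branch can reach at most depth + pile size.
    my @next = @{ $pile{$leaf} };
    next if $depth + @next <= $max_depth;

    push @queue, map { [join(':', $path, $_), $depth + 1, $_] } @next;
}

print "$best_path ($max_depth)\n";    # a:b:d (3)
```

The bound is safe because everything that can follow the current leaf must itself be greater than that leaf, so it is already counted in the leaf's pile.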
Now all we have to do is turn our colon-delimited path back into characters. It is a good thing we kept that lookup table around.
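A final sketch of that conversion, with a made-up lookup table keyed by short names (in the real code the keys would be the stringified array refs); `%char_for` and `$best_path` are hypothetical names:

```perl
use strict;
use warnings;

# Hypothetical lookup table and winning path.
my %char_for  = (a => 'C', b => 'O', d => 'D');
my $best_path = 'a:b:d';

# Split the path on colons and look each mapping up in the table.
my $lcs = join '', map { $char_for{$_} } split /:/, $best_path;
print "$lcs\n";    # COD
```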
Cheers - L~R