Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Re: Longest Common Subsequence

by Limbic~Region (Chancellor)
on May 16, 2006 at 19:35 UTC ( #549873=note: print w/replies, xml ) Need Help??

in reply to Longest Common Subsequence

I am providing a more comprehensive explanation of how my algorithm works in case it is of interest to someone that doesn't want to be bothered with reverse engineering the code. I have changed some of the variable names and added additional comments. I didn't start out with a clear understanding myself of what I was doing so as the code changed over time, variables took on different meanings but their names didn't always change. I hope this makes things much clearer.

The first thing I realized was that I needed to translate characters to positions to determine relatively if one character appeared after another. I used the index to correlate the strings with their character positions. This would turn a string like 'MOM' into

{M => [0, 2], O => [1]}

my @pos; for my $index (0 .. $#str) { my $line = $str[$index]; for my $offset (0 .. length($line) - 1) { my $char= substr($line, $offset, 1); push @{$pos[$index]{$char}}, $offset; } }

It doesn't do any good to know that 'O' follows 'M' unless it does so in every string. The next step then is to create a map of char to positions across strings. I can take advantage of the fact that LCS can't exceed the shortest string for an optimization. We will use the shortest string as our reference string.

my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;

We can also take advantage of the fact that if a character does not appear in every string, it can't possibly be part of the LCS. Ignoring those cases, we simply have to take each character in the shortest string in turn and ask what position(s) that character appears in all the strings. As long as "all the strings" are looked at in the same order, repeating this process will allow us to determine which characters appear after which other characters in all strings. For instance, 'O' in 'FLOP', 'MOM', and 'CODE' would look like:

[2, 1, 1]

Unfortunately, a character appearing more than once within a string needs special handling. If I take the first character from the shortest string 'M' and ask which position it appears in the string 'MOM', I get two answers. Algorithm::Loops by tye solves the problem for us quite nicely. We simply allow each string to hold a variable number of positions for each character and then generate all the different combinations. So for 'O' in 'MOM', 'COOL', 'FLOP', we end up with:

[[1], [1, 2], [2]]
Which expands to
[1, 1, 2] [1, 2, 2]

Since the result of this process will be a list of array refs that represent characters, we need a way to convert them back to characters once we have found our solution. I stringified the array reference and made it a hash key with the character as a value, but any number of solutions are possible.

my (%lookup, @mapping); CHAR: for my $char (split //, $sh_str) { my @chr_pos_all_strs; for (0 .. $#pos) { # Ignore chars not in every string next CHAR if ! $pos[$_]{$char}; # All slots to be variable number of positions push @chr_pos_all_strs, $pos[$_]{$char}; } my $next = NestedLoops([@chr_pos_all_strs]); # Generate all combinations while (my @char_map = $next->()) { # stringified reference to convert back to char later my $ref = [@char_map]; $lookup{$ref} = $char; push @mapping, $ref; } }

Currently, we have a mapping of each character's (from shortest str) position across all strings and a lookup table to convert the mapping back to a character. We need to divide this list of mappings into piles. The first item in each pile is the anchor position and every other item is all mappings that are greater than the anchor position. Here, greater is defined as each position in the mapping being greater than the corresponding position in the anchor mapping. For instance:

[1, 2, 3] is greater than [0, 0, 1] [9, 1, 5] is NOT greater than [3, 2, 1] because 1 < 2
It is important to note that the pile is not ordered. All you can conclude from the piles is that each mapping not in the anchor position is greater than the anchor position.

A naive approach (N^2) to this would be to loop through every mapping and then loop through every mapping again looking for mappings that are greater. A smarter approach ((N^2 - N)/ 2) is to loop through the mappings but only check the mappings that come after it. To make this work, we need to allow for the possibility that the first mapping is not greater than the second but it is less than (all values are less than corresponding value). This allows us to only make 1 comparison for the pair and know which is greater than which.

You will note in the implementation that the anchor is a hash key and the value is the rest of the pile. I will ignore for a second the fact that I am keeping track of how many items are greater than the anchor but it will come into play later.

my %greater; for my $i (0 .. $#chr_pos_all_strs - 1) { for my $j ($i + 1 .. $#chr_pos_all_strs) { # Skip pairs not 100% > or < each other my $gt = is_greater(@chr_pos_all_strs[$i, $j]) or next; # Always order the pairs correctly my ($lg, $sm) = $gt == 1 ? ($i, $j) : ($j, $i); # Keep track of the new pile count for the anchor ++$greater{$chr_pos_all_strs[$sm]}[CNT]; # Add the mapping to the rest of the pile push @{$greater{$chr_pos_all_strs[$sm]}[NODE]}, "$chr_pos_all_ +strs[$lg]"; } } sub is_greater { my ($ref1, $ref2) = @_; # Determine if the first value is greater or smaller # Return false if they are equal (can't be > or <) my $cmp = $ref1->[0] <=> $ref2->[0] or return; # For the remaining values, verify that each position # is the same as the first (> or <) # Return false otherwise ($ref1->[$_] <=> $ref2->[$_]) == $cmp || return for 1 .. $#$ref1; # Return which was greater return $cmp; }

Picking a pile at random, the key (anchor or root) represents a starting character and remainder of the pile (or leafs) represents characters that appear after that character in every string. If we treat each leaf as its own anchor (or root), we can find which branches can be made. In other words, it is possible to convert our piles into a trees where the longest path from a root to the leaf furthest away represents the longest common substring.

It turns out we don't have to build the tree itself. If we keep track of the path, depth, and last leaf as we go - we can convert our tree into a work queue. By keeping track of the maximum depth (or distance) we have obtained, once we have emptied our work queue we are done.

# A max depth watermark and a path representing that depth my ($max, $path) = (0, ''); # Work queue # 0 => path, 1 => depth, 2 => last visited leaf my @work = map [$_, 1, $_], keys %greater;

As we take an item off the queue, we can easily determine if we have reached a previously un-attained depth and record it. We can now make use of the number of items greater than an anchor. By taking the current depth and adding the number of items greater than the current leaf, we can determine if following this branch will ever lead to a new depth. We then treat each of the leaves below this point as new items in our work queue and put them on.

while (@work) { my $item = pop @work; # Create lexicals to save a few dereferencing cycles my ($cur_depth, $route, $last_node) = @{$item}[DEPTH, PATH, LAST]; # Update our high water mark if appropriate ($max, $path) = ($cur_depth, $route) if $cur_depth > $max; # How many nodes are greater my $left = $greater{$last_node}[CNT]; # Finish if end of the path # Or if impossible to exceed current max depth next if ! $left || $cur_depth + $left <= $max; push @work, map ["$route:$_", $cur_depth + 1, $_], @{$greater{$las +t_node}[NODE]}; }

Now all we have to do is turn our colon delimited path back into characters. It is a good thing we kept that lookup table around.

my $hidden_msg = join '', map $lookup{$_}, split /:/, $path; return $hidden_msg;

Cheers - L~R

Replies are listed 'Best First'.
Re^2: Longest Common Subsequence
by Limbic~Region (Chancellor) on May 16, 2006 at 19:47 UTC

    If you understood the mental process I went through to create the Longest Common Subsequence, it should not be hard to see how to adapt it to the Longest Common Substring.

    Once we have a list of character positions across strings, we can create ordered piles where each mapping has a difference of 1 in each position with its adjacent neighbors. You can get rid of the need for looping to find mappings greater because you can predict the values they should contain. Then you just need to find the largest pile and that represents the LCS (S = substring).

    Arrays are replaced with hashes. Picking a hash key at random, you place it in a new pile and then predict the next mapping. If that key exists you place it on the same pile. You repeat this process until you have no more mappings. You then start back at the beginning of the pile and look for smaller items. Once no more items fit in that pile, you move on to the next hash key.

    Cheers - L~R

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://549873]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (2)
As of 2017-08-19 01:19 GMT
Find Nodes?
    Voting Booth?
    Who is your favorite scientist and why?

    Results (310 votes). Check out past polls.