Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Longest Common Subsequence

by Limbic~Region (Chancellor)
on May 11, 2006 at 20:21 UTC ( #548827=perlmeditation: print w/ replies, xml ) Need Help??

All,
A few days ago, I posted a challenge to find a hidden message. The message can be revealed by determining the longest common subsequence of all the lines in the data. I was unable to find an implementation that handled more than two strings so I created my own. I originally imposed several constraints but was hoping someone would provide an efficient (non-brute force) method of handling the general case. The general case being:
  • Allows for more than 2 strings
  • Allows strings to be variable length
  • Allows chars to be repeated within a string
  • Allows chars to be absent from some strings
Here is that code that I ultimately came up with to solve the general case:
#!/usr/bin/perl use strict; use warnings; use Algorithm::Loops 'NestedLoops'; use List::Util 'reduce'; use constant NODE => 0; use constant PATH => 0; use constant CNT => 1; use constant DEPTH => 1; use constant LAST => 2; my @str = map {chomp; $_} <DATA>; print LCS(@str), "\n"; # Longest Common Subsequence in strings sub LCS{ my @str = @_; # Map pos of chars in each str my @pos; for my $i (0 .. $#str) { my $line = $str[$i]; for (0 .. length($line) - 1) { my $char= substr($line, $_, 1); push @{$pos[$i]{$char}}, $_; } } # Find the shortest string my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str; # Coord map & lookup of char across lines # Create permutations if char is duplicate # Skip any where char isn't in every line my (%lookup, @order); CHAR: for my $char (split //, $sh_str) { my @map; for (0 .. $#pos) { next CHAR if ! $pos[$_]{$char}; push @map, $pos[$_]{$char}; } my $next = NestedLoops([@map]); while (my @char_map = $next->()) { my $ref = [@char_map]; $lookup{$ref} = $char; push @order, $ref; } } # Predetermine which char mappings are greater than others my %greater; for my $i (0 .. $#order - 1) { for my $j ($i + 1 .. $#order) { my $gt = is_greater(@order[$i, $j]) or next; my ($lg, $sm) = $gt == 1 ? ($i, $j) : ($j, $i); ++$greater{$order[$sm]}[CNT]; push @{$greater{$order[$sm]}[NODE]}, "$order[$lg]"; } } # A max depth watermark and a path representing that depth my ($max, $path) = (0, ''); # Work queue my @work = map [$_, 1, $_], keys %greater; while (@work) { my $item = pop @work; my ($cur_depth, $route, $last_node) = @{$item}[DEPTH, PATH, LA +ST]; ($max, $path) = ($cur_depth, $route) if $cur_depth > $max; my $left = $greater{$last_node}[CNT]; next if ! $left || $cur_depth + $left <= $max; push @work, map ["$route:$_", $cur_depth + 1, $_], @{$greater{ +$last_node}[NODE]}; } my $hidden_msg = join '', map $lookup{$_}, split /:/, $path; return $hidden_msg; } # All vals in 1 ref > corresponding vals in other ref sub is_greater { my ($ref1, $ref2) = @_; my $cmp = $ref1->[0] <=> $ref2->[0] or return; ($ref1->[$_] <=> $ref2->[$_]) == $cmp || return for 1 .. $#$ref1; return $cmp; }

Do you know of a more efficient way to do this? Are there ways to improve this approach? Do you know how dynamic programming might help (Some implementations for 2 strings I found used DP)? Do you know how this is useful or how can the approach be adapted and applied to be useful?

Cheers - L~R

Comment on Longest Common Subsequence
Download Code
Re: Longest Common Subsequence
by diotalevi (Canon) on May 11, 2006 at 22:58 UTC
    Allows chars to be absent from some strings

    You're allowing a kind of error? How many errors are allowed?

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      diotalevi,
      I hate it when things seem clear in my head and then someone shows how painfully obvious that it wasn't clear.
      HOUSEBOAT COMPUTER DOUBT
      The letter D does not appear in all strings. In my challenge, I took advantage of the fact that every string contained the exact same 36 characters to take some shortcuts. All I meant by a general solution is one that doesn't have restrictions on the presence of every character in any one string appearing in every other string.

      Obviously any character that isn't present in all strings can't be part of the LCS but there is no limit as to how many can be missing.

      Cheers - L~R

Re: Longest Common Subsequence
by Limbic~Region (Chancellor) on May 12, 2006 at 13:17 UTC
    All,
    I am happily suprised with how efficient my attempt is. I whipped up a brute-force implementation and I am not sure when it will ever finish. The pure regex brute-force solution by diotalevi is much better than this but still takes several minutes. Wow.

    Cheers - L~R

      Takes several minutes on what data? I've been improving myne too. Here's the latest. It has three improvements over the completely dumb version. If there are some characters that are not common to all strings, it makes a character class to try only the common things. When skipping forward to find a backreference it uses a hack that works around not being able to use back references in character classes. After matching a backreference it makes an assertion about how many characters remain in the current line. I'm not sure if they have much effect on the success part of the match but I'm hoping they turn out to make the failure part faster. I also commented it. ;-)

      [Updated: Here's the final version I ended up with. It has two optional parts each of which contribute a bit of overhead but might allow some branches to be ignored. It also has timing data.]

      $_ = <<"..."; CPD6Z98SB2KQNWV0F7Y1IX4GLRA5MTOJHE3U CXZOL6SUI2WTJ30HF519YPGBRNAK48MQVD7E T8COSQU6I2FJN40DKL157WVGPYXARZ3MBHE9 KNCWVZDSR5420LP91FIQGB7Y3A6J8MOUXTEH XF9C4PSDY62TWJ0QBN17IKG3OH8ALVRM5UEZ D9QCHUSN7TW2YZL0O831FGXIR6JA4P5MVBKE ZC7ISQUPK6N20OLV4T31G9FRXBAWM5YJHED8 Z3C7SJVODL25TRQ01HPWGNKXB4UA68YMI9EF BC9OXDHS2FI5Z6U0TYL1VPGQK7ANR38MEWJ4 K4TCQBHS2ZV7FXU0P8R1YGDON3A6JILM9EW5 ... # $_ = <<"..."; # HOUSEBOAT # COMPUTER # DOUBT # ... use Benchmark; use constant { BACKREF_HACK => 1, LENGTH_HACK => 1 }; $| = 1; my $charclass = common_charclass(/([^\n]+)/); # use re 'debug'; # rx(8,$charclass); # exit; for ( my $len = 1;; ++$len ) { my $rx = rx( $len, $charclass ); my $t0 = Benchmark->new; my @found = /$rx/; my $t1 = Benchmark->new; print timestr( timediff( $t1, $t0 ) ) . " "; if (@found) { print join( '', @found ) . "\n"; } else { print "\n"; last; } } sub uniq { my %seen; return grep !$seen{$_}++, @_; } sub max { my $max = shift @_; $max < $_ and $max = $_ for @_; return $max; } sub common_charclass { my %seen; ++$seen{$_} for map @$_, map [ uniq( split // ) ], @_; my $max = max( values %seen ); my @no = delete @seen{ grep $max != $seen{$_}, keys %seen } or return undef; return do { my ($solo) = keys %seen; $solo } if 1 == scalar keys % +seen; return '[' . join( '', sort keys %seen ) . ']'; } sub rx { # 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) C # 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) CP # 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) CPM # 0 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU) CPME # 1 wallclock secs ( 0.42 usr + 0.00 sys = 0.42 CPU) CS201 # 3 wallclock secs ( 2.93 usr + 0.00 sys = 2.93 CPU) CS201G # 18 wallclock secs (16.28 usr + 0.00 sys = 16.28 CPU) CS201GA # 78 wallclock secs (72.17 usr + 0.02 sys = 72.19 CPU) CS201GAM # 277 wallclock secs (263.39 usr + 0.08 sys = 263.47 CPU) CS201GA +ME # 3466 wallclock secs (3260.91 usr + 0.98 sys = 3261.89 CPU) # # real 64m3.333s # user 60m16.157s # sys 0m1.088s my ( $len, $char ) = @_; $char = '\S' if not defined $char; my $pat = "\\A.*?" . ( "($char).*?" x $len ) . "\n"; # Start a new line $pat .= "(?>(?:"; # Make all my assertions about the content of the line $pat .= join '', map { my $capture_num = $_; # Skip past stuff that doesn't match $$_ my $seek = BACKREF_HACK ? "(?>(?:(?!\\$_).)*)" : ".*?"; # Find $$_ my $found_it = "\\$_"; # If I'm too close to the end of the line and don't have # enough characters to match, I'll assert that I need that # many and bail if there aren't enough. my $enough_left_at_end; if ( LENGTH_HACK ) { if ( $_ < $len ) { my $must_be_at_least_this_long = $len - $_; $enough_left_at_end = "(?=.{$must_be_at_least_this_lon +g})"; } else { $enough_left_at_end = ""; } } else { $enough_left_at_end = ''; } # Match *this* "$seek$found_it$enough_left_at_end"; } 1 .. $len; # Skip to the end of the line. $pat .= "(?>.*)\n"; # Repeat lines til I get to the end $pat .= ")+)\\z"; return qr/(?-s)$pat/; }

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

        diotalevi,
        Takes several minutes on what data?

        I was referring to your CB comments on how long it took to do the data presented in the puzzle. Your regex brute-force solution is extremely fast in comparison to my brute-force approach. Mine involves generating all possible subsequence of the shortest string in descending order according to length and testing each one on the remaining strings.

        I've been improving myne too.

        I am not sure if you meant that I was improving mine and you were improving yours too or if you meant that my characterization of yours was based off an old version. If the former, I haven't been improving my own. I am completely amazed at how fast it runs. I was only building a brute-force approach to verify that it was actually correct.

        I am still hoping someone more knowledgeable than myself can answer some or all of the questions I posed at the end of the meditation.

        Cheers - L~R

Re: Longest Common Subsequence
by TedPride (Priest) on May 12, 2006 at 20:30 UTC
    EDIT: I seem to have misunderstood the problem. Apparently the characters of the subsequence don't actually have to be adjacent, just in the same order for all strings? This is going to take somewhat more thought. Ignore my code below that doesn't solve the problem.

    ----------

    This is not exactly a new problem, and it's relatively easy to solve, at least for two strings. Just sort all the character positions for each string in alphabetical order, then walk through them linearly.

    use strict; use warnings; my ($s1, $s2); for (1..10000) { $s1 .= ('A'..'Z')[rand 26]; } for (1..10000) { $s2 .= ('A'..'Z')[rand 26]; } print longest($s1, $s2); sub longest { my ($s1, $s2) = @_; my (@s1, @s2, $p1, $p2, $c, $r, $max); @s1 = sortpos($s1); @s2 = sortpos($s2); $p1 = 0; $p2 = 0; $max = 0; while ($p1 <= $#s1 && $p2 <= $#s2) { $c = match($s1, $s1[$p1], $s2, $s2[$p2]); if ($c > $max) { $r = substr($s1, $s1[$p1], $c); $max = $c; } if ((substr($s1, $s1[$p1]) cmp substr($s2, $s2[$p2])) == -1) { $p1++; } else { $p2++; } } return $r; } sub match { my $p1 = $_[1]; my $p2 = $_[3]; my $max = length($_[0]) - $p1 < length($_[2]) - $p2 ? (length($_[0]) - $p1 - 1) : (length($_[2]) - $p2 - 1); my $c = 0; for (0..$max) { last if substr($_[0], ($p1 + $_), 1) ne substr($_[2], ($p2 + $ +_), 1); $c++; } return $c; } sub sortpos { my $s = $_[0]; return sort { substr($s, $a) cmp substr($s, $b) } 0..(length($s)-1 +); }
    Problem is, the code for sorting the positions uses substr, which means it's copying large sections of the strings on each compare (for a complexity of n lg^2 n instead of n lg n?). Perhaps someone can rewrite the code to use character positions, but for now, this should do for proof of concept.

    It's easier to make this sort of thing efficient in C or C++.

      TedPride,
      I seem to have misunderstood the problem. Apparently the characters of the subsequence don't actually have to be adjacent, just in the same order for all strings?

      Correct, the Wikipedia entries warn not to confuse Longest Common Subsequence with Longest Common Substring but it is easy to do. I will be posting a modified version of my subsequence algorithm that generates the substring answer (which really is just a more constrained form).

      Cheers - L~R

Re: Longest Common Subsequence
by thundergnat (Deacon) on May 13, 2006 at 13:33 UTC

    Why couldn't you do something like this? (There may be some subtle flaw I am missing, but it seems like it should work...)

    use strict; use warnings; use Algorithm::Diff qw/LCS/; my @s1 = split //, <DATA>; while (<DATA>) { chomp; next unless length; my @s2 = split //, $_; @s1 = LCS( \@s1, \@s2 ); } print @s1; __DATA__ CPD6Z98SB2KQNWV0F7Y1IX4GLRA5MTOJHE3U CXZOL6SUI2WTJ30HF519YPGBRNAK48MQVD7E T8COSQU6I2FJN40DKL157WVGPYXARZ3MBHE9 KNCWVZDSR5420LP91FIQGB7Y3A6J8MOUXTEH XF9C4PSDY62TWJ0QBN17IKG3OH8ALVRM5UEZ D9QCHUSN7TW2YZL0O831FGXIR6JA4P5MVBKE ZC7ISQUPK6N20OLV4T31G9FRXBAWM5YJHED8 Z3C7SJVODL25TRQ01HPWGNKXB4UA68YMI9EF BC9OXDHS2FI5Z6U0TYL1VPGQK7ANR38MEWJ4 K4TCQBHS2ZV7FXU0P8R1YGDON3A6JILM9EW5
      thundergnat,
      . o O (because it doesn't produce the correct answer?) 'CS21GAME' ne 'CS201GAME'

      Cheers - L~R

Re: Longest Common Subsequence
by TedPride (Priest) on May 14, 2006 at 09:26 UTC
    Here's my version of the specific case (written from scratch):
    use strict; use warnings; my (@first, @pos, %chain, $c, $max, $mstring); chomp($_ = <DATA>); @first = split //, $_; while (<DATA>) { chomp; my %hash; @_ = split //, $_; $hash{$_[$_]} = $_ for 0..$#_; push @pos, \%hash; } for $c (@first) { for (@pos) { push @{$chain{$c}}, $_->{$c}; } } $max = 0; $mstring = ''; find('', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0); print $mstring; sub find { my ($s, @p) = @_; my ($end, $cp, $c) = 1; CHAR: for $cp ($p[0]..35) { my @np = ($cp+1); $c = $first[$cp]; for (0..8) { next CHAR if $chain{$c}[$_] < $p[$_+1]; $np[$_+1] = $chain{$c}[$_]; } find($s.$c, @np); $end = 0; } if ($end && length($s) > $max) { $max = length($s); $mstring = $s; } } __DATA__ CPD6Z98SB2KQNWV0F7Y1IX4GLRA5MTOJHE3U CXZOL6SUI2WTJ30HF519YPGBRNAK48MQVD7E T8COSQU6I2FJN40DKL157WVGPYXARZ3MBHE9 KNCWVZDSR5420LP91FIQGB7Y3A6J8MOUXTEH XF9C4PSDY62TWJ0QBN17IKG3OH8ALVRM5UEZ D9QCHUSN7TW2YZL0O831FGXIR6JA4P5MVBKE ZC7ISQUPK6N20OLV4T31G9FRXBAWM5YJHED8 Z3C7SJVODL25TRQ01HPWGNKXB4UA68YMI9EF BC9OXDHS2FI5Z6U0TYL1VPGQK7ANR38MEWJ4 K4TCQBHS2ZV7FXU0P8R1YGDON3A6JILM9EW5
    This allows for any number of strings with only minor tweaking, but it does assume the conditions given in your original problem regarding all characters being present once and only once. Variable length strings won't be difficult, once the other conditions are met, but repeated chars will add a bit of complexity, and absent chars are a major pain if you're working with multiple strings. For instance, what if you have three strings:

    BCDE
    ACDE
    ABDE
    ABCD

    Looking at this, you might come to the conclusion that the sequence is supposed to be ABCDE. After all, each string is only missing one character, right? But how do you compute that automatically? If you base from string 1, you can pass over the missing B in string 2, the missing C in string 3, and the missing E in string 4. But you don't notice the missing A in string 1. You get the same problem basing from any other string - one character is always lost. I still can't figure out how to work around this problem, and if I can't figure it out on paper, I certainly can't program a solution.

Re: Longest Common Subsequence
by planetscape (Canon) on May 16, 2006 at 03:49 UTC

    Inspired by this writeup, I did a little digging, and found the following (very interesting sounding) article:

    Irving, Robert W., and Fraser, Campbell. "Two Algorithms for the Longest Common Subsequence of Three (or More) Strings." Lecture Notes In Computer Science; Vol. 644; Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching. London, UK: Springer-Verlag, 1992. Pages: 214 - 229

    On ACM
    Citations via CiteSeer

    Alas, I am not an ACM subscriber. Perhaps I'll get lucky using Inter-Library Loan. ;-)

    planetscape
Re: Longest Common Subsequence
by Limbic~Region (Chancellor) on May 16, 2006 at 19:35 UTC
    All,
    I am providing a more comprehensive explanation of how my algorithm works in case it is of interest to someone that doesn't want to be bothered with reverse engineering the code. I have changed some of the variable names and added additional comments. I didn't start out with a clear understanding myself of what I was doing so as the code changed over time, variables took on different meanings but their names didn't always change. I hope this makes things much clearer.

    Cheers - L~R

      If you understood the mental process I went through to create the Longest Common Subsequence, it should not be hard to see how to adapt it to the Longest Common Substring.

      Once we have a list of character positions across strings, we can create ordered piles where each mapping has a difference of 1 in each position with its adjacent neighbors. You can get rid of the need for looping to find mappings greater because you can predict the values they should contain. Then you just need to find the largest pile and that represents the LCS (S = substring).

      Arrays are replaced with hashes. Picking a hash key at random, you place it in a new pile and then predict the next mapping. If that key exists you place it on the same pile. You repeat this process until you have no more mappings. You then start back at the beginning of the pile and look for smaller items. Once no more items fit in that pile, you move on to the next hash key.

      Cheers - L~R

Re: Longest Common Subsequence
by Unanimous Monk (Sexton) on May 29, 2006 at 15:29 UTC
    Do you know how this is useful or how can the approach be adapted and applied to be useful?

    It might find a use in spam filtering. An email made it through my spam filter today, and the cause was randomly inserted letters into the key words that the spam filter looks for. I don't think this would necessarily solve that problem completely (there is still the issue of intensionally mispelling words, using different characters for different letters, etc), but it could be an improvement.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://548827]
Approved by kvale
Front-paged by diotalevi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2014-12-22 05:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (110 votes), past polls