comment on

Following a question to the SeqFan mailing list, I'm trying to do some initial investigation of this problem:

Given n, across all the possible pairs of strings of length n, what is a(n), the maximum number of common subsequences of the same length as the longest common subsequence ('LCS')?

As an example, "aaab" and "aabb" have an LCS of length 3; there is only one distinct common subsequence, "aab", but that string can be extracted as a subsequence of "aaab" in 3 different ways, and of "aabb" in 2 different ways, giving 6 combinations: so a(4) >= 6.

Here's a routine to calculate the longest common subsequence, based on the algorithm described at the bottom of this page:

sub lcs {
  my($x, $y) = @_;
  my(@v, $cx, $cy, $left, $above);
  for my $xi (0 .. length($x) - 1) {
    $cx = substr $x, $xi, 1;
    for my $yi (0 .. length($y) - 1) {
      $cy = substr $y, $yi, 1;
      if ($cx eq $cy) {
        $v[$xi][$yi] = 1 + (($xi && $yi) ? $v[$xi - 1][$yi - 1] : 0);
      } else {
        $left = ($xi && $v[$xi - 1][$yi]) || 0;
        $above = ($xi && $v[$xi][$yi - 1]) || 0;
        $v[$xi][$yi] = ($left > $above) ? $left : $above; 
      }     
    }   
  } 
  return $v[length($x) - 1][length($y) - 1];
}
[download]

Now, counting the number of common subsequences of that length is tricky. The approach I've taken is to use Algorithm::Loops::NestedLoops to find all the distinct subsequences of the right length in the first string, and then count the number of ways that subsequence can be extracted from each of the two strings, multiply the results, and add to the sum. Here's "count the number of ways of matching a subsequence in a string":

sub matchss {
  my($ss, $str) = @_;
  my @state = (1, (0) x length($ss));
  my %index;
  unshift @{ $index{substr $ss, $_ - 1, 1} }, $_ for 1 .. length($ss);
  for (split //, $str) { 
    $state[$_] += $state[$_ - 1] for @{ $index{$_} || [] };
  } 
  pop @state;
}
[download]

This works by walking through the target string, counting the number of ways it could be matching up to any particular point in the subsequence as it goes.

Finally, the main subroutine which finds the distinct subsequences of the right length, and counts up the results:

sub lcscount {
  my($x, $y) = @_;
  my $n = lcs($x, $y) or return 1;
  my %seen;
  my $count = 0;
  my @x = split //, $x;
  NestedLoops(
    [   
      [ 0 .. $#x ],
      ( sub { [ $_ + 1 .. $#x ] } ) x ($n - 1), 
    ],  
    sub {
      my $ssx = join '', @x[@_];
      return if $seen{$ssx}++;
      $count += (matchss($ssx, $y) or return) * matchss($ssx, $x);
    },  
  );
  $count;
}
[download]

I originally had the $seen{$ssx}++ check in a separate OnlyWhen block, expecting that the second time it saw (say) "aa", the check there would stop A::L from generating all the "aaxyz..." subsequences, all of which would already have been catered for; rereading the docs however suggested that it wouldn't do this (is there any way to do this?) so I took it out.

As you may have guessed by now, this is my first attempt to use A::L and I don't by any means feel that I fully understand it, so any suggestions for improvements would be gratefully received.

Given that it is not aborting on repeated partial subsequences, the NestedLoops call is actually counting the same as matchss($ssx, $x), so the $count calculation could actually be moved to a later:

  $count += $seen{$_} * matchss($_, $y) for keys %seen;
[download]

.. but I'm holding out for a way to abort them, in which case I think the current approach would end up faster.

Hugo

In reply to Algorithm::Loops::NestedLoops and LCS sequence by hv

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


P is for Practical
	PerlMonks