Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

LCCS time complexity

by rkg (Hermit)
on Sep 09, 2003 at 13:20 UTC ( #290025=perlquestion: print w/replies, xml ) Need Help??
rkg has asked for the wisdom of the Perl Monks concerning the following question:

Hi --

I've been using String-LCSS and it is sloooooow on long inputs.

I haven't studied the algorithm or the implementation, but just using it suggests it is O(n^2) or worse.

Is this just the nature of the LCSS problem, or is there a better implementation around?

Are there any fast approximation algorithms for LCCS? That is, if I can settle for a pretty-long-but-not-necessarily-the-mathematically-longest common substring ("PLBNNTMLCSS" vs "LCCS"), are there good heuristics around?

I'm not looking to write a generalized heurisitic around this (tabu or simulated annealing or whatever) at this point.



P.S. I brought this up on Perlmonks before, but there's confusion between LCSS and LCS. Different problems. (At least I think they are. Maybe I am missing something and LCSS can be derived from LCS?)

Replies are listed 'Best First'.
Re: LCCS time complexity
by tachyon (Chancellor) on Sep 09, 2003 at 14:24 UTC

    Maybe Algotithm::Diff will perform better. It isnt C but always seems to perform its job well and fast. If you are trying to do this on gene sequences you probably want to do it in C - once you have found an efficient algoritm. It does LCS native and can be easily made to do LCSS. Yes LCS and LCSS are different. The lcss method returns all the common substrings, unsorted. Just iterate through the list and remeber the index of the longest (much more efficient than a sort). You would want a Schwartzian transform on the sort by length if you want to use the multiple results (but if you want more than one LCSS its kinda a contradiction in terms).

    NB: Edge case where two (L)CSS are same length second ignored in my simple loop.

    use Algorithm::Diff qw(LCS traverse_sequences); my @seq1 = split //, 'abcdefghijklmnopqrstuvwxyz'; my @seq2 = split //,'flubberabcdubberdofghijklm'; my $lcs = join '', LCS( \@seq1, \@seq2 ); print "LCS: $lcs\n"; my $lcss = lcss( \@seq1, \@seq2 ); print "All CSS: @$lcss\n"; my $index; my $length = 0; for( my $i = 0; $i < @$lcss; $i++ ) { next unless length($lcss->[$i])>$length; $index = $i; $length = length($lcss->[$i]); } print "LCSS: ", $lcss->[$index], "\n"; # optional sort method print "LCSS: ", (sort{length{$a}<=>length{$b}}@$lcss)[-1]; sub lcss { my ( $seq1, $seq2 ) = @_; my ( @match, $from_match ); my $i = 0; traverse_sequences( $seq1, $seq2, { MATCH => sub { $match[$i] .= $seq1->[$_[0]]; $from_match = 1 } +, DISCARD_A => sub { do{$i++; $from_match = 0} if $from_match }, DISCARD_B => sub { do{$i++; $from_match = 0} if $from_match }, }); return \@match; } __DATA__ LCS: abcdefghijklm All CSS: abcd e fghijklm LCSS: fghijklm LCSS: fghijklm




      This code runs so much faster my head spins. Wow. Thanks. rkg

        Well actually dominus wrote the module, bikenomad maintains it and merlyn brought it to my attention. Glad it works better. Might as well make it a module....

        package String::LCSS::Fast; use 5.006; use strict; use warnings; use Algorithm::Diff qw(traverse_sequences); require Exporter; use vars qw( @ISA @EXPORT_OK $VERSION ); our @ISA = qw(Exporter); @EXPORT_OK = qw( LCSS CSS CSS_Sorted ); $VERSION = '0.01'; sub _tokenize { [split //, $_[0]] } sub CSS { my $is_array = ref $_[0] eq 'ARRAY' ? 1 : 0; my $sort = $_[2]; my ( $seq1, $seq2, @match, $from_match ); my $i = 0; if ( $is_array ) { $seq1 = $_[0]; $seq2 = $_[1]; traverse_sequences( $seq1, $seq2, { MATCH => sub { push @{$match[$i]}, $seq1->[$_[0]]; $from_m +atch = 1 }, DISCARD_A => sub { do{$i++; $from_match = 0} if $from_matc +h }, DISCARD_B => sub { do{$i++; $from_match = 0} if $from_matc +h }, }); } else { $seq1 = _tokenize($_[0]); $seq2 = _tokenize($_[1]); traverse_sequences( $seq1, $seq2, { MATCH => sub { $match[$i] .= $seq1->[$_[0]]; $from_match = + 1 }, DISCARD_A => sub { do{$i++; $from_match = 0} if $from_matc +h }, DISCARD_B => sub { do{$i++; $from_match = 0} if $from_matc +h }, }); } return \@match; } sub CSS_Sorted { my $match = CSS(@_); if ( ref $_[0] eq 'ARRAY' ) { @$match = map{$_->[0]}sort{$b->[1]<=>$a->[1]}map{[$_,scalar(@$_ +)]}@$match } else { @$match = map{$_->[0]}sort{$b->[1]<=>$a->[1]}map{[$_,length($_) +]}@$match } return $match; } sub LCSS { my $is_array = ref $_[0] eq 'ARRAY' ? 1 : 0; my $css = CSS(@_); my $index; my $length = 0; if ( $is_array ) { for( my $i = 0; $i < @$css; $i++ ) { next unless @{$css->[$i]}>$length; $index = $i; $length = @{$css->[$i]}; } } else { for( my $i = 0; $i < @$css; $i++ ) { next unless length($css->[$i])>$length; $index = $i; $length = length($css->[$i]); } } return $css->[$index]; } 1; __END__ =head1 NAME String::LCSS::Fast - Perl extension for getting the Longest Common Sub +String =head1 SYNOPSIS use String::LCSS::Fast qw( LCSS CSS CSS_Sorted ); =head1 DESCRIPTION This module uses Algoritm::Diff to implement LCSS and is orders of mag +nitude faster than String::LCSS. =head1 METHODS =head2 LCSS Returns the longest common sub string. If there may be more than one a +nd it matters use CSS instead. my $lcss_ary_ref = LCSS( \@SEQ1, \@SEQ2 ); # ref to array my $lcss_string = LCSS( $STR1, $STR2 ); # string =head2 CSS Returns all the common sub strings, unsorted. my $css_ary_ref = CSS( \@SEQ1, \@SEQ2 ); # ref to array of arrays my $css_str_ref = CSS( $STR1, $STR2 ); # ref to array of string +s =head2 CSS_Sorted Returns all the common sub strings, sorted from longest to shortest my $css_ary_ref = CSS( \@SEQ1, \@SEQ2 ); # ref to array of arrays my $css_str_ref = CSS( $STR1, $STR2 ); # ref to array of string +s =head1 EXPORT None by default. =head1 AUTHOR Dr James Freeman <> =head1 SEE ALSO L<perl>. =cut




Re: LCCS time complexity
by Zaxo (Archbishop) on Sep 09, 2003 at 14:18 UTC

    A quick look at the source shows nested for loops three deep, with the outer two iterating the length of each string. My eyeball guess is that the time complexity is (N*M**2)/2. The coding uses C style loops, and lexical $a, $b are used as temporary variables. A prime example of C written in perl.

    After Compline,

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://290025]
Approved by broquaint
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (13)
As of 2017-09-19 18:13 GMT
Find Nodes?
    Voting Booth?
    During the recent solar eclipse, I:

    Results (226 votes). Check out past polls.