in reply to
Re: Search for identical substrings
in thread Search for identical substrings
The longest common substring algorithm on that page is (m * n) time but requires (m * n) space as well. Contrast to the naive solution, which is (m * m * n) time but (m + n) memory.
That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.
So, the name of this site notwithstanding, here's some C code, poorly tested
/* Longest Common Substring
*
* C Implementation of the algorithm found at:
* http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html
*/
/* This section is reproduced from the URL above.
*
* California law forbids anyone from making use of class
* lecture notes for commercial purposes, and such activity
* is expressly forbidden without the instructor's written
* permission.
*
Longest common substring problem
Given two sequences of letters, such as A = HELLO and
B = ALOHA, find the longest contiguous sequence appearing
in both.
One solution: (assume strings have lengths m and n)
For each of the m starting points of A, check for the
longest common string starting at each of the n starting
points of B.
The checks could average (m) time a total of (m2n) time.
Dynamic programming solution:
Let Li, j = maximum length of common strings that end at
A[i] & B[j].
Then,
A[i] = B[j] Li, j = 1 + Li1, j1
A[i] B[j] Li, j = 0
LONGEST COMMON SUBSTRING(A,m,B,n)
for i := 0 to m do Li,0 := 0
for j := 0 to n do L0,j := 0
len := 0
answer := <0,0>
for i := 1 to m do
for j := 1 to n do
if Ai Bj then
Li,j := 0
else
Li,j := 1 + Li1,j1
if Li,j > len then
len := Li,j
answer = <i,j>
Example:
A L O H A
H 0 0 0 1 0
E 0 0 0 0 0
L 0 1 0 0 0
L 0 1 0 0 0
O 0 0 2 0 0
*
*/
/* A type big enough to hold the string length. */
typedef long LCS_t;
/* A type that can hold any element of the string. */
typedef char LCS_c;
/* A should be a pointer to the beginning of the string.
* m should be the index of the last character in the
* string, plus one.
* Likewise for B and n.
* scratch must be an array of at least size m * n.
*/
LCS_t longest_common_substring
( LCS_c *A, LCS_t m,
LCS_c *B, LCS_t n,
LCS_t *scratch,
LCS_t *answer_x, LCS_t *answer_y
)
{
LCS_t i, j;
LCS_t len = 0;
*answer_x = *answer_y = 0;
#define L(x, y) scratch[(x) * n + (y)]
for (i = 0; i < m; i++) L(i, 0) = 0;
for (j = 0; j < n; j++) L(0, j) = 0;
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
if (A[i] != B[j])
{
L(i, j) = 0;
}
else
{
L(i, j) = 1 + L(i  1, j  1);
if (L(i, j) > len)
{
len = L(i, j);
*answer_x = i;
*answer_y = j;
}
}
}
}
#undef L
return len;
}
#include <malloc.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
if (argc != 3) return 2;
char *A = argv[1];
char *B = argv[2];
long m = strlen(A);
long n = strlen(B);
long *scratch = (long *) malloc(m * n * sizeof(long));
if (!scratch) return 1;
long answer_x, answer_y;
long len;
len = longest_common_substring(A, m, B, n, scratch,
&answer_x, &answer_y);
free(scratch);
printf("LCS is %ld chars ending at <%ld,%ld>.\n",
len, answer_x, answer_y);
return 0;
}
Maybe this would be a good time for me to learn Inline::C.
Re^3: Search for identical substrings by BrowserUk (Pope) on Aug 21, 2005 at 08:12 UTC 
#! perl slw
use strict;
#use Inline 'INFO';
use Inline C => 'DATA', NAME => 'LCS', CLEAN_AFTER_BUILD => 1;
my( $len, $offset0, $offset1 ) = LCS( @ARGV );
$ARGV[ 0 ] =~ s[(.{$offset0})(.{$len})][$1<$2>];
$ARGV[ 1 ] =~ s[(.{$offset1})(.{$len})][$1<$2>];
print for @ARGV;
__END__
[ 9:10:28.57] P:\test>DynLCSC hello aloha
hel<lo>
a<lo>ha
__C__
#define IDX( x, y ) (((y) * an)+(x))
/*
LONGEST COMMON SUBSTRING(A,m,B,n)
for i := 0 to m do Li,0 := 0
for j := 0 to n do L0,j := 0
len := 0
answer := <0,0>
for i := 1 to m do
for j := 1 to n do
if Ai ? Bj then
Li,j := 0
else
Li,j := 1 + Li1,j1
if Li,j > len then
len := Li,j
answer = <i,j>
*/
void LCS ( char* a, char*b ) {
Inline_Stack_Vars;
int an = strlen( a );
int bn = strlen( b );
int*L;
int len = 0;
int answer[2] = { 0,0 };
int i, j;
Newz( 42, L, an * bn, int );
for( i = 1; i < an; i++ ) {
for( j = 1; j < bn; j++ ) {
if( a[ i ] != b[ j ] ) {
L[ IDX(i,j) ] = 0;
}
else {
L[ IDX(i,j) ] = 1 + L[ IDX(i1, j1) ];
if( L[ IDX(i,j) ] > len ) { // xs(70)
len = L[ IDX(i,j) ];
answer[ 0 ] = i;
answer[ 1 ] = j;
}
}
}
}
Safefree( L );
Inline_Stack_Reset;
Inline_Stack_Push(sv_2mortal(newSViv( len )));
Inline_Stack_Push(sv_2mortal(newSViv( answer[ 0 ]  len + 1 )));
Inline_Stack_Push(sv_2mortal(newSViv( answer[ 1 ]  len + 1 )));
Inline_Stack_Done;
}
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.  Rule 1 has a caveat!  Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
 [reply] [d/l] 

