Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Re^2: Search for identical substrings

by TilRMan (Friar)
on Aug 21, 2005 at 05:57 UTC ( #485490=note: print w/ replies, xml ) Need Help??


in reply to Re: Search for identical substrings
in thread Search for identical substrings

The longest common substring algorithm on that page is (m * n) time but requires (m * n) space as well. Contrast to the naive solution, which is (m * m * n) time but (m + n) memory.

That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.

So, the name of this site notwithstanding, here's some C code, poorly tested

/* Longest Common Substring * * C Implementation of the algorithm found at: * http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html */ /* This section is reproduced from the URL above. * * California law forbids anyone from making use of class * lecture notes for commercial purposes, and such activity * is expressly forbidden without the instructor's written * permission. * Longest common substring problem Given two sequences of letters, such as A = HELLO and B = ALOHA, find the longest contiguous sequence appearing in both. One solution: (assume strings have lengths m and n) For each of the m starting points of A, check for the longest common string starting at each of the n starting points of B. The checks could average (m) time a total of (m2n) time. Dynamic programming solution: Let Li, j = maximum length of common strings that end at A[i] & B[j]. Then, A[i] = B[j] Li, j = 1 + Li-1, j-1 A[i] B[j] Li, j = 0 LONGEST COMMON SUBSTRING(A,m,B,n) for i := 0 to m do Li,0 := 0 for j := 0 to n do L0,j := 0 len := 0 answer := <0,0> for i := 1 to m do for j := 1 to n do if Ai Bj then Li,j := 0 else Li,j := 1 + Li-1,j-1 if Li,j > len then len := Li,j answer = <i,j> Example: A L O H A H 0 0 0 1 0 E 0 0 0 0 0 L 0 1 0 0 0 L 0 1 0 0 0 O 0 0 2 0 0 * */ /* A type big enough to hold the string length. */ typedef long LCS_t; /* A type that can hold any element of the string. */ typedef char LCS_c; /* A should be a pointer to the beginning of the string. * m should be the index of the last character in the * string, plus one. * Likewise for B and n. * scratch must be an array of at least size m * n. */ LCS_t longest_common_substring ( LCS_c *A, LCS_t m, LCS_c *B, LCS_t n, LCS_t *scratch, LCS_t *answer_x, LCS_t *answer_y ) { LCS_t i, j; LCS_t len = 0; *answer_x = *answer_y = 0; #define L(x, y) scratch[(x) * n + (y)] for (i = 0; i < m; i++) L(i, 0) = 0; for (j = 0; j < n; j++) L(0, j) = 0; for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { if (A[i] != B[j]) { L(i, j) = 0; } else { L(i, j) = 1 + L(i - 1, j - 1); if (L(i, j) > len) { len = L(i, j); *answer_x = i; *answer_y = j; } } } } #undef L return len; } #include <malloc.h> #include <stdio.h> #include <string.h> int main(int argc, char **argv) { if (argc != 3) return 2; char *A = argv[1]; char *B = argv[2]; long m = strlen(A); long n = strlen(B); long *scratch = (long *) malloc(m * n * sizeof(long)); if (!scratch) return 1; long answer_x, answer_y; long len; len = longest_common_substring(A, m, B, n, scratch, &answer_x, &answer_y); free(scratch); printf("LCS is %ld chars ending at <%ld,%ld>.\n", len, answer_x, answer_y); return 0; }

Maybe this would be a good time for me to learn Inline::C.


Comment on Re^2: Search for identical substrings
Download Code
Re^3: Search for identical substrings
by BrowserUk (Pope) on Aug 21, 2005 at 08:12 UTC

    Here's my Inline C implementation.

    #! perl -slw use strict; #use Inline 'INFO'; use Inline C => 'DATA', NAME => 'LCS', CLEAN_AFTER_BUILD => 1; my( $len, $offset0, $offset1 ) = LCS( @ARGV ); $ARGV[ 0 ] =~ s[(.{$offset0})(.{$len})][$1<$2>]; $ARGV[ 1 ] =~ s[(.{$offset1})(.{$len})][$1<$2>]; print for @ARGV; __END__ [ 9:10:28.57] P:\test>DynLCS-C hello aloha hel<lo> a<lo>ha __C__ #define IDX( x, y ) (((y) * an)+(x)) /* LONGEST COMMON SUBSTRING(A,m,B,n) for i := 0 to m do Li,0 := 0 for j := 0 to n do L0,j := 0 len := 0 answer := <0,0> for i := 1 to m do for j := 1 to n do if Ai ? Bj then Li,j := 0 else Li,j := 1 + Li-1,j-1 if Li,j > len then len := Li,j answer = <i,j> */ void LCS ( char* a, char*b ) { Inline_Stack_Vars; int an = strlen( a ); int bn = strlen( b ); int*L; int len = 0; int answer[2] = { 0,0 }; int i, j; Newz( 42, L, an * bn, int ); for( i = 1; i < an; i++ ) { for( j = 1; j < bn; j++ ) { if( a[ i ] != b[ j ] ) { L[ IDX(i,j) ] = 0; } else { L[ IDX(i,j) ] = 1 + L[ IDX(i-1, j-1) ]; if( L[ IDX(i,j) ] > len ) { // xs(70) len = L[ IDX(i,j) ]; answer[ 0 ] = i; answer[ 1 ] = j; } } } } Safefree( L ); Inline_Stack_Reset; Inline_Stack_Push(sv_2mortal(newSViv( len ))); Inline_Stack_Push(sv_2mortal(newSViv( answer[ 0 ] - len + 1 ))); Inline_Stack_Push(sv_2mortal(newSViv( answer[ 1 ] - len + 1 ))); Inline_Stack_Done; }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://485490]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (8)
As of 2014-10-22 07:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (114 votes), past polls