Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Search for identical substrings

by bioMan (Beadle)
on Aug 17, 2005 at 20:53 UTC ( #484593=perlquestion: print w/ replies, xml ) Need Help??
bioMan has asked for the wisdom of the Perl Monks concerning the following question:

I need to iterate over all the members of an array to search for the longest matching substrings between the members of the array without looking for matches between any array element and itself. Initially I used nested for loops, but because of the length of my strings (3k characters), and the number of elements (300) leads to prohibitive times for my search. It took a week just to check one element of the array against every other element.

I decided to see if grep and map could yield a better solution. My best attempt is below.

@array = (3..9, 9, 7, 65, 4); $min = 1; # in my final program this will be 200 map {for $line($_+1..$#array){print ("\n$_\t$line\t", grep /(.{$min,}) +.*\1/, $array[$_].$array[$line])}} (0..$#array-1);

It gives me the index of the array elements containing the matches and the concatenated string containing the match. My original program gave me $1, but I can't see how to do that with the code above. Also this code returns the indices for pairs of array elements that do not contain a match (I'd love for them to disappear). Is there a better way to do this? I really don't want to have to learn C.

I realize no solution will be particularly fast, but a complete search in two weeks or less wouldn't bother me.

bioMan

Comment on Search for identical substrings
Download Code
Re: Search for identical substrings
by GrandFather (Cardinal) on Aug 17, 2005 at 21:00 UTC

    Can you provide a sample of the type of data you are actually using. You talk of characters, but show numbers in your sample. The nature of the data may make a big difference to how the problem can be solved.


    Perl is Huffman encoded by design.

      The data I am working with are DNA sequences. (I know of the Bioperl project but have found nothing there that could help.) I have distilled the data to the DNA sequences alone. I have a second file with the sequence IDs, which is why I also print out the indicies from the matching strings. The data look like this.

      ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC
      ATGGAGAACATCACATCACGACTCCTAGGACCCCTTCACGTGAAACAGGC
      ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG
      ATGGTGTACATCACGACAGGATTCCTCGGAATCGCGCTGGTGACACAGGC

      With the sequence IDs the data would look like this.

      >seq1
      ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC
      >seq2
      ATGGAGAACATCACATCACGACTCCTAGGACCCCTTCACGTGAAACAGGC
      >seq3
      ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG
      >seq4
      ATGGTGTACATCACGACAGGATTCCTCGGAATCGCGCTGGTGACACAGGC

      ### update ###

      I have placed real data in my public scratchpad.

        So, just to see if I understand the task... given those four (truncated?) lines of input data, would the following be the "right" answer?
        LCS for 0 :: 1 = |ATGGAGAACATCACATCA| LCS for 0 :: 2 = |TCACATCAGGACTCCTAGGACC| LCS for 0 :: 3 = |CATCAC| LCS for 1 :: 2 = |ACTCCTAGGACC| LCS for 1 :: 3 = |CATCAC| LCS for 2 :: 3 = |CAGGA|
        This doesn't keep track of the actual index offsets where the longest match actually starts in each string for each pairwise comparison, but that would be easy to add.

        That's the output from the code posted in my later reply in this thread, given those four lines of sample data as input.

        I get these results:

        484593-3 Best match len:22 betwixt 0:10 & 2:10 ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC TCACATCAGGACTCCTAGGACC ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG TCACATCAGGACTCCTAGGACC 6 trials of N:4 L:50 MIN:2 ( 4.098ms total), 682us/trial

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Search for identical substrings
by graff (Chancellor) on Aug 17, 2005 at 21:42 UTC
    I think the code you posted doesn't make a lot of sense, given the problem that you described at the outset. If you're looking for the longest common substring in two strings that are each 3k characters long, and you're doing this across all distinct pairings of 300 such strings, you probably want to look at Algorithm::Diff and its "LCS" function.

    I haven't used it much myself; looking at the man page, it seems like you might need to split each string into an array of characters -- something like this (not tested):

    #!/usr/bin/perl use strict; use Algorithm::Diff qw/LCS/; my @strings; # fill @strings with your 300 elements of 3k characters each, then ... for my $i ( 0 .. $#strings-1 ) { my @seq1 = split //, $strings[$i]; for my $j ( $i+1 .. $#strings ) { my @seq2 = split //, $strings[$j]; my @lcs = LCS( \@seq1, \@seq2 ); print "LCS for $i :: $j is ", join( "",@lcs,"\n" ); } }
    I'm just guessing about that. Do heed Grandfather's request, and show us some data (and maybe some other code you've tried that really is more relevant).

    (updated code to fix a variable-name typo and the comment, and to declare @strings; also added spacing in the print statement, to avoid misuse of "::" as a namespace designation. Tested it on a sample text file (380 lines but less than 100 chars/line), and it seemed to do what's desired** -- not tremendously fast, but not impossibly slow.)

    (** second update: ** then again, looking at the output, it's not clear that my script is actually making correct use of the module -- I don't seem to be getting a single contiguous "LCS" string from each comparison as I was expecting, but rather a bunch of fragments that are each one or more characters long. Better read further in the man page...)

      I cobbled the code together by breaking my problem into discrete tasks. I then started writing code to accomplish each task and testing the code to ensure it worked. As I got each task working I added code to accomplish the next task. I admit it only makes sense to me because I know what the original problem was and the steps I took to solve each task. It aint pretty. I have been writing small perl scripts for more than a year but I still don't consider myself the be a perl programmer.

      I need to work through all the thoughtful suggestions I have received, and I thank everyone for their help.

      Ultimately all my code revolves around the regex

      /(.{$min,}).*\1/gi

      I realize that the regex is the rate limiting step in this process. Of course $min is necessary because allowing regex to look for all matches is prohibitive. Being ignorant of the subtleties of perl I don't know if there are other ways to effectively match substrings.

Re: Search for identical substrings
by QM (Vicar) on Aug 17, 2005 at 22:53 UTC
    There is not a "fast" way to do this, only "faster". See QOTW#E14 for how to do this with one string.

    I'm sure there are other ways, such as using XOR, maybe like this:

    my $xor = substr($data[$d1],$s1,$L) ^ substr($data[$d2],$s2,$L);
    See this for a working program:

    Update: The previous code only checked half of the possibilities. I've replaced it with this code:

    Update 2: The previous code missed multiple matches in the regex, fixed.

    Update 2 (continued: Comparing with some of the other entries, this performs moderately. On large input sets it's an order of magnitude slower than Grandfather's. It's weakness is that, once the LCS is found, it has to prove it's the longest by continuing to search.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Search for identical substrings
by graff (Chancellor) on Aug 17, 2005 at 23:12 UTC
    (Hope no one minds me making a second reply -- updating gets tiresome after a while...)

    Well, a lot really hinges on what you mean, exactly, when you say:

    ...search for the longest matching substrings between the members of the array...

    If what you mean is something like this:

    $string[0] = "abc fghijklmn "; $string[1] = " b d fgh jklmno"; # correct answer is: "jklmn", 5 chars long; # (next longest common substring is " fgh", # but you're not interested in that)
    then Algorithm::Diff can be used to arrive at that answer for each pair-wise string comarison as follows:
    #!/usr/bin/perl use strict; use Algorithm::Diff qw/traverse_sequences/; my @strings = <>; chomp @strings; my ( $longestMatch, $currentMatch ); my ( @seq1, @seq2 ); for my $i ( 0 .. $#strings-1 ) { @seq1 = split //, $strings[$i]; for my $j ( $i+1 .. $#strings ) { @seq2 = split //, $strings[$j]; $longestMatch = $currentMatch = ""; traverse_sequences( \@seq1, \@seq2, { MATCH => \&add2match, DISCARD_A => \&end_match, DISCARD_B => \&end_match } ); ### update: add a direct call to "end_match()" here, ### in case traverse was still matching at end-of-string: end_match( 'EOS' ); print "LCS for $i :: $j = |$longestMatch|\n"; } } sub add2match { my ( $ptrA, $ptrB ) = @_; $currentMatch .= $seq1[$ptrA]; # warn "match at i=$ptrA, j=$ptrB : =$seq1[$ptrA]= ; cm=$currentMat +ch=\n"; } sub end_match { $longestMatch = $currentMatch if ( length( $longestMatch ) < length( $currentMatch )); $currentMatch = ""; # warn "match ended at @_ : lm=$longestMatch=\n"; }
    If you take out the comment marks on the "warn" statements in the two callbacks, you'll be able to watch what's going on. With those commented out, a file of 382 lines (between 2 and 73 characters per line, about 73K pair-wise comparisons) was finished in about 6 minutes.

    I expect there are cleaner ways to do it (e.g. that don't involve global-scope variables), but this was fairly quick and easy (fun, even) for a first-time A::D user.

      If I'm reading the information right for Algorithm::Diff it will not return what I want. LCS appears to use a "distance" measure. That is it determines the distance over which two strings have the most information in common. This is done by computing hits and misses. The maximum hit count will give the longest distance over which the two strings share commonality. Usually these types of algorithms have a penalty for misses. Nonetheless as I understand LCS, if we are using the strings "banana is split" and "bananas split" we can line up the strings a couple of ways.

      banana is split bananas split banana--s # 7 characters in common and two misses "-" or banana is split bananas split ana--s split # 10 characters in common and two misses

      Allowing the strings to flex by putting holes in the strings we get...

      banana is split banana..s split # 13 characters in common and two "holes"

      ...by putting two holes between "a" and "s" in "bananas"

      So "bananas split" (removing the holes from "banana..s split") would be the result of LCS, but I want "s split" with a hit count of 7, no misses and no holes.

        ... but I want "s split" with a hit count of 7, no misses and no holes.

        Based on that, I checked that example against my script, and found that I needed to call "end_match()" after "travers_sequence()" was done, in case a match exends to the end of both strings.

        With that update, the script does what you want in this case. (But for the other sample data you posted above, I think it doesn't come up with the best answer in some cases, and I don't understand why, as yet.)

        The problem is that the "LCS" in Algorithm::Diff is Longest Common Sub-Sequence, but for your requirements you need Longest Common String.

        The difference being the latter is contiguous, whilst the former is not (need not be).

        You're after this?

        P:\test>lcs "banana is split" "banana..s split" banana is split banana..s split s split

        I'm still verifying my algorithm is correct, but so far it appears to be about 10x quicker than the XS version of LCSS despite being pure perl.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Search for identical substrings
by BrowserUk (Pope) on Aug 18, 2005 at 07:54 UTC
    ... the length of my strings (3k characters), and the number of elements (300) leads to prohibitive times for my search. It took a week just to check one element of the array against every other element.

    Can you confirm this please. Your current method took 1 week to do 299 LCSs. Which as you have (300 * 299)/2 = 44,850 to do, this would take 150 weeks to perform the processing?

    If so, I think I can help you. I believe I can get that down to 67 hours. But, as this is so much quicker (than both your current method and a couple of others I have tried), I would very much like to verify my program against some known data.

    So, if you could let us/me have say 5 of your 3k strings, and the LCS that your current method finds + the time taken, I could check what I have against your findings before exposing any stupidities to the world.

    TIA.

    Alternatively, I could provide my test data for you to try.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

      Your time estimates agree with mine. I calculated a time of completion of 3 years.

      Thank you for your offer. I would like to look at all my options first, including, abandoning the project, optimizing the data be removing redundant sequences (no easy task given the lack of documentation for some of my data), or subclassing the data into smaller sets of sequences.

      I would also like to look at the other responses I've received, but I will not forget your offer.

        Can you generate a data set that is representative of the problem and put it in your scratchpad?


        Perl is Huffman encoded by design.
Re: Search for identical substrings
by hv (Parson) on Aug 19, 2005 at 01:44 UTC

    I recommend taking a look at the "longest common substring" section of this page on Dynamic Programming. These algorithms are pretty simple, and pretty fast.

    Note that this type of iteration would benefit massively from converting to C - I'd recommend using Inline::C to convert just that one function.

    (If time permits over the weekend I might have a go at that. But it'd be nice to have a decent data set to test it against.)

    Hugo

      The longest common substring algorithm on that page is (m * n) time but requires (m * n) space as well. Contrast to the naive solution, which is (m * m * n) time but (m + n) memory.

      That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.

      So, the name of this site notwithstanding, here's some C code, poorly tested

      Maybe this would be a good time for me to learn Inline::C.

        Here's my Inline C implementation.

        #! perl -slw use strict; #use Inline 'INFO'; use Inline C => 'DATA', NAME => 'LCS', CLEAN_AFTER_BUILD => 1; my( $len, $offset0, $offset1 ) = LCS( @ARGV ); $ARGV[ 0 ] =~ s[(.{$offset0})(.{$len})][$1<$2>]; $ARGV[ 1 ] =~ s[(.{$offset1})(.{$len})][$1<$2>]; print for @ARGV; __END__ [ 9:10:28.57] P:\test>DynLCS-C hello aloha hel<lo> a<lo>ha __C__ #define IDX( x, y ) (((y) * an)+(x)) /* LONGEST COMMON SUBSTRING(A,m,B,n) for i := 0 to m do Li,0 := 0 for j := 0 to n do L0,j := 0 len := 0 answer := <0,0> for i := 1 to m do for j := 1 to n do if Ai ? Bj then Li,j := 0 else Li,j := 1 + Li-1,j-1 if Li,j > len then len := Li,j answer = <i,j> */ void LCS ( char* a, char*b ) { Inline_Stack_Vars; int an = strlen( a ); int bn = strlen( b ); int*L; int len = 0; int answer[2] = { 0,0 }; int i, j; Newz( 42, L, an * bn, int ); for( i = 1; i < an; i++ ) { for( j = 1; j < bn; j++ ) { if( a[ i ] != b[ j ] ) { L[ IDX(i,j) ] = 0; } else { L[ IDX(i,j) ] = 1 + L[ IDX(i-1, j-1) ]; if( L[ IDX(i,j) ] > len ) { // xs(70) len = L[ IDX(i,j) ]; answer[ 0 ] = i; answer[ 1 ] = j; } } } } Safefree( L ); Inline_Stack_Reset; Inline_Stack_Push(sv_2mortal(newSViv( len ))); Inline_Stack_Push(sv_2mortal(newSViv( answer[ 0 ] - len + 1 ))); Inline_Stack_Push(sv_2mortal(newSViv( answer[ 1 ] - len + 1 ))); Inline_Stack_Done; }

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Search for identical substrings
by GrandFather (Cardinal) on Aug 19, 2005 at 22:59 UTC

    Are there usefull constraints on the sub-string match such as a minimum interesting length, or a match may only start at every nth character or a constraint on the maximum match length?

    Note that it would help for people copying the data from your scratch pad if the data were between <code> </code> tags.

    Update: I guess the comment in the OP code means that 200 characters is the minimum usefull match?

    Perl is Huffman encoded by design.

      I have put <code> tages around the data on my scratchpad.

      Yes, a match below 200 charachers doesn't have much meaning. Strings below 200 charachters can occur by random chance with too high a frequency.

      I would also like to know how many different identical substrings greater than 200 characters exist between pairs of the 3k strings. I hope this statement makes sense.

Re: Search for identical substrings
by GrandFather (Cardinal) on Aug 21, 2005 at 02:19 UTC

    I've posted code here that seems to be about 7000 times faster than BrowserUk's solution result posted here and therefore about 3 million times faster than OP's original code.

    The six strings provided in bioMan's scratchpad were used as the test set with the same match result as BrowserUk's in about 0.01 seconds. Note that the claimed match position given in BrowserUk's result is wrong (checked by a manual examination of the data), but the match length is right.


    Perl is Huffman encoded by design.
      Note that the claimed match position given in BrowserUk's result is wrong (checked by a manual examination of the data), but the match length is right.

      The offsets were all +10 because I failed to remove the sequence labels from the strings. This is now corrected.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      I've posted code here that seems to be about 7000 times faster than BrowserUk's solution

      1. The results I posted were those for finding all the LCSs for all pairings. If I only search for the single longest LCS, and make the same minimum match assumption as you:
        [ 8:05:03.68] P:\test>484593-3 -MIN=128 bioman.dat Best match len:1271 between string 0:82 & string 2:82 15 trials of bioman.dat ( 5.443s total), 362.894ms/trial

        Which means your still 500+ X faster, but it allows me to save face (a little :).

      2. You're making an assumption in your code that you cannot make for a general solution, that of a minimum match length of 128, and produce no output if you set it too high.

        Whilst your comments say that this can be adjusted to a lower power of 2, when I tried this, I got strangely 'almost correct' values for both length and offset. I tracked this down to ...

      3. Your algorithm only produces the correct output if the input strings are an exact multiple of the minimum size specified.

        To demonstrate, I trimmed 1 character from the front of each line in biomain's data and ran your code with $subStrSize = 128 and got:

        P:\test>gf bioman2.dat Completed in 0.009606 Best match: >string 1 - >string 3 . 1273 characters starting at 80 an +d 80.

        Note that the length has increased by 2 when it should be the same, and both offsets have reduced by 2 when they should have reduced by 1?

        I also tried setting $subStrSize = 1, as thought that might correct the problem, but apart from running much more slowly, it still produced the same, slightly wrong output:

        [ 7:49:55.56] P:\test>gf bioman2.dat Completed in 76.466764 Best match: >string 1 - >string 3 . 1273 characters starting at 80 an +d 80.

      All that said, your algorithm is blazingly fast and the anomolies can probably be corrected.

      If so, or if bioman only needs the single LCS, and can live with specifying a minimum match and arranging for his input data to always be a multiple of that minimum size, it blow's what I have into dust.

      At least, until I can adapt my code to use elements of your search technique :). Nice++


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

        Actually the code does generate all the LCS's. I used to print those out. If you are interested you can add back the line by copying the print for best match, pasting it after the line $localBest = expandMatch (@$localBest); and replacing $bestMatch with $localBest. The run time remains about the same :).

        I noticed the anomolies, but was so excited by the run rate that I sort of forgot about them and posted the code. The problem will be in the expandMatch sub which does a binary search to extend the match at each end to the full extent of the matched run. It's probably trivial to fix and I'll have a look and update the posted code when I find the bug.

        The minimum match length is where the "blazingly fast" bit comes from. Although it still seems to be pretty fast even for small match lengths.

        This turned out so much faster than anything else that I couldn't quite believe it! I'm pleased that you have reproduced my results (bugs included).


        Perl is Huffman encoded by design.
Re: Search for identical substrings
by tlm (Prior) on Aug 21, 2005 at 06:07 UTC

    Why don't you use Bio::Tools::dpAlign? It implements the standard Needleman-Wunsch algorithm in C. Just give it a huge penalty for gaps.

    the lowliest monk

      I hadn't considered the Needleman-Wunsch because I forgot it was in BioPerl. I cobble scripts for minor text manipulations and haven't looked at too many modules including BioPerl. Being a biologist I really should give it a look.

      On the other hand, I must say, it has been fascinating to read all the responses. The posting perlmonks have brought a lot to my plate, and I'll be reading and re-reading the posts for some time. My near non-existant skills can only improve because of their discussions.

      Thanks for the tip. I'll read through the documentation for Bio::Tools::dpAlign.

      I've looked at the Needleman-Wunsch information and I can probably get what I want by using high gap, gap extention, and mismatch penalties. The problem I see is that I can't specify a lower limit for the length of the match string found by Needleman-Wunsch.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://484593]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (6)
As of 2014-07-26 00:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (175 votes), past polls