Re: Challenge: Fast Common Substrings

in reply to Challenge: Fast Common Substrings

Just for the sake of completeness: a fast and elegant algorithm for this is a tricky use of suffix trees. One concatenates the two strings of length n and m, joined by a separator and closed by a terminator, say abcdef%efgab$. A suffix tree of this string can be constructed in O(n+m) (Ukkonen's algorithm). To find the common substrings, one then searches for nodes that have exactly two (or the number of strings) leaves belonging to the different words. The resulting suffix tree for "abcdef" and "efgab":

     |      |(3:cdef%efgab$)|leaf
     |(1:ab)|
     |      |(13:$)|leaf
tree:|
     |     |(3:cdef%efgab$)|leaf
     |(2:b)|
     |     |(13:$)|leaf
     |
     |(3:cdef%efgab$)|leaf
     |
     |(4:def%efgab$)|leaf
     |
     |      |(7:%efgab$)|leaf
     |(5:ef)|
     |      |(10:gab$)|leaf
     |
     |     |(7:%efgab$)|leaf
     |(6:f)|
     |     |(10:gab$)|leaf
     |
     |(7:%efgab$)|leaf
     |
     |(10:gab$)|leaf
     |

So "ab" has two leaves in the different words (position <= 7 for leaf 1 and position > 7 for leaf 2). The same holds for 'b', 'ef' and 'f'.
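To make the position check concrete, here is a minimal Python sketch. It is not Ukkonen's algorithm: it builds an uncompressed suffix trie by inserting every suffix, which takes quadratic time, and it only illustrates the "leaves from both words" test. The names `Node` and `common_substrings` are made up for this example, and the code assumes that the separator `%` and terminator `$` occur in neither word.

```python
class Node:
    """One trie node; the flags record which word the suffixes passing through start in."""
    def __init__(self):
        self.children = {}
        self.in_first = False
        self.in_second = False


def common_substrings(first, second, sep="%", end="$"):
    text = first + sep + second + end
    boundary = len(first) + 1              # 1-based position of the separator (7 in the example)
    root = Node()

    for start in range(1, len(text) + 1):  # 1-based start of each suffix
        node = root
        for ch in text[start - 1:]:
            if ch == sep or ch == end:     # a common substring cannot cross these
                break
            node = node.children.setdefault(ch, Node())
            if start <= boundary:
                node.in_first = True       # suffix starts in the first word
            else:
                node.in_second = True      # suffix starts in the second word

    found = []

    def walk(node, prefix):
        for ch, child in sorted(node.children.items()):
            word = prefix + ch
            if child.in_first and child.in_second:
                found.append(word)
                walk(child, word)          # extensions of a non-common string cannot be common

    walk(root, "")
    return found


print(common_substrings("abcdef", "efgab"))
# ['a', 'ab', 'b', 'e', 'ef', 'f']
```

Because the trie is uncompressed, it also reports 'a' and 'e', which the compressed suffix tree above keeps implicit on the edges leading to "ab" and "ef".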

http://en.wikipedia.org/wiki/Longest_common_substring_problem

Update: Just found some Perl code with Google ... on PerlMonks ;) Re: finding longest common substring


Re^2: Challenge: Fast Common Substrings by blokhead (Monsignor) on Apr 04, 2007 at 16:02 UTC
++ Wow, thank you for introducing me to suffix trees. What an interesting concept, and how refreshing to see a linear-time algorithm for constructing such a creature. I see you've used the JavaScript applet at this page, which others may want to check out.

However, I'd like to slightly revise the algorithm you outlined. Consider the following example:

string = ababc%bc$

     |      |(3:abc%bc$)|leaf
     |(1:ab)|
     |      |(5:c%bc$)|leaf
tree:|
     |     |(3:abc%bc$)|leaf
     |(2:b)|
     |     |     |(6:%bc$)|leaf
     |     |(5:c)|
     |     |     |(9:$)|leaf
     |
     |     |(6:%bc$)|leaf
     |(5:c)|
     |     |(9:$)|leaf
     |
     |(6:%bc$)|leaf
     |
     |(9:$)|leaf

"ab" appears twice in the first string, and so it gives a node with two leaves. The actual condition you should check is whether a node has one leaf containing the % separator and another leaf without the % symbol.

blokhead
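The difference between the two conditions can be checked on this example with a few lines of Python. This is only a sketch: it does not build the tree, but relies on the fact that the leaves below the node spelling some label correspond to the suffixes of the text that start with that label. The helper names `suffixes_below`, `two_leaves` and `mixed_leaves` are invented for illustration.

```python
def suffixes_below(text, label):
    """Suffixes of text that start with label, i.e. the leaves below that node."""
    return [text[i:] for i in range(len(text)) if text.startswith(label, i)]


def two_leaves(text, label):
    """The condition from the original post: exactly two leaves below the node."""
    return len(suffixes_below(text, label)) == 2


def mixed_leaves(text, label, sep="%"):
    """The revised condition: one leaf still contains the separator, another does not."""
    leaves = suffixes_below(text, label)
    return any(sep in leaf for leaf in leaves) and any(sep not in leaf for leaf in leaves)


text = "ababc%bc$"
for label in ("ab", "b", "bc", "c"):
    print(label, two_leaves(text, label), mixed_leaves(text, label))
# ab True False
# b False True
# bc True True
# c True True
```

Counting leaves gets "ab" wrong (two leaves, both in the first word) and also misses "b", which has three leaves because it occurs twice in the first word.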
Re^3: Challenge: Fast Common Substrings (O(n)?) by tye (Sage) on Apr 04, 2007 at 21:53 UTC
The page you link to mentions being able to build them in O(n) but then only really describes how to go from a suffix tree for string $x to one for string $x.$c (1 == length $c) in O(length $x). Using that algorithm would require O(N*N) to build the suffix tree for a string of length N. So I'm not sure I believe the O(N) claim for building the whole suffix tree based on that page.

- tye
Re^4: Challenge: Fast Common Substrings (O(n)?) by lima1 (Curate) on Apr 04, 2007 at 22:12 UTC
The naive algorithm requires O(N*N). Ukkonen's algorithm needs only O(N). It is not trivial; if you want to understand it, I recommend Gusfield's book (Algorithms on Strings,...).
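To see where the quadratic cost comes from, here is a small Python sketch that takes "naive" to mean inserting every suffix, character by character, into an uncompressed trie. That reading is an assumption, and so is the function name `naive_suffix_trie_steps`. The sketch counts how many characters get processed and does not attempt Ukkonen's linear-time construction.

```python
def naive_suffix_trie_steps(text):
    """Insert every suffix into a nested-dict trie and count the characters processed."""
    root = {}
    steps = 0
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # follow the child, creating it if missing
            steps += 1
    return steps


print(naive_suffix_trie_steps("abcdef%efgab$"))      # 91
print(naive_suffix_trie_steps("abcdef%efgab$" * 2))  # 351
```

The count is always N*(N+1)/2 for a string of length N, so doubling the input roughly quadruples the work.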
Re^3: Challenge: Fast Common Substrings by lima1 (Curate) on Apr 04, 2007 at 16:25 UTC
Or even easier: check the positions of the substrings (<= 7 and > 7 in my example).

In Section Seekers of Perl Wisdom