Re: Challenge: Fast Common Substrings

by lima1 (Curate)
on Apr 04, 2007 at 15:50 UTC (#608306)

in reply to Challenge: Fast Common Substrings

Just for the sake of completeness: a fast and elegant algorithm for this is a clever use of suffix trees. One concatenates the two strings of lengths n and m, each followed by its own terminator character, say abcdef%efgab$. It is possible to construct a suffix tree of this string in O(n+m) time (Ukkonen's algorithm). To find the common substrings, one then searches for nodes that have exactly two leaves (or as many as there are strings) belonging to the different words. The resulting suffix tree for "abcdef" and "efgab":
     |      |(3:cdef%efgab$)|leaf
     |(1:ab)|
     |      |(13:$)|leaf
tree:|
     |     |(3:cdef%efgab$)|leaf
     |(2:b)|
     |     |(13:$)|leaf
     |
     |(3:cdef%efgab$)|leaf
     |
     |(4:def%efgab$)|leaf
     |
     |      |(7:%efgab$)|leaf
     |(5:ef)|
     |      |(10:gab$)|leaf
     |
     |     |(7:%efgab$)|leaf
     |(6:f)|
     |     |(10:gab$)|leaf
     |
     |(7:%efgab$)|leaf
     |
     |(10:gab$)|leaf
     |
So "ab" has two leaves in different words (position <= 7 for leaf 1 and position > 7 for leaf 2). The same holds for 'b', 'ef' and 'f'.
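To make the idea concrete, here is a minimal sketch in Python. It is not Ukkonen's algorithm: it builds a plain, uncompressed suffix trie in quadratic time, which is only reasonable for short strings. The function name `common_substrings` and the node layout are made up for illustration, and it assumes the characters % and $ do not occur in either input. A suffix is assigned to a string by its start position relative to the separator.

```python
def common_substrings(a, b, sep="%", end="$"):
    """Return every substring shared by a and b (naive trie, quadratic size)."""
    text = a + sep + b + end
    boundary = len(a)  # 0-based index of the separator in text
    root = {"kids": {}, "sides": set()}

    # Insert each suffix, stopping at a separator so that no path spans both
    # strings. Every node on the path remembers which string the suffix
    # started in, decided purely by the start position.
    for start in range(len(text)):
        side = 1 if start <= boundary else 2
        node = root
        for ch in text[start:]:
            if ch in (sep, end):
                break
            node = node["kids"].setdefault(ch, {"kids": {}, "sides": set()})
            node["sides"].add(side)

    found = set()

    def walk(node, prefix):
        for ch, kid in node["kids"].items():
            # A node reached from both strings spells a common substring.
            # If it is not shared, nothing below it can be, so stop there.
            if kid["sides"] == {1, 2}:
                found.add(prefix + ch)
                walk(kid, prefix + ch)

    walk(root, "")
    return found


print(sorted(common_substrings("abcdef", "efgab")))
# ['a', 'ab', 'b', 'e', 'ef', 'f']
```

Because the trie is uncompressed, it also reports 'a' and 'e', which in the suffix tree above are hidden inside the edge labels "ab" and "ef".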

Update: Just found some Perl code with Google ... on PerlMonks ;) Re: finding longest common substring

Replies are listed 'Best First'.
Re^2: Challenge: Fast Common Substrings
by blokhead (Monsignor) on Apr 04, 2007 at 16:02 UTC
    ++ Wow, thank you for introducing me to suffix trees. What an interesting concept, and how refreshing to see a linear-time algorithm for constructing such a creature. I see you've used the javascript applet at this page, which others may want to check out.

    However, I'd like to slightly revise the algorithm you outlined. Consider the following example:

    string = ababc%bc$

         |      |(3:abc%bc$)|leaf
         |(1:ab)|
         |      |(5:c%bc$)|leaf
    tree:|
         |     |(3:abc%bc$)|leaf
         |(2:b)|
         |     |     |(6:%bc$)|leaf
         |     |(5:c)|
         |     |     |(9:$)|leaf
         |
         |     |(6:%bc$)|leaf
         |(5:c)|
         |     |(9:$)|leaf
         |
         |(6:%bc$)|leaf
         |
         |(9:$)|leaf

    "ab" appears twice in the first string, and so it gives a node with two leaves. The actual condition you should check is whether a node has one leaf containing the % separator and another leaf without the % symbol.
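The revised condition can be tried out without building a tree at all. The sketch below is a brute-force stand-in, not a suffix tree: every suffix that begins with a given path label plays the role of one leaf below that node, and the whole suffix is tested for the separator instead of only the leaf's edge label (the two agree as long as the path label itself contains no %). The name `node_is_common` is hypothetical.

```python
def node_is_common(path, text, sep="%"):
    """Brute-force stand-in for inspecting the leaves below one tree node."""
    # Each suffix that begins with `path` corresponds to one leaf below the node.
    leaves = [text[i:] for i in range(len(text)) if text.startswith(path, i)]
    with_sep = any(sep in leaf for leaf in leaves)         # starts in string 1
    without_sep = any(sep not in leaf for leaf in leaves)  # starts in string 2
    return with_sep and without_sep


text = "ababc%bc$"
print(node_is_common("ab", text))  # False: both leaves still contain %
print(node_is_common("b", text))   # True
print(node_is_common("c", text))   # True
```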


      The page you link to mentions being able to build them in O(n) but then only really describes how to go from a suffix tree for string $x to one for string $x.$c (1==length$c) in O(length $x). Using that algorithm would require O(N*N) to build the suffix tree for a string of length N.

      So I'm not sure I believe the O(N) claim for building the whole suffix tree based on that page.

      - tye        

        The naive algorithm requires O(N*N). Ukkonen's algorithm needs only O(N). If you want to understand it (it is not trivial), I recommend Gusfield's book (Algorithms on Strings, ...).
      Or even easier: check the positions of the substrings (<=7 and > 7 in my example).
