Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: longest common substring (with needed tweaks) (trie)

by tye (Cardinal)
on Oct 27, 2013 at 20:14 UTC ( #1059932=note: print w/ replies, xml ) Need Help??


in reply to longest common substring (with needed tweaks)

I'd build a trie but have it track how many strings so far have contained each substring. Then I'd track the length of the longest substring so far found in enough strings. When a substring's find count reaches the threshold, if it is the same length as the longest, then I'd add it to the "found" list. If it is longer, then I'd update the max length and replace the "found" list with just that one substring.

Since a single substring could appear multiple times in each string, you also need to track the substrings that you have, so far, entered into the trie for the current string (which I'd just do in a hash).

That method is efficient enough that I ran it against all 172,819 words in my copy of the "enable1" word list, and just had it simultaneously compute the answers for all counts. Producing, in part:

Found 1 longest substrings in at least 120964 different strings: e Found 1 more (2 total) in at least 103913 different strings: s Found 1 more (3 total) in at least 102028 different strings: i Found 1 more (4 total) in at least 93844 different strings: a Found 1 more (5 total) in at least 90701 different strings: r Found 1 more (6 total) in at least 84198 different strings: n Found 1 more (7 total) in at least 83358 different strings: t Found 1 more (8 total) in at least 79367 different strings: o ... Found 1 more (15 total) in at least 38116 different strings: g Found 1 longest substrings in at least 33923 different strings: es Found 1 more (2 total) in at least 31096 different strings: in Found 1 more (3 total) in at least 30960 different strings: er Found 1 more (4 total) in at least 22962 different strings: ti ... Found 1 more (14 total) in at least 15277 different strings: is Found 1 longest substrings in at least 15241 different strings: ing Found 1 more (2 total) in at least 7831 different strings: ess Found 1 more (3 total) in at least 7800 different strings: ion ... Found 1 more (8 total) in at least 6567 different strings: tio Found 1 longest substrings in at least 6330 different strings: tion Found 1 more (2 total) in at least 5771 different strings: ness Found 1 more (3 total) in at least 4474 different strings: atio Found 1 longest substrings in at least 4441 different strings: ation Found 1 more (2 total) in at least 3065 different strings: esses Found 1 more (3 total) in at least 2883 different strings: nesse Found 1 longest substrings in at least 2871 different strings: nesses Found 1 more (2 total) in at least 1968 different strings: ations Found 1 more (3 total) in at least 1030 different strings: ically Found 1 more (4 total) in at least 873 different strings: zation Found 1 longest substrings in at least 869 different strings: ization Found 1 more (2 total) in at least 724 different strings: ousness Found 1 more (3 total) in at least 722 different strings: enesses Found 1 more (4 total) in at least 592 different strings: inesses Found 1 more (5 total) in at least 570 different strings: ilities Found 1 longest substrings in at least 523 different strings: bilities Found 1 more (2 total) in at least 431 different strings: izations Found 1 longest substrings in at least 393 different strings: abilities Found 1 more (2 total) in at least 362 different strings: ousnesses Found 1 more (3 total) in at least 270 different strings: alization Found 1 more (4 total) in at least 258 different strings: ification Found 1 more (5 total) in at least 219 different strings: ivenesses Found 1 more (6 total) in at least 214 different strings: blenesses Found 1 more (7 total) in at least 181 different strings: lizations Found 1 longest substrings in at least 169 different strings: ablenesses Found 1 more (2 total) in at least 162 different strings: iousnesses Found 1 more (3 total) in at least 139 different strings: lessnesses Found 1 more (4 total) in at least 135 different strings: alizations Found 1 more (5 total) in at least 130 different strings: tivenesses Found 1 more (6 total) in at least 126 different strings: ifications Found 1 more (7 total) in at least 86 different strings: tabilities Found 2 more (9 total) in at least 82 different strings: sivenesses ologically Found 1 more (10 total) in at least 80 different strings: ographical Found 1 more (11 total) in at least 67 different strings: tification Found 2 more (13 total) in at least 60 different strings: nstitution nalization Found 1 more (14 total) in at least 55 different strings: rabilities Found 3 more (17 total) in at least 54 different strings: stitutiona rousnesses raphically Found 2 longest substrings in at least 53 different strings: stitutional graphically Found 1 longest substrings in at least 51 different strings: nstitutional Found 1 more (2 total) in at least 48 different strings: ographically Found 1 longest substrings in at least 41 different strings: nstitutionali Found 1 more (2 total) in at least 31 different strings: institutional Found 1 longest substrings in at least 27 different strings: institutionali Found 1 more (2 total) in at least 25 different strings: nstitutionaliz Found 1 more (3 total) in at least 20 different strings: constitutional Found 1 longest substrings in at least 19 different strings: institutionaliz Found 1 more (2 total) in at least 14 different strings: constitutionali Found 1 more (3 total) in at least 13 different strings: nstitutionalize Found 1 longest substrings in at least 12 different strings: einstitutionaliz Found 1 longest substrings in at least 10 different strings: electroencephalogra Found 1 longest substrings in at least 8 different strings: electroencephalograph Found 1 more (2 total) in at least 4 different strings: einstitutionalization Found 1 longest substrings in at least 3 different strings: electroencephalographi Found 1 longest substrings in at least 2 different strings: ethylenediaminetetraacetate

Note that I was lazy and had it just track (eventually) all 120,964 "answers" and whittled it down to the 91 different answers as the last step. It'd be significantly even more efficient (in both space and time) if I had only tracked the unique answers as it went along. But that probably would have taken more of my time in programming than just waiting a few minutes for the above results took. The script as-is took very little time to write.

No, I didn't include the code. This looks like a homework problem so I won't be posting the code for at least a couple of weeks.

- tye        


Comment on Re: longest common substring (with needed tweaks) (trie)
Download Code
Re^2: longest common substring (with needed tweaks) (trie)
by R56 (Acolyte) on Oct 28, 2013 at 17:26 UTC

    Hey tye!

    I read some things on suffix and prefix trees, but didn't get very far. Still, your solution looks really good when dealing with a large number of strings. Fortunately I won't have to deal with that many :)

    It's not homework, it's to help me on my job, but thanks for the guidance!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1059932]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (7)
As of 2014-09-03 07:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (35 votes), past polls