Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I'd build a trie but have it track how many strings so far have contained each substring. Then I'd track the length of the longest substring so far found in enough strings. When a substring's find count reaches the threshold, if it is the same length as the longest, then I'd add it to the "found" list. If it is longer, then I'd update the max length and replace the "found" list with just that one substring.

Since a single substring could appear multiple times in each string, you also need to track the substrings that you have, so far, entered into the trie for the current string (which I'd just do in a hash).

That method is efficient enough that I ran it against all 172,819 words in my copy of the "enable1" word list, and just had it simultaneously compute the answers for all counts. Producing, in part:

Found 1 longest substrings in at least 120964 different strings: e Found 1 more (2 total) in at least 103913 different strings: s Found 1 more (3 total) in at least 102028 different strings: i Found 1 more (4 total) in at least 93844 different strings: a Found 1 more (5 total) in at least 90701 different strings: r Found 1 more (6 total) in at least 84198 different strings: n Found 1 more (7 total) in at least 83358 different strings: t Found 1 more (8 total) in at least 79367 different strings: o ... Found 1 more (15 total) in at least 38116 different strings: g Found 1 longest substrings in at least 33923 different strings: es Found 1 more (2 total) in at least 31096 different strings: in Found 1 more (3 total) in at least 30960 different strings: er Found 1 more (4 total) in at least 22962 different strings: ti ... Found 1 more (14 total) in at least 15277 different strings: is Found 1 longest substrings in at least 15241 different strings: ing Found 1 more (2 total) in at least 7831 different strings: ess Found 1 more (3 total) in at least 7800 different strings: ion ... Found 1 more (8 total) in at least 6567 different strings: tio Found 1 longest substrings in at least 6330 different strings: tion Found 1 more (2 total) in at least 5771 different strings: ness Found 1 more (3 total) in at least 4474 different strings: atio Found 1 longest substrings in at least 4441 different strings: ation Found 1 more (2 total) in at least 3065 different strings: esses Found 1 more (3 total) in at least 2883 different strings: nesse Found 1 longest substrings in at least 2871 different strings: nesses Found 1 more (2 total) in at least 1968 different strings: ations Found 1 more (3 total) in at least 1030 different strings: ically Found 1 more (4 total) in at least 873 different strings: zation Found 1 longest substrings in at least 869 different strings: ization Found 1 more (2 total) in at least 724 different strings: ousness Found 1 more (3 total) in at least 722 different strings: enesses Found 1 more (4 total) in at least 592 different strings: inesses Found 1 more (5 total) in at least 570 different strings: ilities Found 1 longest substrings in at least 523 different strings: bilities Found 1 more (2 total) in at least 431 different strings: izations Found 1 longest substrings in at least 393 different strings: abilities Found 1 more (2 total) in at least 362 different strings: ousnesses Found 1 more (3 total) in at least 270 different strings: alization Found 1 more (4 total) in at least 258 different strings: ification Found 1 more (5 total) in at least 219 different strings: ivenesses Found 1 more (6 total) in at least 214 different strings: blenesses Found 1 more (7 total) in at least 181 different strings: lizations Found 1 longest substrings in at least 169 different strings: ablenesses Found 1 more (2 total) in at least 162 different strings: iousnesses Found 1 more (3 total) in at least 139 different strings: lessnesses Found 1 more (4 total) in at least 135 different strings: alizations Found 1 more (5 total) in at least 130 different strings: tivenesses Found 1 more (6 total) in at least 126 different strings: ifications Found 1 more (7 total) in at least 86 different strings: tabilities Found 2 more (9 total) in at least 82 different strings: sivenesses ologically Found 1 more (10 total) in at least 80 different strings: ographical Found 1 more (11 total) in at least 67 different strings: tification Found 2 more (13 total) in at least 60 different strings: nstitution nalization Found 1 more (14 total) in at least 55 different strings: rabilities Found 3 more (17 total) in at least 54 different strings: stitutiona rousnesses raphically Found 2 longest substrings in at least 53 different strings: stitutional graphically Found 1 longest substrings in at least 51 different strings: nstitutional Found 1 more (2 total) in at least 48 different strings: ographically Found 1 longest substrings in at least 41 different strings: nstitutionali Found 1 more (2 total) in at least 31 different strings: institutional Found 1 longest substrings in at least 27 different strings: institutionali Found 1 more (2 total) in at least 25 different strings: nstitutionaliz Found 1 more (3 total) in at least 20 different strings: constitutional Found 1 longest substrings in at least 19 different strings: institutionaliz Found 1 more (2 total) in at least 14 different strings: constitutionali Found 1 more (3 total) in at least 13 different strings: nstitutionalize Found 1 longest substrings in at least 12 different strings: einstitutionaliz Found 1 longest substrings in at least 10 different strings: electroencephalogra Found 1 longest substrings in at least 8 different strings: electroencephalograph Found 1 more (2 total) in at least 4 different strings: einstitutionalization Found 1 longest substrings in at least 3 different strings: electroencephalographi Found 1 longest substrings in at least 2 different strings: ethylenediaminetetraacetate

Note that I was lazy and had it just track (eventually) all 120,964 "answers" and whittled it down to the 91 different answers as the last step. It'd be significantly even more efficient (in both space and time) if I had only tracked the unique answers as it went along. But that probably would have taken more of my time in programming than just waiting a few minutes for the above results took. The script as-is took very little time to write.

No, I didn't include the code. This looks like a homework problem so I won't be posting the code for at least a couple of weeks.

- tye        


In reply to Re: longest common substring (with needed tweaks) (trie) by tye
in thread longest common substring (with needed tweaks) by R56

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2024-04-24 13:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found