comment on

All,
Tonight on #perl in freenode, a user asked "What is the fastest way to find the number of substrings two strings have in common?"

The user had working code but wanted to know if it could be made faster without relying on Inline::C. We asked a number of clarifying questions to flesh out the requirements:

Only the count is required, not the substrings
A substring that matches in multiple places counts only once
The substring length is user defined
Substrings, as defined by the requester, are contiguous. All substrings of 'ABC' are A, B, C, AB, BC, ABC

My idea was to generate an unpack template programmatically that would return all the substrings at once. It would then be a simple matter of returning the count of intersecting substrings. It seemed like this could be done without any explicit loops. My solution follows:

It seemed to me that the challenge would be much more interesting if the substring length were allowed to be a range so that's worth bonus points. What is the fastest solution you can come up with in pure perl?

Update: Added definition of substring as well as minor code change assigning hash slice to empty list (f00li5h++).

Update: 2007-04-04 08:25 EST - Thanks to everyone who has contributed so far. I will be posting a Benchmark after work so that folks who want to optimize for a given data set may.

Update: 2007-04-05 08:14 EST - Two days after scheduling to transfer my service to another provider, my ISP mysteriously starts having problems with my account - coincidence? I would appreciate it if someone could post a Benchmark assuming 5,000 pairs of strings ranging in length from 30 to 50 lowercase letters with a desired substring length ranging from 3 to 7.

Cheers - L~R

In reply to Challenge: Fast Common Substrings by Limbic~Region

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Perl: the Markov chain saw
	PerlMonks