Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Tonight on #perl in freenode, a user asked "What is the fastest way to find the number of substrings two strings have in common?"

The user had working code but wanted to know if it could be made faster without relying on Inline::C. We asked a number of clarifying questions to flesh out the requirements:

  • Only the count is required, not the substrings
  • A substring that matches in multiple places counts only once
  • The substring length is user defined
  • Substrings, as defined by the requester, are contiguous. All substrings of 'ABC' are A, B, C, AB, BC, ABC
My idea was to generate an unpack template programmatically that would return all the substrings at once. It would then be a simple matter of returning the count of intersecting substrings. It seemed like this could be done without any explicit loops. My solution follows:

It seemed to me that the challenge would be much more interesting if the substring length were allowed to be a range so that's worth bonus points. What is the fastest solution you can come up with in pure perl?

Update: Added definition of substring as well as minor code change assigning hash slice to empty list (f00li5h++).

Update: 2007-04-04 08:25 EST - Thanks to everyone who has contributed so far. I will be posting a Benchmark after work so that folks who want to optimize for a given data set may.

Update: 2007-04-05 08:14 EST - Two days after scheduling to transfer my service to another provider, my ISP mysteriously starts having problems with my account - coincidence? I would appreciate it if someone could post a Benchmark assuming 5,000 pairs of strings ranging in length from 30 to 50 lowercase letters with a desired substring length ranging from 3 to 7.

Cheers - L~R

In reply to Challenge: Fast Common Substrings by Limbic~Region

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    [erix]: then you might as well send that patch to the DBIC guys :)
    [Corion]: erix: Yeah, I just found that it has no documentation at all on how to circumvent/ eliminate "1+n SELECTs" by building a local hash... I guess I have to make ->has_many do the hash lookup instead of doing the SQL query. But as the problem ...
    [Corion]: ... has only manifested itself so far through the puzzled questions of other bystanders, I won't go deeper at this time. But the DBIx::Class documentation could well do with a document on how to make "it" (that is, ORMs in general) faster ;)
    [Corion]: I find that DBIx::Class, like most ORMs makes things easy until they become performance critical and then makes it horribly hard to change things because the design is highly inflexible if you don't already know about the problems of 1+n :-/

    How do I use this? | Other CB clients
    Other Users?
    Others taking refuge in the Monastery: (7)
    As of 2017-09-25 11:06 GMT
    Find Nodes?
      Voting Booth?
      During the recent solar eclipse, I:

      Results (279 votes). Check out past polls.