If I understand correctly, you're interested in long substrings repeated numerous times. Here's a roughed-out approach based roughly on my Re^3: Fast common substring matching, which is, in turn, based roughly on the bzip algorithm.

For purposes of sanity, take a stab at some length of substring that you simply don't expect to be repeated. Maybe somewhere in the 100 to 1000 range. Make a file of all substrings of that length. (Unix) sort the lines of the file.

Now you just need to look at each line's neighbors to see how long their common prefix is. The longest repeated substring is guaranteed to be a common prefix of two neighboring lines.

To find how many times a substring is repeated, you'll have to keep checking subsequent lines until they don't have enough prefix in common to be interesting.

I haven't written any code for this, and probably won't. But I hope the description is helpful.

Update: I wrote some example code for finding the longest repeated substring. More: I enhanced it to generate a long string (60000 digits) and to limit the substring length, and to time itself. It runs in under 2 seconds on my PC.

```use strict;
use warnings;

use Time::HiRes;

my \$startTime = [Time::HiRes::gettimeofday ()];

my \$str = '';
\$str .= int rand 10 for 1..60000;

my @subs = sort map substr(\$str, \$_, 400), 0..length(\$str)-1;

# Find the longest common prefix between any two neighbors
my \$length = 0;
my \$value = '';
for my \$i (0..\$#subs-1) {
my (\$common) = map length, (\$subs[\$i] ^ \$subs[\$i+1]) =~ /^(\0*)/;
if (\$common > \$length) {
(\$length, \$value) = (\$common, substr(\$subs[\$i], 0, \$common));
}
}

print "\$value is the longest repeated substr\n";

print "Completed in " . Time::HiRes::tv_interval (\$startTime) . "\n";

Caution: Contents may have been coded under pressure.

In reply to Re: Longest repeated string... by Roy Johnson
in thread Longest repeated string... by Yzzyx

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Are you posting in the right place? Check out Where do I post X? to know for sure.
• Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
• Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
• Want more info? How to link or or How to display code and escape characters are good places to start.