If I understand correctly, you're interested in long substrings repeated numerous times. Here's a roughed-out approach based roughly on my Re^3: Fast common substring matching, which is, in turn, based roughly on the bzip algorithm.

For purposes of sanity, take a stab at some length of substring that you simply don't expect to be repeated. Maybe somewhere in the 100 to 1000 range. Make a file of all substrings of that length. (Unix) sort the lines of the file.

Now you just need to look at each line's neighbors to see how long their common prefix is. The longest repeated substring is guaranteed to be a common prefix of two neighboring lines.

To find how many times a substring is repeated, you'll have to keep checking subsequent lines until they don't have enough prefix in common to be interesting.

I haven't written any code for this, and probably won't. But I hope the description is helpful.

Update: I wrote some example code for finding the longest repeated substring. More: I enhanced it to generate a long string (60000 digits) and to limit the substring length, and to time itself. It runs in under 2 seconds on my PC.

use strict; use warnings; use Time::HiRes; my $startTime = [Time::HiRes::gettimeofday ()]; my $str = ''; $str .= int rand 10 for 1..60000; my @subs = sort map substr($str, $_, 400), 0..length($str)-1; # Find the longest common prefix between any two neighbors my $length = 0; my $value = ''; for my $i (0..$#subs-1) { my ($common) = map length, ($subs[$i] ^ $subs[$i+1]) =~ /^(\0*)/; if ($common > $length) { ($length, $value) = ($common, substr($subs[$i], 0, $common)); } } print "$value is the longest repeated substr\n"; print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";

Caution: Contents may have been coded under pressure.

In reply to Re: Longest repeated string... by Roy Johnson
in thread Longest repeated string... by Yzzyx

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":