<?xml version="1.0" encoding="windows-1252"?>
<node id="508420" title="Re^3: Fast common substring matching" created="2005-11-14 16:42:55" updated="2005-11-14 11:42:55">
<type id="11">
note</type>
<author id="300037">
Roy Johnson</author>
<data>
<field name="doctext">
I came up with an algorithm inspired by the way bzip sorts every rotation of its input: generate every suffix of every string, then sort them all, so that suffixes sharing a long prefix end up close together. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. With 30 strings of 3000 chars, mine runs in 20.3 seconds. Scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, given plenty of memory (there will be about 900,000 suffixes averaging 1500 chars in length).
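&lt;p&gt;
As a rough sanity check on that memory figure, here is a back-of-envelope sketch. It counts character data only; Perl's per-scalar and per-array overhead is not included and would come on top. The variable names are just for illustration.
&lt;c&gt;
use strict;
use warnings;

# Assumed workload: the OP's 300 strings of 3000 chars, minimum match of 4
my ($count, $len, $minmatch) = (300, 3000, 4);

# One suffix per start offset 0 .. ($len - $minmatch)
my $suffixes = $count * ($len - $minmatch + 1);

# Suffix lengths run from $len down to $minmatch, so take the midpoint
my $avg = ($len + $minmatch) / 2;

my $bytes = $suffixes * $avg;
printf "%d suffixes, about %.2f GB of character data\n", $suffixes, $bytes / 1e9;
# prints: 899100 suffixes, about 1.35 GB of character data
&lt;/c&gt;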
&lt;p&gt;
Give it a whirl.
&lt;readmore&gt;
&lt;c&gt;
use warnings;
use strict;
use Time::HiRes;

if (@ARGV == 0) {
    print "Finds longest matching substring between any pair of test strings\n";
    print "in the given file. Pairs of lines are expected with the first of a\n";
    print "pair being the string name and the second the test string.\n";
    exit (1);
}

my $minmatch = 4;

my $startTime = [Time::HiRes::gettimeofday ()];

my @strings;
while (&lt;&gt;) {
  chomp(my $label = $_);
  chomp(my $string = &lt;&gt;);
  # Store every suffix at least $minmatch chars long, as [suffix, label, offset]
  push @strings, map [substr($string, $_), $label, $_], 0..(length($string) - $minmatch);
}

print "Loaded. Sorting...\n";

@strings = sort {$a-&gt;[0] cmp $b-&gt;[0]} @strings;

print "Sorted. Finding matches...\n";

# Now walk through the sorted list. The best match for each suffix will be the
# nearest previous or next element that comes from a different source string,
# so for each entry, just look ahead for the next such element. Count how many
# leading characters match and track the best matches.
my @matchdata = (0); # (best length, then one [index1, index2] pair into @strings per match)
for my $i1 (0..($#strings - 1)) {
  my $i2 = $i1 + 1;
  ++$i2 while $i2 &lt;= $#strings and $strings[$i2][1] eq $strings[$i1][1];
  next if $i2 &gt; $#strings;
  my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
  if ($common &gt; $matchdata[0]) {
    @matchdata = ($common, [$i1, $i2]);
  }
  elsif ($common == $matchdata[0]) {
    push @matchdata, [$i1, $i2];
  }
}

print "Best match: $matchdata[0] chars\n";
for my $i (@matchdata[1..$#matchdata]) {
  print "$strings[$i-&gt;[0]][1] starting at $strings[$i-&gt;[0]][2]"
    . " and $strings[$i-&gt;[1]][1] starting at $strings[$i-&gt;[1]][2]\n";
}

print "Completed in " . Time::HiRes::tv_interval ($startTime) . " seconds\n";
&lt;/c&gt;
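&lt;p&gt;
The XOR line that counts matching leading characters may look cryptic, so here it is in isolation as a minimal sketch. The helper name common_prefix_len is made up for illustration. It assumes that ^ acts as a string XOR, which is the case for plain string operands when the bitwise feature is not enabled; if your code turns that feature on (for example through a recent "use v5.28"-style declaration), the string form of the operator is ^. instead.
&lt;c&gt;
use strict;
use warnings;

# Length of the common prefix of two strings, via bitwise string XOR.
sub common_prefix_len {
    my ($s, $t) = @_;
    # Equal bytes XOR to "\0", so the run of leading NULs is the shared prefix
    my ($nuls) = ($s ^ $t) =~ /^(\0*)/;
    return length $nuls;
}

print common_prefix_len('GATTACA', 'GATTTCA'), "\n";   # prints 4
print common_prefix_len('ACGT', 'TGCA'), "\n";         # prints 0
&lt;/c&gt;
This avoids a character-by-character loop in Perl; the comparison happens inside the XOR and the regex match.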
&lt;p&gt;A test-data generator follows:
&lt;c&gt;
use warnings;
use strict;

my ($howmany, $howlong) = (20, 1000);
# Generate $howmany strings of $howlong characters
for my $s (1..$howmany) {
  print "'String $s'\n";
  my $str = '';
  $str .= (qw(A C G T))[rand 4] for 1..$howlong;
  print "$str\n";
}
&lt;/c&gt;
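&lt;p&gt;
When experimenting with small generated inputs, a brute-force check is one way to confirm the best-match length for a single pair of strings. This is only a sketch: the helper name longest_common_len is hypothetical, and the approach is far too slow for the real data.
&lt;c&gt;
use strict;
use warnings;

# Brute force: try every substring of $s, longest first, and look for it in $t.
sub longest_common_len {
    my ($s, $t) = @_;
    for my $len (reverse 1 .. length $s) {
        for my $off (0 .. length($s) - $len) {
            # index returns -1 when the substring does not occur in $t
            return $len if index($t, substr($s, $off, $len)) != -1;
        }
    }
    return 0;
}

print longest_common_len('BANANA', 'CABANA'), "\n";   # prints 4 ("BANA")
&lt;/c&gt;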
&lt;/readmore&gt;
&lt;div class="pmsig"&gt;&lt;div class="pmsig-300037"&gt;
&lt;hr&gt;
&lt;small&gt;&lt;b&gt;Caution:&lt;/b&gt; Contents may have been coded under pressure.&lt;/small&gt;
&lt;/div&gt;&lt;/div&gt;</field>
<field name="root_node">
485464</field>
<field name="parent_node">
492993</field>
</data>
</node>
