PerlMonks  
I came up with an algorithm inspired by bzip's approach: generate all the trailing substrings (suffixes) of each string, then sort them. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. On 30 strings of 3000 chars mine runs in 20.3 seconds. Scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, given plenty of memory (the list to sort will hold about 900,000 strings averaging 1500 chars in length).
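To put a number on "plenty of memory", here is a back-of-the-envelope sketch. The helper name `suffix_totals` is made up for illustration, and it counts only the characters held in the suffix copies; whatever Perl adds per scalar and per array is extra and not estimated here. The cutoff of 4 mirrors `$minmatch` in the program.

```perl
use strict;
use warnings;

# Rough size of the suffix list for $count strings of $length characters,
# keeping only suffixes of at least $minmatch characters.
# (Hypothetical helper, not part of the program in this post.)
sub suffix_totals {
    my ($count, $length, $minmatch) = @_;
    my $per_string = $length - $minmatch + 1;    # suffixes kept per string
    # Suffix lengths run from $minmatch up to $length; sum that series.
    my $chars_each = ($length + $minmatch) * $per_string / 2;
    return ($count * $per_string, $count * $chars_each);
}

my ($suffixes, $chars) = suffix_totals(300, 3000, 4);
printf "%d suffixes, %d characters of suffix data\n", $suffixes, $chars;
```

For the OP's case that comes to 899,100 suffixes and roughly 1.35 billion characters, which matches the "900,000 strings averaging 1500 chars" figure.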

Give it a whirl.

    use warnings;
    use strict;
    use Time::HiRes;

    if (@ARGV == 0) {
        print "Finds longest matching substring between any pair of test strings\n";
        print "in the given file. Pairs of lines are expected with the first of a\n";
        print "pair being the string name and the second the test string.\n";
        exit (1);
    }

    my $minmatch = 4;
    my $startTime = [Time::HiRes::gettimeofday ()];
    my @strings;
    while (<>) {
        chomp(my $label = $_);
        chomp(my $string = <>);
        # Compute all suffixes of at least $minmatch chars.
        # Each entry is [suffix, label, start offset].
        push @strings,
            map [substr($string, $_), $label, $_],
            0..(length($string) - $minmatch);
    }
    print "Loaded. Sorting...\n";
    @strings = sort {$a->[0] cmp $b->[0]} @strings;
    print "Sorted. Finding matches...\n";

    # Now walk through the list. The best match for each string will be the
    # previous or next element in the list that is not from the same source
    # string, so for each entry, just look for the next one. See how many
    # initial letters match and track the best matches.
    my @matchdata = (0); # (length, [index1-into-strings, index2-into-strings], ...)
    for my $i1 (0..($#strings - 1)) {
        my $i2 = $i1 + 1;
        ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
        next if $i2 > $#strings;
        my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
        if ($common > $matchdata[0]) {
            @matchdata = ($common, [$i1, $i2]);
        }
        elsif ($common == $matchdata[0]) {
            push @matchdata, [$i1, $i2];
        }
    }

    print "Best match: $matchdata[0] chars\n";
    for my $i (@matchdata[1..$#matchdata]) {
        print "$strings[$i->[0]][1] starting at $strings[$i->[0]][2]"
            . " and $strings[$i->[1]][1] starting at $strings[$i->[1]][2]\n";
    }
    print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";
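The least obvious step in the program is the XOR-and-regex line that computes `$common`, so here is that piece pulled out on its own. The sub name `common_prefix_length` is mine, for illustration only. It relies on `^` doing a bytewise XOR when both operands are strings, which is the default behaviour; I believe that under the `bitwise` feature you would need `^.` instead.

```perl
use strict;
use warnings;

# Length of the shared leading run of two strings. XOR-ing two strings
# byte by byte gives "\0" wherever the bytes are equal, so the run of
# leading NULs is exactly as long as the common prefix.
# Assumes the 'bitwise' feature is not enabled and that the inputs
# contain no NUL bytes themselves.
sub common_prefix_length {
    my ($x, $y) = @_;
    my ($nuls) = ($x ^ $y) =~ /^(\0*)/;
    return length $nuls;
}

print common_prefix_length('GATTACA', 'GATTTT'), "\n";   # 4
print common_prefix_length('ACGT', 'TGCA'), "\n";        # 0
```

Because the sorted list puts suffixes with long shared prefixes next to each other, this one comparison per neighbouring pair is all the matching loop needs.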
A test-data generating program follows:

    use warnings;
    use strict;

    my ($howmany, $howlong) = (20, 1000);

    # Generate $howmany strings of $howlong characters
    for my $s (1..$howmany) {
        print "'String $s'\n";
        my $str = '';
        $str .= (qw(A C G T))[rand 4] for 1..$howlong;
        print "$str\n";
    }

Caution: Contents may have been coded under pressure.

In reply to Re^3: Fast common substring matching by Roy Johnson
in thread Fast common substring matching by GrandFather
