Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight

Re^3: Fast common substring matching

by Roy Johnson (Monsignor)
on Nov 14, 2005 at 21:42 UTC ( #508420=note: print w/ replies, xml ) Need Help??

in reply to Re^2: Fast common substring matching
in thread Fast common substring matching

I came up with an algorithm inspired by bzip's algorithm of generating all substrings and then sorting them. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. 30 strings of 3000 chars runs in 20.3 seconds on mine; scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, if it had plenty of memory (there will be 900,000 strings averaging 1500 chars in length).

Give it a whirl.

use warnings; use strict; use Time::HiRes; if (@ARGV == 0) { print "Finds longest matching substring between any pair of test s +trings\n"; print "in the given file. Pairs of lines are expected with the fir +st of a\n"; print "pair being the string name and the second the test string." +; exit (1); } my $minmatch = 4; my $startTime = [Time::HiRes::gettimeofday ()]; my @strings; while (<>) { chomp(my $label = $_); chomp(my $string = <>); # Compute all substrings push @strings, map [substr($string, $_), $label, $_], 0..(length($st +ring) - $minmatch); } print "Loaded. Sorting...\n"; @strings = sort {$a->[0] cmp $b->[0]} @strings; print "Sorted. Finding matches...\n"; # Now walk through the list. The best match for each string will be th +e # previous or next element in the list that is not from the original s +ubstring, # so for each entry, just look for the next one. See how many initial +letters # match and track the best matches my @matchdata = (0); # (length, index1-into-strings, index2-into-strin +gs) for my $i1 (0..($#strings - 1)) { my $i2 = $i1 + 1; ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1 +]; next if $i2 > $#strings; my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ +/^(\0*)/; if ($common > $matchdata[0]) { @matchdata = ($common, [$i1, $i2]); } elsif ($common == $matchdata[0]) { push @matchdata, [$i1, $i2]; } } print "Best match: $matchdata[0] chars\n"; for my $i (@matchdata[1..$#matchdata]) { print "$strings[$i->[0]][1] starting at $strings[$i->[0]][2]" . " and $strings[$i->[1]][1] starting at $strings[$i->[1]][2]\n"; } print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";
A test-data generating program follows
use warnings; use strict; my ($howmany, $howlong) = (20, 1000); # Generate $howmany strings of $howlong characters for my $s (1..$howmany) { print "'String $s'\n"; my $str = ''; $str .= (qw(A C G T))[rand 4] for 1..$howlong; print "$str\n"; }

Caution: Contents may have been coded under pressure.

Comment on Re^3: Fast common substring matching
Select or Download Code
Re^4: Fast common substring matching
by bioMan (Beadle) on Nov 29, 2005 at 16:53 UTC


    There is one difference between your algorithm and Grandfather's. His code returns the longest substring for each pair of input strings.

    With my original data set your code returns one substring. Grandfather's code returned over three thousand (where $minmatch = 256). On the other hand your code finds multiple occurrences of the longest common substrings, if they all have the same length, which I like.


      Yes, after I came up with my algorithm, I realized what all the output from GrandFather's code meant. I had thought it was just some sort of cryptic progress meter. :-)

      The (reasonably) obvious way to get the longest substring for each pair of input strings would be to run my algorithm using each pair of strings as input rather than the whole list of strings. That's probably more work than GF's method, though. I thought about trying it, but something shiny caught my attention...

      Update: but now I've done it. It runs on 20 strings of 1000 characters in something under 10 seconds for me. 100 strings of 1000 characters takes about 4 minutes.

      Caution: Contents may have been coded under pressure.

        I had thought it was just some sort of cryptic progress meter. :-)

        LOL - I know what you mean.

        I'm still going over your original code to see how you did what you did -- trying to learn some perl :-)

        I'll give the new code a try. I also see that the minimum length in your code doesn't have to be a power of 2. This should allow me to analyze a limit boundary that appears to be present in my data. Grandfather's code allowed me to come up with what I feel is a pretty good estimate for the value of the limit, but this should allow a closer examination of the limit.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://508420]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (5)
As of 2014-07-31 04:20 GMT
Find Nodes?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:

    Results (244 votes), past polls