Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
Yes, after I came up with my algorithm, I realized what all the output from GrandFather's code meant. I had thought it was just some sort of cryptic progress meter. :-)

The (reasonably) obvious way to get the longest substring for each pair of input strings would be to run my algorithm using each pair of strings as input rather than the whole list of strings. That's probably more work than GF's method, though. I thought about trying it, but something shiny caught my attention...

Update: but now I've done it. It runs on 20 strings of 1000 characters in something under 10 seconds for me. 100 strings of 1000 characters takes about 4 minutes.

use warnings; use strict; use Time::HiRes; if (@ARGV == 0) { print "Finds longest matching substring between each pair of a set + of test\n"; print "strings in the given file. Pairs of lines are expected with + the first\n"; print "of a pair being the string name and the second the test str +ing."; exit (1); } my $minmatch = 10; my $startTime = [Time::HiRes::gettimeofday ()]; my %strings; while (<>) { chomp(my $label = $_); chomp(my $string = <>); # Compute all substrings @{$strings{$label}} = map [substr($string, $_), $label, $_], 0..(len +gth($string) - $minmatch); } print "Loaded. Generating combos...\n"; my @keys = sort keys %strings; my @best_overall_match = (0); for my $ki1 (0..($#keys - 1)) { for my $ki2 (($ki1 + 1)..$#keys) { my @strings = sort {$a->[0] cmp $b->[0]} @{$strings{$keys[$ki1]}}, + @{$strings{$keys[$ki2]}}; # Now walk through the list. The best match for each string will b +e the # previous or next element in the list that is not from the origin +al substring, # so for each entry, just look for the next one. See how many init +ial letters # match and track the best matches my @matchdata = (0); # (length, index1-into-strings, index2-into-s +trings) for my $i1 (0..($#strings - 1)) { my $i2 = $i1 + 1; ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i +1][1]; next if $i2 > $#strings; my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) + =~ /^(\0*)/; next if $common < $minmatch; if ($common > $matchdata[0]) { @matchdata = ($common, [$i1, $i2]); } elsif ($common == $matchdata[0]) { push @matchdata, [$i1, $i2]; } } next if $matchdata[0] < $minmatch; if ($matchdata[0] > $best_overall_match[0]) { @best_overall_match = ($matchdata[0]); } if ($matchdata[0] >= $best_overall_match[0]) { push @best_overall_match, map { ["$strings[$_->[0]][1]:$strings[$_->[0]][2]", "$strings[$_->[1 +]][1]:$strings[$_->[1]][2]"] } @matchdata[1..$#matchdata]; } print "$keys[$ki1] and $keys[$ki2]: $matchdata[0] chars\n"; for my $i (@matchdata[1..$#matchdata]) { if ($strings[$i->[0]][1] eq $keys[$ki2]) { @{$i}[0,1] = @{$i}[1,0]; } print "... starting at $strings[$i->[0]][2] and $strings[$i->[1] +][2], respectively.\n"; } } } print "Best overall match: $best_overall_match[0] chars\n"; print "$_->[0] and $_->[1]\n" for (@best_overall_match[1..$#best_o +verall_match]) ; print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";

Caution: Contents may have been coded under pressure.

In reply to Re^5: Fast common substring matching by Roy Johnson
in thread Fast common substring matching by GrandFather

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others contemplating the Monastery: (6)
    As of 2014-12-29 11:25 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      Is guessing a good strategy for surviving in the IT business?





      Results (186 votes), past polls