PerlMonks

### Re^5: Fast common substring matching

by Roy Johnson (Monsignor) on Nov 29, 2005 at 17:08 UTC (#512710)

in reply to Re^4: Fast common substring matching
in thread Fast common substring matching

Yes, after I came up with my algorithm, I realized what all the output from GrandFather's code meant. I had thought it was just some sort of cryptic progress meter. :-)

The (reasonably) obvious way to get the longest substring for each pair of input strings would be to run my algorithm using each pair of strings as input rather than the whole list of strings. That's probably more work than GF's method, though. I thought about trying it, but something shiny caught my attention...

Update: but now I've done it. It runs on 20 strings of 1000 characters in something under 10 seconds for me. 100 strings of 1000 characters take about 4 minutes.

```perl
use warnings;
use strict;
use Time::HiRes;

if (@ARGV == 0) {
    print "Finds longest matching substring between each pair of a set of test\n";
    print "strings in the given file. Pairs of lines are expected with the first\n";
    print "of a pair being the string name and the second the test string.\n";
    exit (1);
}

my $minmatch = 10;

my $startTime = [Time::HiRes::gettimeofday ()];

my %strings;
while (<>) {
    chomp(my $label = $_);
    chomp(my $string = <>);
    # Compute all suffixes of at least $minmatch characters,
    # stored as [suffix, label, offset]
    @{$strings{$label}} = map [substr($string, $_), $label, $_], 0..(length($string) - $minmatch);
}

my @keys = sort keys %strings;
my @best_overall_match = (0);
for my $ki1 (0..($#keys - 1)) {
    for my $ki2 (($ki1 + 1)..$#keys) {

        # Merge the suffixes of both strings into one sorted list
        my @strings = sort {$a->[0] cmp $b->[0]} @{$strings{$keys[$ki1]}}, @{$strings{$keys[$ki2]}};

        # Now walk through the list. The best match for each string will be the
        # previous or next element in the list that is not from the original string,
        # so for each entry, just look for the next one. See how many initial letters
        # match and track the best matches
        my @matchdata = (0); # (length, [index1-into-strings, index2-into-strings], ...)
        for my $i1 (0..($#strings - 1)) {
            my $i2 = $i1 + 1;
            ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
            next if $i2 > $#strings;
            # XOR the two suffixes; leading NULs mark the common prefix
            my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
            next if $common < $minmatch;
            if ($common > $matchdata[0]) {
                @matchdata = ($common, [$i1, $i2]);
            }
            elsif ($common == $matchdata[0]) {
                push @matchdata, [$i1, $i2];
            }
        }

        next if $matchdata[0] < $minmatch;

        if ($matchdata[0] > $best_overall_match[0]) {
            @best_overall_match = ($matchdata[0]);
        }
        if ($matchdata[0] >= $best_overall_match[0]) {
            push @best_overall_match, map {
                ["$strings[$_->[0]][1]:$strings[$_->[0]][2]", "$strings[$_->[1]][1]:$strings[$_->[1]][2]"]
            } @matchdata[1..$#matchdata];
        }

        print "$keys[$ki1] and $keys[$ki2]: $matchdata[0] chars\n";
        for my $i (@matchdata[1..$#matchdata]) {
            # Report the offset into the first-named string first
            if ($strings[$i->[0]][1] eq $keys[$ki2]) {
                @{$i}[0,1] = @{$i}[1,0];
            }
            print "... starting at $strings[$i->[0]][2] and $strings[$i->[1]][2], respectively.\n";
        }
    }
}

print "Best overall match: $best_overall_match[0] chars\n";
print "$_->[0] and $_->[1]\n" for (@best_overall_match[1..$#best_overall_match]) ;

print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";
```

Caution: Contents may have been coded under pressure.
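The prefix comparison in the code above relies on Perl's string XOR: bytes that match XOR to NUL, so the run of leading NULs is the length of the common prefix. A minimal standalone sketch of just that step follows. The helper name `common_prefix_len` is made up for illustration, and the sketch assumes plain byte strings with no embedded NUL characters and that the `bitwise` feature (which makes `^` numeric) is not enabled.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the common prefix of two byte strings.
# Matching bytes XOR to "\0", so we count the leading NULs of the result.
sub common_prefix_len {
    my ($s, $t) = @_;
    my ($nuls) = ($s ^ $t) =~ /^(\0*)/;
    return length $nuls;
}

print common_prefix_len("GATTACA", "GATTTCA"), "\n";  # 4
print common_prefix_len("abc", "xyz"), "\n";          # 0
print common_prefix_len("abc", "abcdef"), "\n";       # 3
```

The shorter operand is treated as if padded with NUL bytes, which is why a string that is a prefix of the other still gives the right count as long as neither string contains real NUL characters.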

Re^6: Fast common substring matching
by bioMan (Beadle) on Nov 29, 2005 at 22:37 UTC

I had thought it was just some sort of cryptic progress meter. :-)

LOL - I know what you mean.

I'm still going over your original code to see how you did what you did -- trying to learn some Perl :-)

I'll give the new code a try. I also see that the minimum length in your code doesn't have to be a power of 2. This should allow me to analyze a limit boundary that appears to be present in my data. GrandFather's code allowed me to come up with what I feel is a pretty good estimate for the value of the limit, but this should allow a closer examination of the limit.

Mike

Actually as far as I can remember my code doesn't require a power of 2 for the minimum size either. It may have been more important in earlier versions than in the current version.

Somewhere on my todo list is an item to look at Roy's code, but I've not got down to that item on the list yet. :)

DWIM is Perl's answer to Gödel

Thanks for the clarification. For some reason I got it in my head that the minimum length of the substring had to be a power of 2. That idea must have come from someone else's algorithm for the longest common string search.

Nonetheless, your script has been very useful to me.

Mike

Re^6: Fast common substring matching
by marioroy (Deacon) on Feb 18, 2016 at 00:01 UTC

Update: On Windows, it is important to start the shared-manager process immediately if the shared variable is constructed after the data is loaded. Unix platforms benefit from copy-on-write, which is great.

```perl
...

use MCE::Hobo;
use MCE::Shared;

# For minimum memory consumption, start the shared-manager process before
# loading data.

MCE::Shared->start();   # <-- important on Windows

my $minmatch = 4;

my $startTime = [Time::HiRes::gettimeofday ()];

my %strings;
while (<>) {
    chomp(my $label = $_);
    chomp(my $string = <>);
    # Compute all suffixes of at least $minmatch characters
    @{$strings{$label}} = map [substr($string, $_), $label, $_], 0..(length($string) - $minmatch);
}

my @keys = sort keys %strings;

my $sequence = MCE::Shared->sequence(
    { chunk_size => 1, bounds_only => 1 }, 0, $#keys - 1
);

...
```

Hello Roy Johnson,

I am fascinated by the various examples posted here, here, and also the Inline C demonstration.

Your 2nd demonstration scales wonderfully on multiple cores after loading the strings hash. For testing, I made a file containing 48 sequences. The serial and parallel code complete in 22.6 seconds and 6.1 seconds respectively. My laptop has 4 real cores plus 4 hyper-threads.
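For a rough sense of how that pairwise work divides across the outer loop, the sketch below counts how many inner-loop pairs each outer index `$ki1` handles. The function name `pairs_per_outer_index` is hypothetical, and the key count of 48 is simply taken from the test file mentioned above. The per-index counts are uneven, which may be one reason that handing out one index at a time spreads the load reasonably well; that is a reading of the numbers, not something measured here.

```perl
use strict;
use warnings;

# Hypothetical helper: number of ($ki1, $ki2) pairs handled for each
# outer index $ki1 when the inner loop runs from $ki1 + 1 to the last key.
sub pairs_per_outer_index {
    my ($nkeys) = @_;
    return map { $nkeys - 1 - $_ } 0 .. $nkeys - 2;
}

my @work  = pairs_per_outer_index(48);
my $total = 0;
$total += $_ for @work;

printf "first outer index: %d pairs, last: %d pair, total: %d pairs\n",
    $work[0], $work[-1], $total;
```

With 48 keys the first index covers 47 pairs and the last covers 1, for 1128 pairs in total.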

First, the construction for MCE::Hobo. This requires the 1.699_011 dev release or later, or the final MCE 1.7 release once it is out.

```perl
...

my @keys = sort keys %strings;

# Now walk through the list. The best match for each string will be the
# previous or next element in the list that is not from the original substring,
# so for each entry, just look for the next one. See how many initial letters
# match and track the best matches

use MCE::Hobo;
use MCE::Shared;

my $sequence = MCE::Shared->sequence(
    { chunk_size => 1, bounds_only => 1 }, 0, $#keys - 1
);

sub walk_list {
    my @best_overall_match = (0);

    # $beg and $end have the same values when chunk_size => 1

    while ( my ( $beg, $end ) = $sequence->next ) {
        for my $ki1 ( $beg .. $end ) {
            for my $ki2 (($ki1 + 1)..$#keys) {

                my @strings = sort {$a->[0] cmp $b->[0]} @{$strings{$keys[$ki1]}}, @{$strings{$keys[$ki2]}};

                my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

                for my $i1 (0..($#strings - 1)) {
                    my $i2 = $i1 + 1;
                    ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
                    next if $i2 > $#strings;
                    my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
                    next if $common < $minmatch;
                    if ($common > $matchdata[0]) {
                        @matchdata = ($common, [$i1, $i2]);
                    }
                    elsif ($common == $matchdata[0]) {
                        push @matchdata, [$i1, $i2];
                    }
                }

                next if $matchdata[0] < $minmatch;

                if ($matchdata[0] > $best_overall_match[0]) {
                    @best_overall_match = ($matchdata[0]);
                }
                if ($matchdata[0] >= $best_overall_match[0]) {
                    push @best_overall_match, map {
                        ["$strings[$_->[0]][1]:$strings[$_->[0]][2]", "$strings[$_->[1]][1]:$strings[$_->[1]][2]"]
                    } @matchdata[1..$#matchdata];
                }

            } # $ki2
        } # $ki1
    }

    return @best_overall_match;
}

MCE::Hobo->create( \&walk_list ) for 1 .. 8;

my @best_overall_match = (0);

for my $hobo ( MCE::Hobo->list ) {
    my @ret = $hobo->join;
    if ( $ret[0] > $best_overall_match[0] ) {
        @best_overall_match = @ret;
    }
    elsif ( $ret[0] == $best_overall_match[0] ) {
        shift @ret;
        push  @best_overall_match, @ret;
    }
}

print "Best overall match: $best_overall_match[0] chars\n";

...
```

MCE::Loop is next and does the same thing.

```perl
...

my @keys = sort keys %strings;

# Now walk through the list. The best match for each string will be the
# previous or next element in the list that is not from the original substring,
# so for each entry, just look for the next one. See how many initial letters
# match and track the best matches

use MCE::Loop;

MCE::Loop::init(
    max_workers => 8,
    chunk_size  => 1,
    bounds_only => 1,
);

my @ret = mce_loop_s {
    my ( $mce, $seq, $chunk_id ) = @_;
    my @best_overall_match = (0);

    # $seq->[0] and $seq->[1] have the same values when chunk_size => 1

    for my $ki1 ( $seq->[0] .. $seq->[1] ) {
        for my $ki2 (($ki1 + 1)..$#keys) {

            my @strings = sort {$a->[0] cmp $b->[0]} @{$strings{$keys[$ki1]}}, @{$strings{$keys[$ki2]}};

            my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

            for my $i1 (0..($#strings - 1)) {
                my $i2 = $i1 + 1;
                ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
                next if $i2 > $#strings;
                my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
                next if $common < $minmatch;
                if ($common > $matchdata[0]) {
                    @matchdata = ($common, [$i1, $i2]);
                }
                elsif ($common == $matchdata[0]) {
                    push @matchdata, [$i1, $i2];
                }
            }

            next if $matchdata[0] < $minmatch;

            if ($matchdata[0] > $best_overall_match[0]) {
                @best_overall_match = ($matchdata[0]);
            }
            if ($matchdata[0] >= $best_overall_match[0]) {
                push @best_overall_match, map {
                    ["$strings[$_->[0]][1]:$strings[$_->[0]][2]", "$strings[$_->[1]][1]:$strings[$_->[1]][2]"]
                } @matchdata[1..$#matchdata];
            }

        } # $ki2
    } # $ki1

    MCE->gather(\@best_overall_match);

} 0, $#keys - 1;

MCE::Loop::finish;

my @best_overall_match = (0);

for my $i ( 0 .. $#ret ) {
    if ($ret[$i]->[0] > $best_overall_match[0]) {
        @best_overall_match = @{ $ret[$i] };
    }
    elsif ( $ret[$i]->[0] == $best_overall_match[0] ) {
        shift @{ $ret[$i] };
        push  @best_overall_match, @{ $ret[$i] };
    }
}

print "Best overall match: $best_overall_match[0] chars\n";

...
```
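Both variants end with the same reduction: each worker hands back a list of the form (length, [label1, label2], ...), and only the matches sharing the greatest length are kept. Below is a minimal sketch of that merge step on its own, without MCE, so it can be tried in isolation. The function name `merge_best` and the sample labels are made up for illustration. Unlike the loops above, it does not modify the incoming results.

```perl
use strict;
use warnings;

# Hypothetical helper: merge per-worker results, each an array ref of
# (length, [label1, label2], ...), keeping only the longest matches.
sub merge_best {
    my @best = (0);
    for my $result (@_) {
        my ($len, @matches) = @$result;
        if ($len > $best[0]) {
            @best = ($len, @matches);      # new longest: start over
        }
        elsif ($len == $best[0]) {
            push @best, @matches;          # tie: keep both sets
        }
    }
    return @best;
}

# Made-up sample results from four workers
my @merged = merge_best(
    [ 12, [ 'A:3', 'B:40' ] ],
    [ 0 ],
    [ 15, [ 'A:7', 'C:1'  ] ],
    [ 15, [ 'B:9', 'D:22' ] ],
);

print "Best overall match: $merged[0] chars\n";
print "$_->[0] and $_->[1]\n" for @merged[1 .. $#merged];
```

With the sample data, the two results of length 15 are both kept and the shorter ones are dropped.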

This has been a lot of fun. I learned some more Perl from it all.

Regards, Mario
