Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: finding longest common substring

by tilly (Archbishop)
on Nov 20, 2003 at 03:09 UTC ( #308474=note: print w/replies, xml ) Need Help??


in reply to finding longest common substring

The following solution has a lot of overhead, but the algorithm should be more scalable than yours. The pathological case that it sucks at is multiple large strings with one character repeated many, many times. But on real text it should fairly quickly narrow down to just working on text that is repeated at least once per string.

I'm sure that this approach could be made much faster and clearer.

#!/usr/bin/perl use warnings; use strict; use Data::Dumper; for ([ qw(fooabc123 fooabc321 foobca232) ], [ qw(abcfoo123 bcafoo321 foo123abc) ], [ qw(foo bor boz bzo) ]) { print Dumper($_); print find_lcs(@{ $_ }), "\n"; print "---\n"; } sub find_lcs { my @index_info; foreach (@_) { my $string = $_; push @index_info, [ \$string, [ 0..length($string) ], ]; } my ($length, @pos) = _find_lcs(0, @index_info); return $length ? substr(${$index_info[0]->[0]}, $pos[0], $length) : undef; } # Explanation of the datastructure. @index_info is an # array of refs like [$string_ref, [@positions]] where # $string_ref is a reference to a string (so copying # it around doesn't mean copying the string) and # @positions are the starting positions in the string # which so far have a common start. # # We will add a third element which is a hash of next # chars pointing to which positions follow that common # path. # # The eventual return will be the length farther that # we follow, followed by an array of starting positions, # one from each string, that goes to that depth. sub _find_lcs { my ($offset, @index_info) = @_; my $last_chars; foreach my $data (@index_info) { my %chars; foreach my $pos (@{$data->[1]}) { my $char = substr(${$data->[0]}, $pos + $offset, 1); # Filter out end of string and ones that can't match next unless length($char); next if $last_chars and not exists $last_chars->{$char}; push @{$chars{$char}}, $pos; } # EXIT: Break out early if we have run out of common chars return 0 unless %chars; $last_chars = $data->[2] = \%chars; } my $best_length = 0; my @best_pos; foreach my $char (keys %$last_chars) { my @index_info_for_char = map { [ $_->[0], $_->[2]->{$char} ] } @index_info; my ($length, @pos) = _find_lcs($offset + 1, @index_info_for_char); # I'm only interested if this is an improvement next if $length < $best_length; $best_length = $length + 1; if (0 == $length++) { @best_pos = map {$_->[1]->[0]} @index_info_for_char; } else { @best_pos = @pos; } } return $best_length, @best_pos; }

Replies are listed 'Best First'.
Re: Re: finding longest common substring
by BrowserUk (Pope) on Nov 20, 2003 at 05:39 UTC

    This is an interesting problem and that's an intruiging algorithm. It would be really nice to see these all implemented in C and compared. I suspect that yours would fair much better.

    Which set of data you consider to be more realistic, will determine which algorithm/implementation is better suited to your application I guess. It's not very often that the choice of best algorithm varies so wildly with the input data.

    fooabc123 fooabc321 foobca232 Rate Tilly revdiablo CombatSqu BrowserUk Tilly 89.6/s -- -45% -85% -85% revdiablo 163/s 81% -- -72% -72% CombatSqu 581/s 549% 257% -- -1% BrowserUk 587/s 554% 261% 1% -- abcfoo123 bcafoo321 foo123abc Rate Tilly revdiablo CombatSqu BrowserUk Tilly 91.7/s -- -37% -82% -84% revdiablo 145/s 58% -- -72% -74% CombatSqu 514/s 460% 255% -- -9% BrowserUk 563/s 514% 289% 10% -- foo bor boz bzo Rate Tilly revdiablo BrowserUk CombatSqu Tilly 408/s -- -43% -59% -82% revdiablo 719/s 76% -- -28% -68% BrowserUk 995/s 144% 38% -- -56% CombatSqu 2236/s 449% 211% 125% -- The quick brown fox jump over the lazy dog The quick brown fox jumps over the lazy jumps over the lazy dog The quick brown fox quick brown fox jumps over the lazy dog Rate Tilly revdiablo CombatSqu BrowserUk Tilly 3.59/s -- -59% -89% -95% revdiablo 8.75/s 143% -- -74% -87% CombatSqu 33.4/s 829% 282% -- -51% BrowserUk 68.6/s 1807% 683% 105% -- The quick brown fox jump over the lazy dog The quick brown fox jumps over the lazy jumps over the lazy dog The quick brown fox quick brown fox jumps over the lazy dog x Rate revdiablo CombatSqu BrowserUk Tilly revdiablo 3.79/s -- -71% -91% -95% CombatSqu 12.9/s 240% -- -69% -85% BrowserUk 41.0/s 984% 219% -- -51% Tilly 83.2/s 2098% 546% 103% -- The quick brown fox jump over the lazy dog xThe quick brown fox jump over the lazy dog xxThe quick brown fox jump over the lazy dog xxxThe quick brown fox jump over the lazy dog xxxxThe quick brown fox jump over the lazy dog Rate Tilly BrowserUk revdiablo CombatSqu Tilly 0.761/s -- -98% -100% -100% BrowserUk 44.2/s 5714% -- -99% -99% revdiablo 4728/s 621405% 10589% -- -25% CombatSqu 6305/s 828641% 14153% 33% --

    Benchmark


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!
    Wanted!

      I guess that I had somewhat more overhead than I realized. I'll think about whether I can improve on that. :-(

      Incidentally my solution compares much better than others if you make the input strings much longer than the common substrings. For instance if the common match is "The quick brown fox" and the strings are each a few hundred characters, I win by a wide margin.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://308474]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (3)
As of 2020-11-30 05:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?