Challenge: Fast Common Substrings

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Challenge: Fast Common Substrings by blokhead (Monsignor) on Apr 04, 2007 at 02:04 UTC
Here's just a naive solution using regexes. I have no idea if it's fast. On one hand, the regex engine is doing all the work; on the other hand, in the second code snippet, it's just brute-forcing the problem. The `match_all_ways` sub is taken from Re^3: Delimited Backtracking with Regex. `{ my @matches; my $push = qr/(?{ push @matches, $1 })/; sub match_all_ways { my ($string, $regex) = @_; @matches = (); $string =~ m/($regex)$push(?!)/; return @matches; } } sub common_substr { my ($str1, $str2, $len) = @_; my %substr = map { $_ => 1 } match_all_ways($str1 => qr/.{$len}/); $substr{$_} \|= 2 for match_all_ways($str2 => qr/.{$len}/); return grep { $substr{$_} == 3 } keys %substr; } print "$_\n" for common_substr("ABCDEF", "ABDEFCBDEAB", 2); __END__ DE EF AB` [download] Returns all the common substrings, since it might as well.. In scalar context returns the number. Update: Or, if you want the regex engine to do all of the work, instead of computing the intersection of the two lists in perl: `{ my @matches; my $push = qr/(?{ push @matches, $1 })/; sub match_all_ways { my ($string, $regex) = @_; @matches = (); $string =~ m/$regex$push(?!)/; return @matches; } } sub common_substr { my ($str1, $str2, $len) = @_; my %subs; @subs{ match_all_ways("$str1\0$str2" => qr/(.{$len}).\0.\1/) } + = (); return keys %subs; }` [download] Note that the match_all_ways sub was changed slightly (to account for the different capturing). Disclaimer: If input strings contain a newline or null character, or if $len > 65536, it doesn't work.. but I think you get the idea. These solutions are both trivial to extend to a range of lengths.. just pass "n,m" as the $len argument. blokhead	[reply] [d/l] [select]
Re: Challenge: Fast Common Substrings by BrowserUk (Patriarch) on Apr 04, 2007 at 02:54 UTC
Here's my best attempt so far: `sub nCommonSubstrLenL { my( $haystack, $needle, $len ) = @_; ( $haystack, $needle ) = ( $needle, $haystack ) if length( $haystack ) < length( $needle ); # Added my $count = 0; my %possibles; for my $ni ( 0 .. length( $needle ) - $len ) { my $possible = substr( $needle, $ni, $len ); next if ++$possibles{ $possible } > 1; ++$count if 1+index $haystack, $possible; } return $count; }` [download] Update: A slightly faster reformulation. Updated again to work for lengths other than 2. `sub nCommonSubstrLenL2 { my( $haystack, $needle, $len ) = @_; ( $haystack, $needle ) = ( $needle, $haystack ) if length( $haystack ) < length( $needle ); ## Added. # my $pattern = "A$len X" x int( length( $needle ) / $len ); my $pattern = (" A$len" . 'X' x ( $len-1 )) x (length( $needle ) - + $len +1); my $count = 0; my %possibles; for my $possible ( unpack $pattern, $needle ) { next if ++$possibles{ $possible } > 1; ++$count if 1+index $haystack, $possible; } return $count; }` [download] And buking for the bonus, a not well tested version that looks for all common substring equal or greater in length than the user specifed parameter: sub nCommonSubstrGreaterLenL { my( $haystack, $needle, $len ) = @_; my $count = 0; my %possibles; for my $l ( $len .. length( $needle ) ) { my $sofar = $count; for my $ni ( 0 .. length( $needle ) - $l ) { my $possible = substr( $needle, $ni, $l ); next if ++$possibles{ $possible } > 1; ## print( "$possible : $count" ), ++$count if 1+index $haystack, $possible; } last unless $count > $sofar; } return $count; } print 'Buk:', nCommonSubstrGreaterLenL 'ABCDEF','ABDEFCBDEAB', 2; __END__ AB : 0 DE : 1 EF : 2 DEF : 3 Buk:4 [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: Challenge: Fast Common Substrings by thezip (Vicar) on Apr 04, 2007 at 05:04 UTC
I throw my hat into the ring with my recursive implementation. I think it could perform reasonably well, since it is a divide-and-conquer type solution. I'm not sure how it will stand up to the hash-based solutions, and there might be some "correctness" issues... For the bonus, though, mine increments the count for any length of matching substrings. There are probably many opportunities for optimizations here... please offer criticism. #! perl -w use strict; my $hits = 0; sub common_substr { my($s1, $s2) = @_; print qq(s1 = $s1, s2 = $s2\n); ($s1, $s2) = ($s2, $s1) if length($s2) < length($s1); if ($s1 eq $s2) { $hits++; return if length($s1) == 1; } my %hash = map { $_ => 1 } split(//, $s1); my $arr = []; for my $s (split(//, $s2)) { push(@$arr, $s) if ! exists($hash{$s}); } my $splitters = join('\|', @$arr); for my $s (split(/$splitters/, $s2)) { common_substr($s, $s1); } } common_substr("abcef", "abcdef"); print "hits = $hits\n"; [download] Where do you want them* to go today?*	[reply] [d/l]
Re^2: Challenge: Fast Common Substrings by thezip (Vicar) on Apr 04, 2007 at 20:45 UTC
OK, I concede to the Suffix Tree solution presented by lima1 ++. I suspect the best run-time order I could muster is O(nlogn), and worst O(n²). It was still a fun diversion nonetheless... :-) Where do you want them* to go today?*	[reply]
Re^3: Challenge: Fast Common Substrings by BrowserUk (Patriarch) on Apr 04, 2007 at 21:54 UTC
I wouldn't give up yet. You need to see an implementation and how it benchmarks. Implementing Suffix Trees in pure perl is messy, memory expensive (hugely so in my attempts), and rarely works out more efficient. Don't forget that it's O(n) to build as well as O(m) to use. If you are only using it once, that snips into the benefits. For short strings a linear search (at C speed) and no build time wins easily. For longer strings, the indexing a suffix tree provides is a winner, but it gets set back awfully by the build time and memory management. If you only use the tree the once (as for this challenge), there little or no win. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re: Challenge: Fast Common Substrings by lima1 (Curate) on Apr 04, 2007 at 15:50 UTC
Just for the sake of completeness: A fast and elegant algorithm for this is a tricky use of suffix trees. One concatenates the two strings of length n and m, say abcdef%efgab$. It is possible to construct a suffix tree of this string in O(n+m) (Ukkonen algorithm). To find the common substrings, one has then to search for nodes that have exactly two (or the number of strings) leafs belonging to the different words. The resulting suffix tree for "abcdef" and "efgab": `\| \|(3:cdef%efgab$)\|leaf \|(1:ab)\| \| \|(13:$)\|leaf tree:\| \| \|(3:cdef%efgab$)\|leaf \|(2:b)\| \| \|(13:$)\|leaf \| \|(3:cdef%efgab$)\|leaf \| \|(4:def%efgab$)\|leaf \| \| \|(7:%efgab$)\|leaf \|(5:ef)\| \| \|(10:gab$)\|leaf \| \| \|(7:%efgab$)\|leaf \|(6:f)\| \| \|(10:gab$)\|leaf \| \|(7:%efgab$)\|leaf \| \|(10:gab$)\|leaf \|` [download] So "ab" has two leafs in the different words (position <= 7 for leaf 1 and position > 7 for leaf 2). So have 'b', 'ef' and 'f'. http://en.wikipedia.org/wiki/Longest_common_substring_problem Update: Just found some perl code with google ... on perlmonks ;) Re: finding longest common substring	[reply] [d/l]
Re^2: Challenge: Fast Common Substrings by blokhead (Monsignor) on Apr 04, 2007 at 16:02 UTC
++ Wow, thank you for introducing me to suffix trees. What an interesting concept, and how refreshing to see a linear-time algorithm for constructing such a creature. I see you've used the javascript applet at this page, which others may want to check out. However, I'd like to slightly revise the algorithm you outlined. Consider the following example: `string = ababc%bc$ \| \|(3:abc%bc$)\|leaf \|(1:ab)\| \| \|(5:c%bc$)\|leaf tree:\| \| \|(3:abc%bc$)\|leaf \|(2:b)\| \| \| \|(6:%bc$)\|leaf \| \|(5:c)\| \| \| \|(9:$)\|leaf \| \| \|(6:%bc$)\|leaf \|(5:c)\| \| \|(9:$)\|leaf \| \|(6:%bc$)\|leaf \| \|(9:$)\|leaf` [download] "ab" appears twice in the first string, and so it gives a node with two leaves. The actual condition you should check is whether a node has one leaf containing the % separator and another leaf without the % symbol. blokhead	[reply] [d/l]
Re^3: Challenge: Fast Common Substrings (O(n)?) by tye (Sage) on Apr 04, 2007 at 21:53 UTC
The page you link to mentions being able to build them in O(n) but then only really describes how to go from a suffix tree for string $x to one for string $x.$c (1==length$c) in O(length $x). Using that algorithm would require O(N*N) to build the suffix tree for a string of length N. So I'm not sure I believe the O(N) claim for building the whole suffix tree based on that page. - tye	[reply]
Re^4: Challenge: Fast Common Substrings (O(n)?) by lima1 (Curate) on Apr 04, 2007 at 22:12 UTC
Re^3: Challenge: Fast Common Substrings by lima1 (Curate) on Apr 04, 2007 at 16:25 UTC
Or even easier: check the positions of the substrings (<=7 and > 7 in my example).	[reply]
Re: Challenge: Fast Common Substrings by eric256 (Parson) on Apr 05, 2007 at 00:03 UTC
I dunno about speed, but its the only version i can actualy understand so far ;) and who knows, it might not be as slow as you think...probably will be..but hey! I think the second loop through s2 could probably have some short circuits added where it knows the sub string it is on doesn't occur in the first string..but it's time to go home! use strict; use warnings; use Data::Dumper; sub common_sub { my ($s1,$s2,$len) = @_; my $len_s1 = length($s1); my $len_s2 = length($s2); my $match_s1 = {}; my $match_s2 = {}; for my $start (0..length($s1)-1) { for my $l (1..$len) { next if $start+$l > $len_s1; $match_s1->{substr($s1, $start, $l)} \|\|= 1; } } for my $start (0..length($s2)-1) { for my $l (1..$len) { next if $start+$l > $len_s2; $match_s2->{substr($s2, $start, $l)} \|\|= 1; } } $match_s1->{$_}++ for keys %$match_s2; return grep { $match_s1->{$_} == 2 } keys %$match_s1; } print join(",", common_sub("ABCD", "BCD",2)); [download] ___________ Eric Hodges	[reply] [d/l]
Re: Challenge: Fast Common Substrings by BrowserUk (Patriarch) on Apr 05, 2007 at 15:23 UTC
Okay here is a benchmark as requested. I doubt it will satisfy everyone. Some comments on the benchmark and results obtained. Randomly generated strings of 30 .. 50 lower case characters rarely if ever produce common substrings of 3 characters, never mind longer. To this end I used 'A'..'D', which produces a few at 3, 4, & 5 characters long, but never longer that I saw. Even adjacent pairings of 5000 strings takes a very long time to run, never mind a full combinatorial pairing. I used 500 strings instead. It's a command line parameter, if you've the time and inclination to run 5000, go ahead. I doubt it will make any huge difference to the outcome. theZip's (TZ) code is horrible to benchmark as it relies upon a global var. Ie. The sub does not return the result. I had a half hearted attempt to address this using a helper sub closing over the global, but the result are wildly inaccurate. Possibly? the fault of my adaption. Sorry thezip. However, it seems to be far and away the slowest algorithm anyway. eric256's (eric) code produces numbers that seem not to relate to the inputs as far a I can discern. Maybe I screwed up, but I don't think so. It's also the second slowest algorithm. Mine (Buk), blokhead's (BH1/BH2), moron's (MN), and Limbic~Region's (LR) code all produce the same counts (although Limbic~Region's has been seen to be one shy of the others on a few occasions). The level of concurrence between these disparate algorithms and implementations is taken to indicate that they are producing the correct results. I have not manually verified them. Results (500 random strings of `'A'..'D'`, tested each against the next for common substrings 3-7 characters Read more... (4 kB) The benchmark code. CLI parameters are: -N=nn numbers of strings to generate; -LENGTH=mm: length of common substrings to look for; Interesting challenge Limbic~Region, thanks. <Reveal this spoiler or all in this thread> Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: Challenge: Fast Common Substrings by BrowserUk (Patriarch) on Apr 05, 2007 at 15:34 UTC
The same benchmark as above but excluding the two slowest/erroneous algorithms, allows me to run 1000 strings looking for common substrings of lengths 2 .. 6 and get more digestable results: Read more... (2 kB) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^3: Challenge: Fast Common Substrings by eric256 (Parson) on Apr 06, 2007 at 20:14 UTC
/me benchmarks agian with his in there as a matter of pride. C:\Perl\test>substr_bench.pl -N=1000 -LENGTH=2 Rate BH2 BH1 MN Eric L~R Buk BH2 3.37/s -- -20% -53% -70% -78% -87% BH1 4.21/s 25% -- -42% -62% -72% -84% MN 7.22/s 114% 71% -- -35% -52% -73% Eric 11.1/s 231% 164% 54% -- -26% -58% L~R 15.1/s 347% 258% 109% 35% -- -44% Buk 26.7/s 694% 534% 270% 140% 77% -- ---- BH2 => 14 L~R => 12 Eric => 14 Buk => 14 BH1 => 14 MN => 14 C:\Perl\test>substr_bench.pl -N=1000 -LENGTH=3 Rate BH1 MN BH2 Eric L~R Buk BH1 3.88/s -- -23% -39% -46% -73% -80% MN 5.05/s 30% -- -21% -30% -65% -74% BH2 6.40/s 65% 27% -- -11% -56% -67% Eric 7.21/s 86% 43% 13% -- -50% -63% L~R 14.4/s 272% 185% 125% 100% -- -26% Buk 19.4/s 400% 284% 203% 169% 35% -- ---- BH2 => 9 L~R => 9 Eric => 9 Buk => 9 BH1 => 9 MN => 9 C:\Perl\test>substr_bench.pl -N=1000 -LENGTH=4 Rate BH1 MN Eric BH2 L~R Buk BH1 3.56/s -- -19% -43% -57% -74% -79% MN 4.39/s 23% -- -30% -47% -68% -74% Eric 6.22/s 75% 42% -- -25% -54% -63% BH2 8.35/s 135% 90% 34% -- -38% -51% L~R 13.5/s 281% 209% 118% 62% -- -20% Buk 16.9/s 375% 285% 171% 102% 25% -- ---- BH2 => 4 L~R => 4 Eric => 4 Buk => 4 BH1 => 4 MN => 4 C:\Perl\test>substr_bench.pl -N=1000 -LENGTH=5 Rate BH1 MN Eric BH2 L~R Buk BH1 3.82/s -- -14% -34% -59% -72% -78% MN 4.44/s 16% -- -24% -52% -67% -74% Eric 5.82/s 52% 31% -- -37% -57% -66% BH2 9.27/s 143% 109% 59% -- -32% -46% L~R 13.6/s 255% 205% 133% 46% -- -21% Buk 17.1/s 348% 285% 194% 85% 26% -- ---- BH2 => 0 L~R => 0 Eric => 0 Buk => 0 BH1 => 0 MN => 0 C:\Perl\test>substr_bench.pl -N=1000 -LENGTH=6 (warning: too few iterations for a reliable count) Rate BH1 MN Eric BH2 L~R Buk BH1 3.56/s -- -21% -38% -64% -73% -80% MN 4.50/s 27% -- -21% -54% -66% -74% Eric 5.73/s 61% 27% -- -42% -57% -67% BH2 9.84/s 177% 119% 72% -- -25% -43% L~R 13.2/s 271% 193% 130% 34% -- -24% Buk 17.4/s 389% 286% 203% 77% 32% -- ---- BH2 => 1 L~R => 1 Eric => 1 Buk => 1 BH1 => 1 MN => 1 [download] ___________ Eric Hodges	[reply] [d/l]
Re^2: Challenge: Fast Common Substrings by eric256 (Parson) on Apr 06, 2007 at 20:07 UTC
Hey I misread the definition of substring to mean all substrings of that length or lower...quick easy fix and it starts placing decent. sub common_sub { my ($s1,$s2,$len) = @_; my $len_s1 = length($s1); my $len_s2 = length($s2); my $match_s1 = {}; my $match_s2 = {}; for my $start (0..length($s1)-1) { #for my $l (1..$len) { next if $start+$len > $len_s1; $match_s1->{substr($s1, $start, $len)} \|\|= 1; #} } for my $start (0..length($s2)-1) { #for my $l (1..$len) { next if $start+$len > $len_s2; $match_s2->{substr($s2, $start, $len)} \|\|= 1; #} } $match_s1->{$_}++ for keys %$match_s2; return grep { $match_s1->{$_} == 2 } keys %$match_s1; } [download] ___________ Eric Hodges	[reply] [d/l]
Re: Challenge: Fast Common Substrings by Moron (Curate) on Apr 04, 2007 at 10:01 UTC
# more update: this is the tested version now!! sub CountSubstrings { # $string1, $string2, $substr_length my $l = pop @_; my %found = (); my %match = (); my $first = 2; my $string1 = $_[0]; for my $string ( @_ ) { $first --; my $ls = length( $string ); my $limit = $ls - $l + 1; for ( my $i = 0; $i < $limit; $i++ ) { my $sbstr = substr( $string, $i, $l ); $first or defined ( $match{$sbstr} ) && next(); $found{ $sbstr }{ $string } \|\|= 1; $first and next;; defined ( $found{ $sbstr }{ $string1 } ) and $match{ $sbstr } = 1; } } Count( keys %match ); } sub Count { 1+$#_ }; [download] Description of alggorithm: Store all substrings of string 1 of the given length in subhash $found{ $string1 }; do the same for string 2 but in addition lookup in the first hash and record any matches in the %match hash. The suggested opportunities for optimisation include (a) only need to check matches for the substring iteration of the second string. (b) only need to build the substring for string 2 if there is no match found in the match hash. -M Free your mind	[reply] [d/l]
Re^2: Challenge: Fast Common Substrings by BrowserUk (Patriarch) on Apr 04, 2007 at 10:25 UTC
# untested True. It doesn't even compile. And when you've corrected the numerous errors, it doesn't work. And it's hard to see how you thought it would. Care to enlighten? (And did you notice the word "fast" in the title?) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^3: Challenge: Fast Common Substrings by Moron (Curate) on Apr 04, 2007 at 12:12 UTC
I think many people will see this is only an algorithm outline and will take the word "untested" as an approriate hint in that regard. I am multitasking at the moment. re "enlighten" - I think under the circumstances the most appropriate enlightment I can offer is to improve your lateral thinking In particular I recommend Six Thinking Hats. You would benefit from the insight that you tend to be stuck wearing a black hat (negative logic) and need to learn how to take it off at will and replace it with other forms of thinking when needed. If you need evidence - In your post you say on the one hand you find it difficult to see how I expect it to work and yet you go on to imply an ability to assess its performance when corrected. That's like saying the performance of anything in a category is X without realising the scope you have already accepted for that category, making it impossible to make any sensible such judgement. Conversely, all I am offering is a means to use hashes to solve the problem and I sincerely expect this to perform well when corrected given the pure simplicity of the algorithm and the opportunities it uses to cut out unnecessary iteration. I think you will find that many beginners out there will find it easy enough to correct this code and get a fast-performing result. So why not you, eh? You're clearly not nearly a beginner are you. So what else can it be ... ? -M Free your mind	[reply]
Re^4: Challenge: Fast Common Substrings by BrowserUk (Patriarch) on Apr 04, 2007 at 12:27 UTC
A reply falls below the community's threshold of quality. You may see it by logging in.


Perl-Sensitive Sunglasses
	PerlMonks