Finding common substrings

ktsirig has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding common substrings by jdporter (Paladin) on Sep 20, 2006 at 22:19 UTC
Of course there's an algorithm for doing that, but it's undoubtedly overkill for your application. I think you should bring some "semantics" to the problem to make it easier. For example, both of those strings look like they contain space-separated values. So you could convert both to arrays of non-space strings, and compare the two arrays for any elements in common. One way: `$a = 'PF01389 6 218 1 255 430.09'; $b = 'PF00691 PF01389'; my @a = split ' ', $a; my @b = split ' ', $b; my %a; @a{@a} = (); my @common = grep { exists $a{$_} } @b;` Also see the Categorized Question How can I find the union/difference/intersection of two arrays? Update: Just for fun, here's another way, using regexes (working on a copy so as not to bash `$b`): `my $pattern = $b; $pattern =~ s/\s+/|/g; @common = " $a " =~ /\s($pattern)\s/g;` Update2: That breaks down if the string being used as the source of the regex (`$b` here) contains regex-special characters. Better: `my $pattern = join '|', map quotemeta($_), split ' ', $b; @common = " $a " =~ /\s($pattern)\s/g;` We're building the house of the future together.
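One caveat with the regex variant: the trailing `\s` is consumed by each match, so two common tokens sitting next to each other in the first string will not both be found. A minimal sketch of a lookaround version follows; it is a variation on the idea, not jdporter's code, and the input strings and the names `$str1` and `$str2` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: the two common tokens are adjacent in $str1.
my $str1 = 'PF01389 PF00691 218';
my $str2 = 'PF00691 PF01389';

# Build an alternation of the tokens in $str2, escaping regex-special characters.
my $pattern = join '|', map { quotemeta } split ' ', $str2;

# Lookarounds check for a token boundary without consuming the whitespace,
# so the next match can start right after the separating space.
my @common = $str1 =~ /(?<!\S)($pattern)(?!\S)/g;

print "@common\n";
```

Because nothing around the token is consumed, the string no longer needs to be padded with spaces before matching.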
Re: Finding common substrings by johngg (Canon) on Sep 20, 2006 at 22:29 UTC
Firstly, `$a` and `$b` are best avoided as variable names as they are pre-declared for use with the `sort` function. You could do this task by splitting one string on spaces (assuming your data is always space delimited) to populate a hash, then splitting the second string and checking whether any part of it already exists in the hash. `use strict; use warnings; my $str1 = q{PF01389 6 218 1 255 430.09}; my $str2 = q{PF00691 PF01389}; my %str1Hash = map {$_ => 1} split m{\s+}, $str1; foreach my $possible (split m{\s+}, $str2) { print qq{$possible common\n} if exists $str1Hash{$possible}; }` I hope this is of use. Cheers, JohnGG
Re: Finding common substrings by ayrnieu (Beadle) on Sep 20, 2006 at 22:13 UTC
`use List::Compare; use caveat "I haven't tested this."; my @in_both = List::Compare->new([split /\s+/, $a], [split /\s+/, $b])->get_intersection;`
Re: Finding common substrings by mreece (Friar) on Sep 20, 2006 at 22:34 UTC
Split each string and look for dupes: `$a = 'PF01389 6 218 1 255 430.09'; $b = 'PF00691 PF01389'; my %counts; foreach ( split /\s+/, $a ) { $counts{$_} = 1; } foreach ( split /\s+/, $b ) { $counts{$_}++ if exists $counts{$_}; } my @common = grep $counts{$_} > 1, keys %counts; if ( @common ) { print "correct\n"; }` Or, less verbose: `$a = 'PF01389 6 218 1 255 430.09'; $b = 'PF00691 PF01389'; my %in_a = map { $_ => 1 } split /\s+/, $a; my @in_both = grep { exists $in_a{$_} } split /\s+/, $b; if ( @in_both ) { print "correct\n"; }`
Re^2: Finding common substrings by ktsirig (Sexton) on Sep 20, 2006 at 22:46 UTC
Thank you all! You really helped me understand a lot of things just by this question I had!
Re^2: Finding common substrings by johngg (Canon) on Sep 21, 2006 at 09:19 UTC
I might be wrong but I think your first method will give a false positive if one string contains a duplicated word but that word doesn't appear in the other string. The `$counts{$_}` will be more than one but only because the word appeared twice in the same string, not because it was duplicated in the other string. Cheers, JohnGG
Re^3: Finding common substrings by mreece (Friar) on Sep 21, 2006 at 16:36 UTC
Actually, it won't, because the first foreach only sets the count to 1 rather than using ++, and the second foreach only does ++ if the key already exists, which means it was already found in `$a`.
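A small check of that claim, as a sketch with made-up input: the word `dup` is hypothetical and appears twice in the first string and never in the second. The loop bodies follow the first method above; only the variable names and data differ.

```perl
use strict;
use warnings;

# Hypothetical input: 'dup' is repeated in $str1 and absent from $str2.
my $str1 = 'dup dup PF01389';
my $str2 = 'PF00691 PF01389';

my %counts;

# Assignment, not increment: a repeated word in $str1 stays at 1.
$counts{$_} = 1 for split ' ', $str1;

# Only words already seen in $str1 are incremented.
for my $word ( split ' ', $str2 ) {
    $counts{$word}++ if exists $counts{$word};
}

my @common = grep { $counts{$_} > 1 } keys %counts;

print "@common\n";
```

Here `dup` ends with a count of 1 and is filtered out, so only the word present in both strings is reported.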
Re^4: Finding common substrings by johngg (Canon) on Sep 21, 2006 at 20:50 UTC
Re: Finding common substrings by Anonymous Monk on Sep 21, 2006 at 04:12 UTC
With a bunch of assumptions (the two strings are combined so that one regex can see both): `$a = 'PF01389 6 218 1 255 430.09'; $b = 'PF00691 PF01389'; $_ = " $a \n $b "; print "$_\n" for / (\S+) (?=.*\n.* \1 )/g;`
Re: Finding common substrings by Persib (Acolyte) on Sep 21, 2006 at 10:10 UTC
`if($a =~ /[$b]/) { print "true \n" };` UPDATE: Just ignore my code, it's totally wrong. I'm sorry (maybe I'm too tired).
Re: Finding common substrings by bsb (Priest) on Sep 21, 2006 at 15:29 UTC
http://en.wikipedia.org/wiki/Longest_common_substring_problem http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
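For anyone who does want substrings in the literal sense rather than whole tokens, here is a minimal sketch of the textbook dynamic-programming approach to the longest common substring. It is not taken from either link, the function name `longest_common_substring` is made up for illustration, and it runs in time proportional to the product of the two lengths, so the suffix-tree route is the one to look at for long inputs.

```perl
use strict;
use warnings;

# Return the longest substring shared by two strings
# (the first one found, if several have the same length).
sub longest_common_substring {
    my ( $s, $t ) = @_;

    # $prev[$j] holds the length of the common suffix of the previous
    # prefix of $s and the first $j characters of $t.
    my @prev = (0) x ( length($t) + 1 );
    my ( $best_len, $best_end ) = ( 0, 0 );

    for my $i ( 1 .. length($s) ) {
        my @curr = (0) x ( length($t) + 1 );
        for my $j ( 1 .. length($t) ) {
            next unless substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 );
            $curr[$j] = $prev[ $j - 1 ] + 1;    # extend the diagonal run
            if ( $curr[$j] > $best_len ) {
                $best_len = $curr[$j];
                $best_end = $i;                 # run ends at position $i in $s
            }
        }
        @prev = @curr;
    }
    return substr( $s, $best_end - $best_len, $best_len );
}

print longest_common_substring( 'PF01389 6 218 1 255 430.09', 'PF00691 PF01389' ), "\n";
```

On the strings from the question this reports `PF01389`, the same answer the split-and-compare replies give, but it would also report partial overlaps such as a shared prefix between two different identifiers.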
Re^2: Finding common substrings by planetscape (Chancellor) on Sep 22, 2006 at 01:52 UTC
Or how about links right here on PM, even: Longest Common Subsequence; Longest repeated string...; Fast common substring matching; Search for identical substrings; finding longest common substring; Longest Common Substring. planetscape

