"be consistent" PerlMonks

### Matching bits of 2 strings

by m2 (Sexton)
 on Jan 22, 2002 at 04:09 UTC
m2 has asked for the wisdom of the Perl Monks concerning the following question:

Hiyas :) Here's the situation: I have many pairs of strings that may (or may not) match each other from the beginning up to a certain point, and then differ. What's the "best" (efficient, clean, etc.) way of determining (1) the matching part of the two strings and (2) the part of each that's different?
For example, for:
```
$string1 = "the date is today";
$string2 = "the date is tomorrow";
```
I want:
```
$matchingpart = "the date is to"
$diff1 = "day"
$diff2 = "morrow"
```
It isn't hard to iterate over the first string letter by letter and do comparison to the second, but it really seems like there should be a more... elegant solution. Any ideas?
Regards,
m2.

Replies are listed 'Best First'.
Re: Matching bits of 2 strings
by japhy (Canon) on Jan 22, 2002 at 04:50 UTC
This is the "leading common substring" problem, and there's been work done on it in Perl. Here's a relatively fast solution:
```
# leading common substring
# returns length of LCS
# requires Perl 5.6+ (uses the @+ array)
sub lcs {
    # XOR of two strings is "\0" wherever their bytes agree
    ($_[0] ^ $_[1]) =~ /\0*/;
    return $+[0];
}
```
If you're not fortunate enough to have Perl 5.6 yet, here's a compatible approach:
```
# leading common substring
# returns length of LCS
# works on Perls older than 5.6 (does not need @+)
sub lcs {
    ($_[0] ^ $_[1]) =~ /(\0*)/;
    return length $1;
}
```
Update: in case you were worried, I do not need a ^ anchor in these regexes, because * will gladly match zero times at the beginning of the string if it has to.
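Since the original question asks for the strings rather than a length, the length can be fed to substr to recover all three pieces. The following is only a sketch under some assumptions: `split_common_prefix` is a hypothetical name that does not appear in this thread, and the code assumes plain byte strings, since string XOR is not meant for characters above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split two strings into their
# common leading part and the two differing tails.
sub split_common_prefix {
    my ($s1, $s2) = @_;

    # String XOR yields "\0" wherever both strings hold the same byte.
    # Interpolating into "" makes sure both operands are treated as strings.
    my ($nulls) = ("$s1" ^ "$s2") =~ /^(\0*)/;
    my $len = length $nulls;

    return (substr($s1, 0, $len), substr($s1, $len), substr($s2, $len));
}

my ($match, $diff1, $diff2) =
    split_common_prefix("the date is today", "the date is tomorrow");

print "matching part: $match\n";
print "diff1: $diff1\n";
print "diff2: $diff2\n";
```

When one string is a complete prefix of the other, the corresponding tail simply comes back empty.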

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Matching bits of 2 strings
by Fletch (Chancellor) on Jan 22, 2002 at 07:31 UTC

I had this problem come up a while back (in fact I think it was my question that prompted the thread on fwp about it . . .). At any rate, this was something I had lying around from it that benchmarks three solutions. You can decide for yourself if the first is cheating or if you really, really care about speed. :)

Also, note that these return the length of the common leading string, not the string itself.

```
#!/usr/bin/perl
use Benchmark qw( cmpthese );

# C version: walk both strings until the first differing byte.
use Inline C => <<EOF;
int comlen(char *p, char *q) {
    int i = 0;
    while( *p && (*p++ == *q++) )
        i++;
    return i;
}
EOF

# XOR the strings, then measure the leading run of NUL bytes.
# \0* (rather than \0+) still matches when nothing is in common.
sub comlen_or {
    length((($_[0]^$_[1])=~m/^(\0*)/)[0]);
}

# XOR the strings, then count NUL bytes with y///.
sub comlen_tr {
    my( $t );

    return ($t=$_[0]^$_[1])=~ y/\0/\0/;
}

$a = "abcdefghijk";
$b = "abcdefg";

cmpthese( shift || 2_000_000,
    {
        inline_c  => sub { comlen( $a, $b ) },
        comlen_or => sub { comlen_or( $a, $b ) },
        comlen_tr => sub { comlen_tr( $a, $b ) },
    },
);

exit 0;
```
You've beaten me to the use of C. I remembered that it would be faster to call an XS (or Inline::C) function here. However, I'd like to point out that tr/// does not solve this problem at all. Using tr/\0// merely tells you how many positions hold the same character in both words in total, not how many characters are the same at the start of the string. This is the fatal flaw. I believe this flaw existed in the LCS presentation at a YAPC in years gone by.

Example: calling the function with "brought" and "wrought" should return 0, except that the tr/// solution returns 6.
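The difference can be checked with a small script. This is just an illustrative sketch; the sub names `count_equal_positions` and `leading_common_length` are made up here and are not the ones used in the benchmark.

```perl
use strict;
use warnings;

# The tr/// approach: counts every position where the bytes agree.
sub count_equal_positions {
    my $x = "$_[0]" ^ "$_[1]";
    return $x =~ tr/\0//;
}

# The anchored regex approach: counts only the leading run of agreeing bytes.
sub leading_common_length {
    my ($nulls) = ("$_[0]" ^ "$_[1]") =~ /^(\0*)/;
    return length $nulls;
}

for my $pair (["brought", "wrought"], ["abcdefghijk", "abcdefg"]) {
    my ($s, $t) = @$pair;
    printf "%s / %s: tr counts %d, leading run is %d\n",
        $s, $t, count_equal_positions($s, $t), leading_common_length($s, $t);
}
```

The two subs agree for the benchmark's own test strings, which is why the benchmark alone does not expose the problem.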

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Matching bits of 2 strings
by Zaxo (Archbishop) on Jan 22, 2002 at 07:04 UTC

If you sort a list of the strings into an array, the deepest matches will be neighbors. That allows you to terminate the search after relatively few applications of japhy's excellent advice.
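A rough sketch of that idea, assuming the strings are in one flat list and reusing the XOR trick from japhy's reply. The names `leading_common_length` and `neighbor_prefix_lengths` are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Length of the common leading part, via string XOR.
sub leading_common_length {
    my ($nulls) = ("$_[0]" ^ "$_[1]") =~ /^(\0*)/;
    return length $nulls;
}

# Hypothetical helper: sort the strings, then measure each neighboring pair.
# Returns a list of [left, right, common_length] array refs.
sub neighbor_prefix_lengths {
    my @sorted = sort @_;
    return map {
        [ $sorted[$_], $sorted[$_ + 1],
          leading_common_length($sorted[$_], $sorted[$_ + 1]) ]
    } 0 .. $#sorted - 1;
}

my @input = ("the date is tomorrow", "that day", "the date is today", "then");

for my $row (neighbor_prefix_lengths(@input)) {
    my ($left, $right, $len) = @$row;
    print "$len: '$left' / '$right'\n";
}
```

Only n-1 comparisons are made for n strings, instead of comparing every string against every other.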

After Compline,
Zaxo

Re: Matching bits of 2 strings
by Anonymous Monk on Jan 22, 2002 at 13:49 UTC
My solution to the problem isn't as compact as the other solutions, but TIMTOWTDI!
```
#! /usr/local/bin/perl

$string1 = "the date is today";
$string2 = "the date is tomorrow";

$concat = $string1 . "#" . $string2;

# the greedy first group grabs the longest start that \1 can repeat after the "#"
if ( $concat =~ m/^(.*)(.*)#\1(.*)$/ ) {
    print "matching part: $1\n";
    print "difference: $2\ndifference: $3\n";
} else {
    print "strings $string1 and $string2 do not match!\n";
}
```
When run, it generates the following output:
```
matching part: the date is to
difference: day
difference: morrow
```
I have to note that this solution relies on the fact that the two strings don't contain any pound symbols (#). When one of the two strings contains a pound symbol, you have to use another symbol (or sequence of symbols) as the separator, for example:
```
...
$concat = $string1 . "#!!#" . $string2;

if ( $concat =~ m/^(.*)(.*)#!!#\1(.*)$/ ) {
...
```
I hope that this will give you a hint to construct your own solution!
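One way to avoid guessing a safe separator by hand is to search for a character that occurs in neither string. This is only a sketch under that assumption: `split_by_backreference` is a hypothetical name, and the choice of control characters 1 to 31 as candidates is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: same backreference idea, but the separator is picked
# at run time so that it cannot occur in either string.
sub split_by_backreference {
    my ($s1, $s2) = @_;

    # Look for a control character that appears in neither string.
    my $sep;
    for my $code (1 .. 31) {
        my $c = chr $code;
        if (index($s1 . $s2, $c) < 0) {
            $sep = $c;
            last;
        }
    }
    return unless defined $sep;    # no usable separator found

    # /s lets "." match newlines; \Q...\E quotes the separator.
    my $joined = $s1 . $sep . $s2;
    return ($1, $2, $3) if $joined =~ /\A(.*)(.*)\Q$sep\E\1(.*)\z/s;
    return;
}

my ($match, $diff1, $diff2) = split_by_backreference("a#b#c", "a#b#d");
print "matching part: $match\n";
print "difference: $diff1\ndifference: $diff2\n";
```

Because the separator occurs exactly once in the joined string, the first two groups must add up to the first string, and the greedy first group ends up holding the longest shared start.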

Node Type: perlquestion [id://140538]