Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Matching bits of 2 strings

by m2 (Sexton)
on Jan 22, 2002 at 04:09 UTC ( [id://140538]=perlquestion: print w/replies, xml ) Need Help??

m2 has asked for the wisdom of the Perl Monks concerning the following question:

Hiyas :) Here's the situation: I have many pairs of 2 strings that may (or may not) match each other from the beginning up to a certain point, then become different. What's the "best" (efficient, clean, etc) way of determining 1). the matching parts of the two strings 2). the part of each that's different?
i.e. for:
$string1 = "the date is today"; $string2 = "the date is tomorrow";
I want:
$matchingpart = "the date is to" $diff1 = "day" $diff2 = "morrow"
It isn't hard to iterate over the first string letter by letter and do comparison to the second, but it really seems like there should be a more... elegant solution. Any ideas?
Regards,
m2.

Replies are listed 'Best First'.
Re: Matching bits of 2 strings
by japhy (Canon) on Jan 22, 2002 at 04:50 UTC
    This is the "leading common substring" problem. There's been work in Perl done on it. Here's a relatively fast solution:
    # leading common substring # returns length of LCS # requires Perl 5.6+ sub lcs { ($_[0] ^ $_[1]) =~ /\0*/; return $+[0]; }
    If you're not fortunate enough to have Perl 5.6 yet, here's a compatible approach:
    # leading common substring # returns length of LCS # requires Perl 5.6+ sub lcs { ($_[0] ^ $_[1]) =~ /(\0*)/; return length $1; }
    Update: in case you were worried, I do not need a ^ anchor in these regexes, because * will gladly match zero times at the beginning of the string if it has to.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Matching bits of 2 strings
by Fletch (Bishop) on Jan 22, 2002 at 07:31 UTC

    I had this problem come up a while back (in fact I think it was my question that prompted the thread on fwp about it . . .). At any rate, this was something I had lying around from it that benchmarks three solutions. You can decide for yourself if the first is cheating or if you really, really care about speed. :)

    Also, note that these are returning the length of the common string not the string itself.

    #!/usr/bin/perl use Benchmark qw( cmpthese ); use Inline C => <<EOF; int comlen(char *p, char *q) { int i = 0; while( *p && (*p++ == *q++) ) i++; return i; } EOF sub comlen_or { length((($_[0]^$_[1])=~m/^(\0+)/)[0]); } sub comlen_tr { my( $t ); return ($t=$_[0]^$_[1])=~ y/\0/\0/; } $a = "abcdefghijk"; $b = "abcdefg"; cmpthese( shift || 2_000_000, { inline_c => sub { comlen( $a, $b ) }, comlen_or => sub { comlen_or( $a, $b ) }, comlen_tr => sub { comlen_tr( $a, $b ) }, }, ); exit 0;
      You've beaten me to the use of C. I remembered that it would be faster to call an XS (or Inline::C) function here. However, I'd like to point out that using tr/// for this problem is not a help at all. Using tr/\0// merely tells you how many characters are the same between the two words in total, not how many characters are the same at the start of the string. This is the fatal flaw. I believe this flaw existed in the LCS presentation at a YAPC in years gone by.

      Example: calling the function with "brought" and "wrought" should return 0, except that the tr/// solution returns 6.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Matching bits of 2 strings
by Zaxo (Archbishop) on Jan 22, 2002 at 07:04 UTC

    If you sort a list of the strings into an array, the deepest matches will be neighbors. That allows you to terminate the search after relatively few applications of japhy's excellent advice.

    After Compline,
    Zaxo

Re: Matching bits of 2 strings
by Anonymous Monk on Jan 22, 2002 at 13:49 UTC
    My solution to the problem isn't as compact as the other solutions, but TIMTOWTDI!
    #! /usr/local/bin/perl $string1 = "the date is today"; $string2 = "the date is tomorrow"; $concat = $string1 . "#" . $string2; if ( $concat =~ m/^(.*)(.*)#\1(.*)$/ ) { print "matching part: $1\n"; print "difference: $2\ndifference: $3\n"; } else { print "strings $string1 and $string2 do not match!\n"; }
    When run it generates the following output
    matching part: the date is to difference: day difference: morrow
    I have to note that this solutions uses the fact dat the to strings doesn't contain any poundsymbols (#). When one of the to strings contain a poundsymbol, you have to use another symbol(sequence) to, for example:
    ... $concat = $string1 . "#!!#" . $string2; if ( $concat =~ m/^(.*)(.*)#!!#\1(.*)$/ ) { ...
    I hope that this will give you a hint to construct your own solution!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://140538]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-03-29 07:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found