Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Tell whether two strings differ at only one position

by rg0now (Chaplain)
on Aug 04, 2005 at 17:01 UTC ( [id://480930]=perlquestion: print w/replies, xml ) Need Help??

rg0now has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks!

I need to be able to tell, whether two lower-case strings differ at only one position, or they are the same. They must also be of the same length. That is

abab  -   abab     is OK
abab  -   abaa     is OK
abab  -   qrst     is not OK
abab  -   abba     is not OK
abab  -   ababa    is not OK
I need the fastest solution on Earth, since I have to do an enormous number of comparisions (for all two words of a fairly large dictionary). My solution is based on splitting the words into arrays and compare positionally, but it is painfully slow.
sub compare{ return 0 unless length $_[0] == length $_[1]; return 1 if $_[0] eq $_[1]; my $diff = 0; my @l1 = split //, $_[0]; my @l2 = split //, $_[1]; for(my $i = 0; $i < scalar @l1; $i++){ $diff++ if $l1[$i] ne $l2[$i]; return 0 if $diff > 1; } return 1; }
Could you make it much faster?

Replies are listed 'Best First'.
Re: Tell whether two strings differ at only one position
by BrowserUk (Patriarch) on Aug 04, 2005 at 17:32 UTC

    Try

    sub compare{ return ( length $_[0] == length $_[1] ) && ( length( $_[0] ) - 1 <= ( $_[0] ^ $_[1] ) =~ tr[\0][\0] ) }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Tell whether two strings differ at only one position
by japhy (Canon) on Aug 04, 2005 at 18:50 UTC
    Here's another XOR solution that I think is far more to the point:
    sub compare { (($_[0] ^ $_[1]) =~ tr/\0//c) < 2; }
    It counts the number of non-NULLs produced by XORing the strings together. If that number is less than 2, then the strings were either identical or differed by only one character. This could also be written as:
    sub compare { ($_[0] ^ $_[1]) =~ /^\0*(?:[^\0]\0*)?$/ }
    The regex ensures the XORed string is either entirely NULLs or has just one non-NULL in it.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      Wow! japhy++
      sub compare { (($_[0] ^ $_[1]) =~ tr/\0//c) < 2; }
      While I liked each and every solution other wise monks have been so kind to come up with so far, when I looked at the solution of japhy, I immediately saw that this is the one I have been searching for! It is clean and concise and perlish, and, quite amazingly, it is just one statement, so I can eliminate the compare sub alltogether which, as I hope, will make my script even faster!

      I am beginning to see that after a certain point, Perl programming becomes a matter of pure aesthetics...

        Except it doesn't work:
        sub compare { (($_[0] ^ $_[1]) =~ tr/\0//c) < 2; } print((compare('abc', 'abcd') ? 'ok' : 'not ok'), "\n");
        It should be
        sub compare { length $_[0] == length $_[1] && (($_[0] ^ $_[1]) =~ tr/\0//c) < 2; }
      This is what I wanted to do myself in the first place. Only that... I plainly didn't remember about the /c switch! And I tried with the /d one (without assignment, that is) but it failed like this:
      Can't modify bitwise xor (^) in transliteration (tr///) at foo.pl line + 7, near "tr/\0//d) "
      So I adopted a temp variable. Incidentally I wonder why I don't get the same error if I omit /d (with or without /c)... in both cases I'm attempting at modifying it, am I not?

        With /d and /c, you are modifying the string.

        Without either, and with source & destination tables being the same, is a special case designed for counting only.

        There would be no point in modifying the string just to replace like with like.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Tell whether two strings differ at only one position
by blazar (Canon) on Aug 04, 2005 at 17:26 UTC
    sub compare { (my $xor=$_[0]^$_[1]) =~ tr/\0//d; length $xor < 2; }
    Untested. And not benchmarked. But that would be my first try at doing what you need.
Re: Tell whether two strings differ at only one position
by Albannach (Monsignor) on Aug 04, 2005 at 18:13 UTC
    I make a lot of use of Text::Levenshtein which has an XS version that might be competitive, so here is a quick test. Fair warning: I'm far from a Benchmark guru (comments welcome), and the test set you provided might have a different distribution of types than you actually encounter, so if a particular method takes longer for a particular result, there will be some bias.
    Rate rg0now ikegami levenshtein blazar +BrowserUk rg0now 8801/s -- -57% -82% -87% + -87% ikegami 20414/s 132% -- -57% -69% + -70% levenshtein 47786/s 443% 134% -- -28% + -30% blazar 66258/s 653% 225% 39% -- + -4% BrowserUk 68683/s 680% 236% 44% 4% + --
    The code follows.

    --
    I'd like to be able to assign to an luser

Re: Tell whether two strings differ at only one position
by ikegami (Patriarch) on Aug 04, 2005 at 17:33 UTC
    sub compare { return unless length $_[0] == length $_[1]; # Omit if expecting differences almost all the time. return 1 if $_[0] eq $_[1]; my $limit = 1; foreach (split('', $_[0]^$_[1])) { next unless ord; return 0 unless $limit--; } return 1; }

    Doesn't work with Unicode.

Re: Tell whether two strings differ at only one position
by Not_a_Number (Prior) on Aug 04, 2005 at 20:27 UTC

    This:

    use strict; use warnings; # @data contains the (chomped, lower-cased) entries # from a 70K dictionary file (2076 words beginning with 'j') my $count; # Or whatever: this can check algorithm validity foreach my $item ( @data ) { compare( $item, $_ ) and $count++ for @data; } print $count; sub compare { return 0 unless length $_[0] == length $_[1]; my $diff = 0; for ( 0 .. length $_[0] ) { $diff++ if substr( $_[0], $_, 1 ) ne substr( $_[1], $_, 1 ); return 0 if $diff > 1; } return 1; }

    is very much faster than your original code.

    Surprisingly(?), initial testing seems to show that it is also considerably faster than any of the replies you have hitherto received using bitwise XOR.

    I haven't got the time to do any precise benchmarking at the moment: however, since you need the "fastest solution on Earth", I shall leave that up to you

      I adopted Albannach's nice benchmark (thanks Albannach) and reBenchmarked all the solutions again. But now, for data, I used a random set of nice Hungarian words. Here are the results:
                     Rate    rg0now  ikegami Not_a_Number BrowserUk   blazar     japhy
      rg0now        288/s        --     -66%         -82%      -95%     -96%      -97%
      ikegami       849/s      195%       --         -45%      -87%     -88%      -90%
      Not_a_Number 1557/s      441%      83%           --      -76%     -78%      -83%
      BrowserUk    6382/s     2118%     652%         310%        --     -10%      -28%
      blazar       7084/s     2362%     735%         355%       11%       --      -20%
      japhy        8900/s     2993%     949%         472%       39%      26%        --
      

      It seems that japhy's solution is the clear winner, even if it is degraded to a sub call. I think that your results might be attributed your specific choice of data, but I do not know enough about the intrinsics of Perl string handling to say the decisive words (in fact, I do not know anything about it, so...)

        Both blazar's and japhy's return true when passed 'fred', 'freda' which obviously isn't right.

        When you add the code to detect that error, you get:

        Rate rg0now ikegami Not_a_Number blazar japhy + BrowserUk rg0now 412/s -- -69% -85% -96% -96% + -96% ikegami 1311/s 218% -- -54% -86% -88% + -89% Not_a_Number 2842/s 589% 117% -- -70% -74% + -75% blazar 9628/s 2236% 634% 239% -- -13% + -17% japhy 11086/s 2590% 745% 290% 15% -- + -4% BrowserUk 11580/s 2709% 783% 307% 20% 4% + --

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

        Playing more than understanding. Using the same benchmarks I got about a 3-5% speed up with a modified version of blazar's which skips lexical assignment by using $_. Tried a few arrangements with length and local and this was the fastest I stumbled on. OS X, v5.8.2.

        sub compare_blazar_mod { ( $_ = $_[0] ^ $_[1] ) =~ tr/\0//d; 2 < length; }
Re: Tell whether two strings differ at only one position
by demerphq (Chancellor) on Aug 05, 2005 at 08:05 UTC

    Its not clear to me if it would help or not, but you may want to look at Algorithm Showdown: Fuzzy Matching which has some high speed solutions for doing this type of problem, albeit for finding all of the fuzzy matches of a list of a words of the same length in gene strings. Warning: some of the discussion in that thread is somewhat acrimonious.

    ---
    $world=~s/war/peace/g

Re: Tell whether two strings differ at only one position
by NiJo (Friar) on Aug 05, 2005 at 18:07 UTC
    The speed race for the most generic case seems to be over. But looking at the real data and considering hit probabilities should provide the largest speed up. Assuming the language dictionary comparison, hits are quite unlikely. Most comparisons can be optimized away completely.

    Bloom::Filter (or an Bit::Vector based version of it) can tell you very quickly if there is a hit in the database, but can not tell you where it matched. It produces some false positives. You have to weight the creation of the hash table against getting a nearly linear scalability.

    Playing some tricks with an 'any' char seems to be neccessary. 'ab#d' would be an extended version of lowercasing.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://480930]
Approved by ww
Front-paged by kwaping
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-29 13:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found