Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Re: If there a way to find the location of the first difference between two strings?

by jaredor (Curate)
on Mar 28, 2012 at 05:03 UTC ( #962072=note: print w/ replies, xml ) Need Help??

in reply to If there a way to find the location of the first difference between two strings?

If you are going to use subroutines in lieu of one-liners, here's a non-bit twiddling version. You've already got a solution, so please forgive the redundancy; this is a nice excuse for me to practice implementing streams a la HOP: ["Higher-Order Perl" now available for free download]

If you, like me, feel leery using bit-wise operations on strings that might be Unicode, this may be a more comforting approach. While your other solutions may indeed work 100% of the time with Unicode strings as well, that is too much thinking and worry for me so I just punt to the standard string manipulation & comparison functions in perl.

The code doesn't have any error checking, but the compare stream tries to do the right thing (per my tastes) for the boundary cases. Obviously you can change stream output behavior to taste. I left the characters in the output to better demonstrate the results. (And I generalized it a little to allow use of streams of characters in addition to just strings, so the YAGNI line tax is 1, leaving aside the YAGNI maintenance tax ;-)

I'm usually late the party so I don't expect many to see this, but if anyone has suggestions for improvement I would be interested in hearing them.

This code

#!/usr/bin/env perl use Modern::Perl; use Data::Dump qw(pp); my ($foo, $bar) = ("abcdef", "abdfec"); sub pop_char { my @chars; { no warnings; @chars = map { split '' } @_; } return sub { return shift @chars; } } sub cmp_str { my ($i, $s1, $s2) = (0, @_); $s1 = pop_char $s1 if not ref $s1; $s2 = pop_char $s2 if not ref $s2; return sub { no warnings; my ($c1, $c2) = ($s1->(), $s2->()); return ($c1 or $c2) ? [ $i++, $c1 cmp $c2, $c1, $c2 ] : undef; } } my ($cmp_foo_bar, $cmp_char); say "\nNull case: both strings empty"; $cmp_foo_bar = cmp_str; say pp $cmp_char while defined ($cmp_char = $cmp_foo_bar->()); say "\nSecond string null"; $cmp_foo_bar = cmp_str $foo; say pp $cmp_char while defined ($cmp_char = $cmp_foo_bar->()); say "\nFirst string null"; $cmp_foo_bar = cmp_str '', $bar; say pp $cmp_char while defined ($cmp_char = $cmp_foo_bar->()); $cmp_foo_bar = cmp_str $foo, $bar; say "\nBoth strings same length"; say pp $cmp_char while defined ($cmp_char = $cmp_foo_bar->()); $cmp_foo_bar = cmp_str $foo.$bar, $bar; say "\nFirst string longer"; say pp $cmp_char while defined ($cmp_char = $cmp_foo_bar->()); exit; $cmp_foo_bar = cmp_str $foo.$bar, sub { return 'A'; }; say "\nBoth strings against infinite A's"; say pp $cmp_char while defined ($cmp_char = $cmp_foo_bar->());


Null case: both strings empty Second string null [0, 1, "a", undef] [1, 1, "b", undef] [2, 1, "c", undef] [3, 1, "d", undef] [4, 1, "e", undef] [5, 1, "f", undef] First string null [0, -1, undef, "a"] [1, -1, undef, "b"] [2, -1, undef, "d"] [3, -1, undef, "f"] [4, -1, undef, "e"] [5, -1, undef, "c"] Both strings same length [0, 0, "a", "a"] [1, 0, "b", "b"] [2, -1, "c", "d"] [3, -1, "d", "f"] [4, 0, "e", "e"] [5, 1, "f", "c"] First string longer [0, 0, "a", "a"] [1, 0, "b", "b"] [2, -1, "c", "d"] [3, -1, "d", "f"] [4, 0, "e", "e"] [5, 1, "f", "c"] [6, 1, "a", undef] [7, 1, "b", undef] [8, 1, "d", undef] [9, 1, "f", undef] [10, 1, "e", undef] [11, 1, "c", undef]

Replies are listed 'Best First'.
Re^2: If there a way to find the location of the first difference between two strings?
by flexvault (Monsignor) on Mar 28, 2012 at 16:08 UTC


    Thanks for your input. If you notice in the original post I said "use bytes" to eliminate concerns about UCA.

    perl -e 'use bytes;$s1="abcd";$s2="abcz";$dif=$cmp=$s1 cmp $s2;print " +$dif\t$cmp\n";'

    The performance hit for using UCA is just too great. In some of my tests, the performance was degradated by as much as 10,000%. As for "bit-wise operations on strings", I have a math background and started programming by writing code in machine language, and later assembler, Basic, Fortran, C, and a lot of others, until I had the good fortune of being introduced to Perl.

    To explain why performance is so critical, I have been writing a "pure-perl" data base engine, to replace Oracle's BerkeleyDB and MySQL in all of our products. So our goal was to come within 20% of the performance of Oracle products. As it turns out, our clients will see enhanced performance when we switch them over, and we will be able to provide database support on any platform that Perl runs on. (An area where Perl excels!)

    I have been very impressed with the performance of Perl since 5.8.x. So in profiling( -d:NTYProf ) of the code, the routine I asked about, is called 14,595,348 times on a test of writing 100K records. So even a slight improvement would be welcome. Thanks to the PM answers, I got a 376% increase in performance. Great!
    (Note: Some of our clients have databases with billions of records.)

    When I wrote Perl performance just gets better and better! my intent was in showing that Perl has improved over the years. It was the first time that I had an actual test case to run on several versions of Perl from 5.6.1 to 5.12.2. Since then I have tested with 5.14.2 with even better results. I don't know why Perl performance is improving for this type of work, but I can demonstrate that it is. I also have incorrectly used the term "modern Perl" in the past, since I didn't realize that a module "Modern::Perl" existed.

    Thank you and Good Luck!

    "Well done is better than well said." - Benjamin Franklin

      I didn't realize that a module "Modern::Perl" existed.

      It's just a silly little shortcut to enable new (and should-have-been-on-by-default) features in the most recent releases of Perl 5. "Modern" is deliberately vague.

      Thanks for the background flexvault, I doubt I would have posted anything had I known you were doing something with database keys. I thought you might be writing some sort of diff routine for a homebrew editor or some such. (I should have checked you out anyway to see that you've way too much history and mojo to need to be told about iterators.)

      I looked more at JavaFan and jwkrahn's solutions than your initial statement of the problem, so overlooked your use of the bytes pragma. I guess I'm conditioned to look for the -M and -m options. I've never used the bytes module, which seems to make all strings just byte vectors. Modding out by endianess, do you think there's some sort of bit-wise C idiom out there to capitalize on the fact one and only one of ($s1 & ($s1 ^ $s2)) or ($s2 & ($s1 ^ $s2)) will have the "high order bit"? You might be able to get away from using a regexp by, e.g., craftily using bit shifts. But I'm unfamiliar with issues such as if using numerical ordering in database keys impacts performance with things that might have a different lexicographic ordering.

      I don't think you need to apologize for using "modern Perl" in a general sense. chromatic puts that include at the top of his responses in PM and it's good PR for his excellent book, Modern Perl 2011-2012 Edition, but knowledgeable folk such as yourself are given lots of latitude by students such as myself, who learn a lot whenever you produce a "modern Perl" example.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://962072]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2016-07-23 22:26 GMT
Find Nodes?
    Voting Booth?
    What is your favorite alternate name for a (specific) keyboard key?

    Results (221 votes). Check out past polls.