Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

If there a way to find the location of the first difference between two strings?

by flexvault (Monsignor)
on Mar 26, 2012 at 09:02 UTC ( #961622=perlquestion: print w/replies, xml ) Need Help??

flexvault has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Since looping on characters is not one of Perl's strong points, is there a way to get the location of the first difference between two strings? (Note: I typed the code in since I can't cut and paste between my browser and my Xterm.)

perl -e 'use bytes;$s1="abcd";$s2="abcz";$dif=$cmp=$s1 cmp $s2;print " +$dif\t$cmp\n";'

What I would like to get, is the following:

3 -1

Obviously it doesn't work that way, but is there an alternate that I'm not aware of?

Thank you

"Well done is better than well said." - Benjamin Franklin

Replies are listed 'Best First'.
Re: If there a way to find the location of the first difference between two strings?
by JavaFan (Canon) on Mar 26, 2012 at 09:21 UTC
    use 5.010; my $t = $s1 ^ $s2; my ($f) = $t =~ /^(\x{00}*)/; say length $f;
      use 5.010; my $t = $s1 ^ $s2; my ($f) = $t =~ /^(\x{00}*)/; say length $f;

      Another way to do that:

      use 5.010; my $t = $s1 ^ $s2; $t =~ /^\0*/ && say $+[0]

        JavaFan and jwkrahn,

        C o o l !

        That's the code I was hoping for...

        Need to benchmark now!

        UPDATE: My apologies to JavaFan, I didn't realize his post was first. Both solutions Benchmarked are better than mine (376%).

        Thank you

        "Well done is better than well said." - Benjamin Franklin

Re: If there a way to find the location of the first difference between two strings?
by AnomalousMonk (Bishop) on Mar 26, 2012 at 16:15 UTC

    Note that if the strings being compared are equal, both JavaFan and jwkrahn's solutions yield a number for the first difference offset that is equal to the length of the string(s).

    >perl -wMstrict -lE "my $s1 = 'abcdefg'; my $s2 = 'abcdefg'; ;; my $t = $s1 ^ $s2; ;; my ($f) = $t =~ /^(\x{00}*)/; say length $f; ;; $t =~ /^\0*/ && say $+[0]; " 7 7

      I still needed to know how they compare (lt eq gt), so I did the following:

      srand( 711 ); . . . sub case2 { my $s1 = 'a' x 39 . 'b'; my $s2 = 'a' x 39 . chr( 98 + rand(4)); my ( $cmp, $loc ) = Cmp_and_Loc ( $s1, $s2 ); } sub Cmp_and_Loc { my ( $s1, $s2, undef ) = @_; my $t = 0; my $cmp = $s1 cmp $s2; if ( $cmp ) { $t = $s1 ^ $s2; $t =~ /^\0*)/; $t = $+[0]; } return ( $cmp, $t ); }

      This appeared to work since if the strings are equal the sub returns 0,0. I will test your variation, but it still flies compared to anything I tried. Using the rand(4), I think I was able to test all combinations.

      As always, when someone shows you how to do it correctly, it becomes easy :-)

      Thank you

      "Well done is better than well said." - Benjamin Franklin

      Small fix:
      $t =~ /[^\0]/ && say $-[0];
Re: If there a way to find the location of the first difference between two strings?
by jaredor (Priest) on Mar 28, 2012 at 05:03 UTC

    If you are going to use subroutines in lieu of one-liners, here's a non-bit twiddling version. You've already got a solution, so please forgive the redundancy; this is a nice excuse for me to practice implementing streams a la HOP: ["Higher-Order Perl" now available for free download]

    If you, like me, feel leery using bit-wise operations on strings that might be Unicode, this may be a more comforting approach. While your other solutions may indeed work 100% of the time with Unicode strings as well, that is too much thinking and worry for me so I just punt to the standard string manipulation & comparison functions in perl.

    The code doesn't have any error checking, but the compare stream tries to do the right thing (per my tastes) for the boundary cases. Obviously you can change stream output behavior to taste. I left the characters in the output to better demonstrate the results. (And I generalized it a little to allow use of streams of characters in addition to just strings, so the YAGNI line tax is 1, leaving aside the YAGNI maintenance tax ;-)

    I'm usually late the party so I don't expect many to see this, but if anyone has suggestions for improvement I would be interested in hearing them.

    This code



      Thanks for your input. If you notice in the original post I said "use bytes" to eliminate concerns about UCA.

      perl -e 'use bytes;$s1="abcd";$s2="abcz";$dif=$cmp=$s1 cmp $s2;print " +$dif\t$cmp\n";'

      The performance hit for using UCA is just too great. In some of my tests, the performance was degradated by as much as 10,000%. As for "bit-wise operations on strings", I have a math background and started programming by writing code in machine language, and later assembler, Basic, Fortran, C, and a lot of others, until I had the good fortune of being introduced to Perl.

      To explain why performance is so critical, I have been writing a "pure-perl" data base engine, to replace Oracle's BerkeleyDB and MySQL in all of our products. So our goal was to come within 20% of the performance of Oracle products. As it turns out, our clients will see enhanced performance when we switch them over, and we will be able to provide database support on any platform that Perl runs on. (An area where Perl excels!)

      I have been very impressed with the performance of Perl since 5.8.x. So in profiling( -d:NTYProf ) of the code, the routine I asked about, is called 14,595,348 times on a test of writing 100K records. So even a slight improvement would be welcome. Thanks to the PM answers, I got a 376% increase in performance. Great!
      (Note: Some of our clients have databases with billions of records.)

      When I wrote Perl performance just gets better and better! my intent was in showing that Perl has improved over the years. It was the first time that I had an actual test case to run on several versions of Perl from 5.6.1 to 5.12.2. Since then I have tested with 5.14.2 with even better results. I don't know why Perl performance is improving for this type of work, but I can demonstrate that it is. I also have incorrectly used the term "modern Perl" in the past, since I didn't realize that a module "Modern::Perl" existed.

      Thank you and Good Luck!

      "Well done is better than well said." - Benjamin Franklin

        I didn't realize that a module "Modern::Perl" existed.

        It's just a silly little shortcut to enable new (and should-have-been-on-by-default) features in the most recent releases of Perl 5. "Modern" is deliberately vague.

        Thanks for the background flexvault, I doubt I would have posted anything had I known you were doing something with database keys. I thought you might be writing some sort of diff routine for a homebrew editor or some such. (I should have checked you out anyway to see that you've way too much history and mojo to need to be told about iterators.)

        I looked more at JavaFan and jwkrahn's solutions than your initial statement of the problem, so overlooked your use of the bytes pragma. I guess I'm conditioned to look for the -M and -m options. I've never used the bytes module, which seems to make all strings just byte vectors. Modding out by endianess, do you think there's some sort of bit-wise C idiom out there to capitalize on the fact one and only one of ($s1 & ($s1 ^ $s2)) or ($s2 & ($s1 ^ $s2)) will have the "high order bit"? You might be able to get away from using a regexp by, e.g., craftily using bit shifts. But I'm unfamiliar with issues such as if using numerical ordering in database keys impacts performance with things that might have a different lexicographic ordering.

        I don't think you need to apologize for using "modern Perl" in a general sense. chromatic puts that include at the top of his responses in PM and it's good PR for his excellent book, Modern Perl 2011-2012 Edition, but knowledgeable folk such as yourself are given lots of latitude by students such as myself, who learn a lot whenever you produce a "modern Perl" example.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://961622]
Approved by Corion
Front-paged by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2021-01-27 04:22 GMT
Find Nodes?
    Voting Booth?