Difference Of Two Strings

Actually it's a lot better. When I saw how quickly this will return undef, I realized my benchmarking sucks. I have been benching only on successful matches.

On success:

l_merlyn: 16 wallclock secs (15.60 usr + 0.00 sys = 15.60 CPU) @ 6564.10/s (n=102400)
leftover: 16 wallclock secs (15.74 usr + 0.00 sys = 15.74 CPU) @ 6505.72/s (n=102400)

On failure:

l_merlyn: 1 wallclock secs ( 1.11 usr + 0.00 sys = 1.11 CPU) @ 9225.23/s (n=10240)
leftover: 45 wallclock secs (45.12 usr + 0.00 sys = 45.12 CPU) @ 226.95/s (n=10240)

I should have remembered how slow !~ m// can be.
Definitely the optimization I was looking for!

-- Randal L. Schwartz, Perl hacker

sub leftover {
  my ($full, $part) = @_;
  my @count;
  $count[ord $1]++ while $full =~ /(\w)/g;
  while ($part =~ /(\w)/g) {
    return undef if --$count[ord $1] < 0;
  }
  my $return;
  $count[$_] and $return .= chr($_) x $count[$_] for 0..$#count;
  $return;
}
[download]

#!/usr/bin/perl -wT
use strict;
no warnings "uninitialized";  # printing undef triggers a stupid warni
+ng

# testing loop
for my $pair (['aabbcc',   'abc'],
              ['aabbccdd', 'abcdd'],
              ['abc',      'abc'],
              ['abc',      '123'],
             ) {
  my $left = leftover(@$pair);
  printf ("%8s - %-6s => %-5s\n",@$pair,"'$left'");
}

# scrabble "subtraction" subroutine
sub leftover {
  my $string  = shift;
  my $letters = shift;
  $string =~ s/$_// || return for split // => $letters;
  return $string;
}

=OUTPUT
  aabbcc - abc   => 'abc'
aabbccdd - abcdd => 'abc'
     abc - abc   => ''
     abc - 123   => ''
[download]

undef

Update: added || return to test for unmatched chars.

-Blake

leftover: 16 wallclock secs (15.80 usr + 0.00 sys = 15.80 CPU) @ 6481.01/s (n=102400)
l_blakem: 26 wallclock secs (26.07 usr + 0.00 sys = 26.07 CPU) @ 3927.89/s (n=102400)

-Blake

my $str="abcd";
my $letters = "ab";

print not_in($str, $letters),"\n";

sub not_in {
 my $str = shift;
 my $letters = shift;
 my (%hash1, %hash2);
 my @arr1 = split //, $str;
 @hash1{@arr1} = @arr1;
 my @arr2 = split //, $letters;
 my @deleted = grep $_, delete @hash1{@arr2};
 @hash2{@arr2} = undef;
 delete @hash2{@deleted};
 return if %hash2;
 join '', keys %hash1;
}
[download]

l_runrig: 17 wallclock secs (17.23 usr + 0.00 sys = 17.23 CPU) @ 5943.12/s (n=102400)
leftover: 16 wallclock secs (15.86 usr + 0.00 sys = 15.86 CPU) @ 6456.49/s (n=102400)

http://fastolfe.net/transient/2001/11/02/pm.string.difference.bench

update

sub fastolfe {
        my $source = shift;
        my $chop   = shift;
        local($_);

        my %found;
        $found{$_}++ while ($_ = chop($source)) ne '';
        while (($_ = chop($chop)) ne '') {
                return if --$found{$_} < 0;
        }

        my $result;
        foreach (sort keys %found) {
                $result .= $_ while $found{$_}--;
        }
        $result;
}
[download]

sub fast_merl {
        my ($source, $chop) = @_;
        local($_);

        my %found;
        $found{$_}++ while ($_ = chop($source)) ne '';
        while (($_ = chop($chop)) ne '') {
                return if --$found{$_} < 0;
        }

        return join "", map { $_ x $found{$_} } keys %found;
}
[download]

# demerphq's test set
             Rate blakem merlyn demq_scan demerphq fastolfe fast_merl 
+fast_c scan_c
blakem      479/s     --   -24%      -39%     -44%     -46%      -47% 
+  -97%   -98%
merlyn      629/s    31%     --      -20%     -27%     -29%      -30% 
+  -96%   -97%
demq_scan   790/s    65%    26%        --      -8%     -11%      -12% 
+  -95%   -97%
demerphq    861/s    80%    37%        9%       --      -3%       -4% 
+  -95%   -96%
fastolfe    887/s    85%    41%       12%       3%       --       -1% 
+  -94%   -96%
fast_merl   901/s    88%    43%       14%       5%       2%        -- 
+  -94%   -96%
fast_c    15708/s  3179%  2396%     1889%    1723%    1671%     1644% 
+    --   -31%
scan_c    22648/s  4627%  3499%     2767%    2529%    2453%     2415% 
+   44%     --

# simple success case
             Rate blakem merlyn demerphq fastolfe fast_merl demq_scan 
+fast_c scan_c
blakem     6244/s     --   -14%     -26%     -32%      -35%      -43% 
+  -90%   -93%
merlyn     7221/s    16%     --     -14%     -21%      -24%      -34% 
+  -89%   -92%
demerphq   8429/s    35%    17%       --      -8%      -12%      -23% 
+  -87%   -91%
fastolfe   9181/s    47%    27%       9%       --       -4%      -16% 
+  -86%   -90%
fast_merl  9563/s    53%    32%      13%       4%        --      -12% 
+  -85%   -90%
demq_scan 10908/s    75%    51%      29%      19%       14%        -- 
+  -83%   -88%
fast_c    65634/s   951%   809%     679%     615%      586%      502% 
+    --   -28%
scan_c    91428/s  1364%  1166%     985%     896%      856%      738% 
+   39%     --

# simple failure case
              Rate blakem merlyn demerphq demq_scan fastolfe fast_merl
+ scan_c fast_c
blakem      7759/s     --   -39%     -51%      -60%     -63%      -63%
+   -94%   -94%
merlyn     12666/s    63%     --     -20%      -35%     -39%      -40%
+   -89%   -91%
demerphq   15783/s   103%    25%       --      -19%     -24%      -26%
+   -87%   -89%
demq_scan  19581/s   152%    55%      24%        --      -5%       -8%
+   -84%   -86%
fastolfe   20720/s   167%    64%      31%        6%       --       -2%
+   -83%   -85%
fast_merl  21209/s   173%    67%      34%        8%       2%        --
+   -82%   -85%
scan_c    119642/s  1442%   845%     658%      511%     477%      464%
+     --   -14%
fast_c    139171/s  1694%   999%     782%      611%     572%      556%
+    16%     --
[download]

UPDATED

Fastolfe

Heh. While I was setting up my benchmark there everyone else was as well. I came up with a different set of results. First jungleboy, runrig and yuckfoo failed outright on some of my test cases. Second both merlyn and fast_merl failed because the resulting set of letters are out of order, but when I put a sort clause in as was recommended they passed fine. Yours was the fastest. Only yours, mine and blakem succeded outright.

# Updated: Removed unecessary debug code, minor tidy.
use strict;
use warnings;
use Benchmark 'cmpthese';

#-----------------------------------------------------------

our %subs = (
    yuckfoo => sub {
        my ( $full, $part ) = @_;
        my $left = "";

        my ($regx) = join ( '+.*', ( split ( '', $part ) ) );
        if ( $full =~ m{$regx} )
        {

            my (@fulls) = split ( '', $full );
            my (@parts) = split ( '', $part );

            while ( my $ch = shift (@fulls) )
            {
                ( @parts && $ch eq $parts[0] ) ? shift (@parts) :
                  ( $left .= $ch );
            }
        } else
        {
            $left = "__undef__";
        }

        return $left;
    },

    jungleboy => sub {

        my ( $full, $part ) = @_;
        my ($ch);
        my (@parts) = split ( '', $part );
        my ($regx) = join ( '+.*', @parts );

        if ( $full =~ m{$regx} )
        {
            foreach $ch (@parts)
            {
                $full =~ s/[$ch]{1}//;
            }
        } else
        {
            $full = "__undef__";
        }

        return $full;
    },

    demerphq => sub {
        my ( $from, $to ) = @_;
        my %ltrs;

        $ltrs{ substr( $from, $_, 1 ) }++ foreach 0 .. length($from) -
+ 1;
        --$ltrs{ substr( $to, $_, 1 ) } < 0
          && return "__undef__" foreach 0 .. length($to) - 1;

        return join ( "", sort map { $_ x $ltrs{$_} } keys %ltrs );
    },

    demq_scan => sub {

        my ( $from, $to ) = @_;
        my $ret = "";
        my ( $f, $t ) = ( 0, 0 );
        while (1)
        {
            my ( $fc, $tc ) = ( substr( $from, $f, 1 ), substr( $to, $
+t, 1 ) );
            if ( $fc eq $tc )
            {
                $t++;
                $f++;
                if ( substr( $to, $t, 1 ) ne $tc )
                {
                    $f++, $ret .= $fc while substr( $from, $f, 1 ) eq 
+$fc;
                }
                last if $t == length($to);
            } elsif ( $fc lt $tc )
            {
                $ret .= $fc;
                $f++;
                return "__undef__" if $f >= length $from;
            } else
            {
                return "__undef__";
            }
        }
        return $ret . substr( $from, $f );
    },

    blakem => sub {

        my $string  = shift;
        my $letters = shift;
        $string =~ s/$_// || return "__undef__" for split // => $lette
+rs;
        return $string;
    },

    fastolfe => sub {
        my $source = shift;
        my $chop   = shift;
        local ($_);

        my %found;
        $found{$_}++ while ( $_ = chop($source) ) ne '';
        while ( ( $_ = chop($chop) ) ne '' )
        {
            return "__undef__" if --$found{$_} < 0;
        }

        my $result = "";    #fixed demerphq
        foreach ( sort keys %found )
        {
            $result .= $_ while $found{$_}--;
        }
        $result;
    },

    merlyn => sub {
        my ( $full, $part ) = @_;
        my %count;
        $count{$_}++ for split //, $full;
        for ( split //, $part )
        {
            return "__undef__" if --$count{$_} < 0;
        }
        return join "", map { $_ x $count{$_} } sort keys %count;
    },

    fast_merl => sub {
        my ( $source, $chop ) = @_;
        local ($_);
        my %found;
        $found{$_}++ while ( $_ = chop($source) ) ne '';
        while ( ( $_ = chop($chop) ) ne '' )
        {
            return "__undef__" if --$found{$_} < 0;
        }
        return join "", map { $_ x $found{$_} } sort keys %found;
    }

    # Not sure why it fails..
    #    runrig => sub {
    #        my $str     = shift;
    #        my $letters = shift;
    #        my ( %hash1, %hash2 );
    #        my @arr1 = split //, $str;
    #        @hash1{@arr1} = @arr1;
    #        my @arr2 = split //, $letters;
    #        my @deleted = grep $_, delete @hash1{@arr2};
    #        @hash2{@arr2} = @arr2;
    #        delete @hash2{@deleted};
    #        return "__undef__" if %hash2;
    #        join '', keys %hash1;
    #    },
);

sub test
{
    my ($sub) = @_;
    foreach my $t (

        # tests [0] - [1] = [2]
        [qw"ab a b"],
        [qw"acciimmnnnnoootu acimnotu cimnnnoo"],
        [qw"acciimmnnnnoootu acimnotuu __undef__"],
        [qw"ab ab", ""],
        [qw"aaaaaaabbbbbbbcccccccddddddddde e aaaaaaabbbbbbbcccccccddd
+dddddd"],
        [qw"aaaaaaabbbbbbbcccccccddddddddde aaaaaaabbbbbbbcccccccddddd
+dddd e"],
        [qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa
+aabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxy z"],
        [qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa
+bbbbbbcccccddddddddghijklmnopqrstuvwxy aabccdefz"],
        [qw"abbcccdde bccd abcde"],
        [qw"abbcccdde bccdf __undef__"],
        [qw"bbcccdde abccdf __undef__"],

        #uncomment me to kill the regex versions
        #[qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxz aaaaa
+bbbbbbcccccddddddddghijklmnopqrstuvwxy __undef__"],
      )
    {

        #warn "$sub @$t\n";
        my $r = $subs{$sub}->( $t->[0], $t->[1] );
        $r = ( !defined($r) ? "undef" : $r );

        #warn "'$r'\n";
        die "in $sub expected $t->[2] got $r from $t->[0] - $t->[1]"
          if $t->[2] ne $r;
    }
}

my %tests = map { $_ => "test($_)" } keys %subs;
cmpthese( -10, \%tests );

__END__
[download]

_{Benchmark: running blakem, demerphq, demerphq_scan, fast_merl, fastolfe, jungleboy, merlyn, yuckfoo, each for at least 10 CPU seconds... blakem: 11 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU) @ 497.48/s (n=5231) demerphq: 11 wallclock secs (10.72 usr + 0.00 sys = 10.72 CPU) @ 755.88/s (n=8100) demq_scan: 11 wallclock secs (10.57 usr + 0.00 sys = 10.57 CPU) @ 731.63/s (n=7737) fast_merl: 10 wallclock secs (10.61 usr + 0.00 sys = 10.61 CPU) @ 768.34/s (n=8149) fastolfe: 10 wallclock secs (10.55 usr + 0.00 sys = 10.55 CPU) @ 779.23/s (n=8217) jungleboy: 10 wallclock secs (10.18 usr + 0.00 sys = 10.18 CPU) @ 20.03/s (n=204) merlyn: 10 wallclock secs (10.56 usr + 0.00 sys = 10.56 CPU) @ 595.40/s (n=6285) yuckfoo: 10 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 20.10/s (n=211) Rate jungleboy yuckfoo blakem merlyn demerphq_scan demerphq fast_merl fastolfe jungleboy 20.0/s -- -0% -96% -97% -97% -97% -97% -97% yuckfoo 20.1/s 0% -- -96% -97% -97% -97% -97% -97% blakem 497/s 2383% 2374% -- -16% -32% -34% -35% -36% merlyn 595/s 2872% 2861% 20% -- -19% -21% -23% -24% demq_scan 732/s 3552% 3539% 47% 23% -- -3% -5% -6% demerphq 756/s 3673% 3660% 52% 27% 3% -- -2% -3% fast_merl 768/s 3736% 3722% 54% 29% 5% 2% -- -1% fastolfe 779/s 3790% 3776% 57% 31% 7% 3% 1% --}

Yves / DeMerphq
--
Have you registered your Name Space?

sort

Good set of exhaustive tests.

Thanks for the chop modification. Nice touch. It seems to be at least 10-20% faster than split, I'll take it! I also appreciate the extensive bench compliation. Very interesting, and very informative.

I wanted to remove the split() entirely, but my knowledge of Perl isn't strong enought yet to come up with that solution. Here's a slight improvement that I came up with.

#!/usr/bin/perl

   use strict;

   my ($full, $part) = @ARGV;

   my ($left) = leftover($full, $part);

   if (defined($left)) { print "$full - $part = $left\n"; }
   else                { print "$full - $part = undef\n"; }

#-----------------------------------------------------------
sub leftover {

   my ($full, $part) = @_;
   my ($ch);
   my (@parts) = split('', $part);
   my ($regx) = join('+.*', @parts);

   if ($full =~ m{$regx}) {
       foreach $ch (@parts)
       {
           $full =~ s/[$ch]{1}//;
       }
   }
   else { $full = undef; };

   return $full;
}
[download]

Here's pseudo code for what my original idea was (maybe someone else can make it work):

sub leftover{
  my ($full, $part) = @_;
  while ($full && $part)
  {
    #RexExp to substitute nothing for the first match
    #then remove the matched character from $part
  }
}
[download]

l_jungleboy: 28 wallclock secs (28.84 usr + 0.00 sys = 28.84 CPU) @ 3550.62/s (n=102400)
leftover: 16 wallclock secs (15.87 usr + 0.00 sys = 15.87 CPU) @ 6452.43/s (n=102400)

benchmarks

Update:

use Inline C => <<'__EOC__';

SV *fast_c (char *original, char *chopped) {
        int counts[256] = {0};  /* each potential character */
        int ptr = 0;
        int buffer_size = 0;
        char *ret = NULL;
        int ret_ptr = 0;
        int error = 0;
        SV *retsv = &PL_sv_undef;

        while (original[ptr] != '\0') {
                counts[original[ptr++]]++;
                buffer_size++;
        }

        ptr = 0;

        while (!error && chopped[ptr] != '\0') {
                counts[chopped[ptr]]--;
                buffer_size--;
                if (counts[chopped[ptr++]] < 0) {
                        error++;
                }
        }

        if (!error) {
                ret = malloc(buffer_size + 1);
                for (ptr = 0; ptr <= 255; ptr++) {
                        while (counts[ptr]-- > 0) {
                                ret[ret_ptr++] = ptr;
                        }
                }
                ret[ret_ptr] = '\0';
                retsv = newSVpvn(ret, strlen(ret));
                free(ret);
        }

        return(retsv);
}

__EOC__
[download]

demerphq

use Inline C => <<'__EOC__';

SV *scan_c (char *from, char *to) {
        int f = 0;
        int t = 0;

        int from_len = strlen(from);
        int to_len = strlen(to);
        int ret_ptr = 0;
        unsigned char fc, tc;
        int error = 0;
        SV *retsv;
        char *ret;

        if (!from_len || !to_len) return(&PL_sv_undef);
        ret = malloc(from_len > to_len ? from_len+1 : to_len+1);

        while(!error) {
                fc = from[f];
                tc = to[t];

                if (fc == tc) {
                        f++; t++;
                        if (to[t] && (to[t] != tc)) {
                                while (from[f] == fc) {
                                        f++;
                                        ret[ret_ptr++] = fc;
                                }
                        }
                        if (t == to_len) error = 1;
                } else if (!fc || (fc < tc)) {
                        ret[ret_ptr++] = fc;
                        f++;
                        if (f >= from_len) error = 2;
                } else {
                        error = 2;
                }
        }

        if (error < 2) {
                while(f <= from_len) {
                        ret[ret_ptr++] = from[f++];
                }
                retsv = newSVpvn(ret, strlen(ret));
        } else {
                retsv = &PL_sv_undef;
        }
        free(ret);
        return retsv;
}

__EOC__
[download]

Fastolfe

Thanks alot, it looks like you just provided me an excuse to learn something new...

Yves / DeMerphq
--
Have you registered your Name Space?

I'm no XS or C wizard by any stretch, but it is fun to play with it in situations like this.

This method certainly won't win any speed prizes. And I'm not even sure why I did it -- except just to see it work. It relies on the fact that any composite number can be factored but one way into its constituent primes. We begin by assigning each letter of the alphabet a unique prime number. Each unordered string of letters -- a set, really -- can be represented as the product of the primes those letters represent. The difference between two strings, then, will be the quotient of the first product divided by the second, factored into its constituent primes and converted back to letters. This is assuming that the numbers evenly divide, i.e. that the remainder of the division is zero. If not, the string difference is undefined, because it means the second string has letters in it not contained in the first.

The following program illustrates the technique:

use Math::BigInt;
@Primes = (2,3,5,7,11,13,17,19,23,29,31,37,41,
    43,47,53,59,61,67,71,73,79,83,89,97,101);
@AsciiPrimes = ((0) x 65, @Primes, (0) x 6, @Primes, (0) x 133);

print StringDif('abcdgoldfish', 'flash'),"\n";
print StringDif('lmnogoldfish', 'dish'),"\n";
print StringDif('osar', 'oar'),"\n";

sub StringDif {
    my ($StrA, $StrB) = @_;
    my $A = Math::BigInt->new(1);
    my $B = Math::BigInt->new(1);
    my $Quot, $Rem, $NewQuot, $ReturnStr;
    foreach (unpack 'C*', $StrA) {$A *= $AsciiPrimes[$_]}
    foreach (unpack 'C*', $StrB) {$B *= $AsciiPrimes[$_]}
    if ($A eq '+0' or $B eq '+0') {
        warn "Non-letter in string"; return undef
    }
    ($Quot, $Rem) = Math::BigInt->new($A)->bdiv($B);
    return undef unless $Rem eq '+0';
    $ReturnStr = '';
    foreach('a'..'z') {
        do {
            ($NewQuot, $Rem) = Math::BigInt->new($Quot)
                ->bdiv($AsciiPrimes[ord $_]);
            if ($Rem eq '+0') {
                $ReturnStr .= $_;
                $Quot = $NewQuot
            }
        } while $Rem eq '+0' 
    }
    return $ReturnStr
}
[download]