Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Difference Of Two Strings

by YuckFoo (Abbot)
on Nov 03, 2001 at 04:34 UTC ( [id://122959]=perlquestion: print w/replies, xml ) Need Help??

YuckFoo has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks,

I am optimizing (for speed) my anagram script. I want to optimize this 'leftover' subroutine. It returns a string with characters that are in the first string that are not matched in the second string. Both strings are sorted. If the second string contains characters that can't be matched in the first string, undef is returned.

acciimmnnnnoootu - acimnotu = cimnnnoo
acciimmnnnnoootu - acimnotuu = undef

Spliting both strings to arrays first seems a bit brutish and expensive, but I can't find a faster way. Any ideas?

Thanks!

#!/usr/bin/perl use strict; my ($full, $part) = @ARGV; my ($left) = leftover($full, $part); if (defined($left)) { print "$full - $part = $left\n"; } else { print "$full - $part = undef\n"; } #----------------------------------------------------------- sub leftover { my ($full, $part) = @_; my ($ch, $left); my ($regx) = join('+.*', (split('', $part))); if ($full =~ m{$regx}) { my (@fulls) = split('', $full); my (@parts) = split('', $part); while ($ch = shift(@fulls)) { ($ch eq $parts[0]) ? shift(@parts) : ($left .= $ch); } } else { $left = undef; }; return $left; }

Replies are listed 'Best First'.
Re: Difference Of Two Strings
by merlyn (Sage) on Nov 03, 2001 at 05:16 UTC
    Untested for speed, but it's functional:
    sub leftover { my ($full, $part) = @_; my %count; $count{$_}++ for split //, $full; for (split //, $part) { return undef if --$count{$_} < 0; } return join "", map { $_ x $count{$_} } keys %count; }

    -- Randal L. Schwartz, Perl hacker

      Winner by a nose! Thanks merlyn.

      Actually it's a lot better. When I saw how quickly this will return undef, I realized my benchmarking sucks. I have been benching only on successful matches.

      On success:

      l_merlyn: 16 wallclock secs (15.60 usr + 0.00 sys = 15.60 CPU) @ 6564.10/s (n=102400)
      leftover: 16 wallclock secs (15.74 usr + 0.00 sys = 15.74 CPU) @ 6505.72/s (n=102400)

      On failure:

      l_merlyn: 1 wallclock secs ( 1.11 usr + 0.00 sys = 1.11 CPU) @ 9225.23/s (n=10240)
      leftover: 45 wallclock secs (45.12 usr + 0.00 sys = 45.12 CPU) @ 226.95/s (n=10240)

      I should have remembered how slow !~ m// can be.
      Definitely the optimization I was looking for!

        You might also try this one for speed, which avoids building the list unnecessarily, and uses arrays instead of hashes to avoid the hashing algorithm for just a single character:
        sub leftover { my ($full, $part) = @_; my @count; $count[ord $1]++ while $full =~ /(\w)/g; while ($part =~ /(\w)/g) { return undef if --$count[ord $1] < 0; } my $return; $count[$_] and $return .= chr($_) x $count[$_] for 0..$#count; $return; }

        -- Randal L. Schwartz, Perl hacker

Re: Difference Of Two Strings
by blakem (Monsignor) on Nov 03, 2001 at 04:43 UTC
    I tried to come up with a solution using tr/// since its much faster, but had to bring in the s/// operator instead...
    #!/usr/bin/perl -wT use strict; no warnings "uninitialized"; # printing undef triggers a stupid warni +ng # testing loop for my $pair (['aabbcc', 'abc'], ['aabbccdd', 'abcdd'], ['abc', 'abc'], ['abc', '123'], ) { my $left = leftover(@$pair); printf ("%8s - %-6s => %-5s\n",@$pair,"'$left'"); } # scrabble "subtraction" subroutine sub leftover { my $string = shift; my $letters = shift; $string =~ s/$_// || return for split // => $letters; return $string; } =OUTPUT aabbcc - abc => 'abc' aabbccdd - abcdd => 'abc' abc - abc => '' abc - 123 => ''
    You'll have to be careful in your usage of this sub, since your specs state nearly opposite meanings for the return values of '' and undef.

    Update: added || return to test for unmatched chars.

    -Blake

      Thanks for the new approach, I hadn't tried this one. Unfortunately it's quite a bit slower, and doesn't check for characters in the second string not matched in the first.

      leftover: 16 wallclock secs (15.80 usr + 0.00 sys = 15.80 CPU) @ 6481.01/s (n=102400)
      l_blakem: 26 wallclock secs (26.07 usr + 0.00 sys = 26.07 CPU) @ 3927.89/s (n=102400)

        I updated it slightly to check for the extra chars....

        -Blake

Re: Difference Of Two Strings
by runrig (Abbot) on Nov 03, 2001 at 04:51 UTC
    my $str="abcd"; my $letters = "ab"; print not_in($str, $letters),"\n"; sub not_in { my $str = shift; my $letters = shift; my (%hash1, %hash2); my @arr1 = split //, $str; @hash1{@arr1} = @arr1; my @arr2 = split //, $letters; my @deleted = grep $_, delete @hash1{@arr2}; @hash2{@arr2} = undef; delete @hash2{@deleted}; return if %hash2; join '', keys %hash1; }
    Updated to more accurately reflect the requirements :-)
    Actually, this one doesn't remove characters one for one as it seems you want to. If a character appears in the second arg, it removes all of that character in the first arg. That's not what you wanted is it??
      Thanks runrig, I didn't even try the hash method. I thought it would be too slow. I was surprised how quick it was.

      l_runrig: 17 wallclock secs (17.23 usr + 0.00 sys = 17.23 CPU) @ 5943.12/s (n=102400)
      leftover: 16 wallclock secs (15.86 usr + 0.00 sys = 15.86 CPU) @ 6456.49/s (n=102400)

Re: Difference Of Two Strings (complete benchmarks)
by Fastolfe (Vicar) on Nov 03, 2001 at 05:33 UTC
    Another implementation for you. This one seems pretty speedy (update: optimized and benchmarks updated with everyone else's updates):
    sub fastolfe { my $source = shift; my $chop = shift; local($_); my %found; $found{$_}++ while ($_ = chop($source)) ne ''; while (($_ = chop($chop)) ne '') { return if --$found{$_} < 0; } my $result; foreach (sort keys %found) { $result .= $_ while $found{$_}--; } $result; }
    Hybridizing the above with merlyn's version, we can squeeze out a bit more speed:
    sub fast_merl { my ($source, $chop) = @_; local($_); my %found; $found{$_}++ while ($_ = chop($source)) ne ''; while (($_ = chop($chop)) ne '') { return if --$found{$_} < 0; } return join "", map { $_ x $found{$_} } keys %found; }
    Benchmarking most of the versions I see thus far (minus a couple of the less interesting ones, because these benchmarks are getting big), using test input from above (2 success, 2 fail) as well as demerphq's tests below:
    # demerphq's test set Rate blakem merlyn demq_scan demerphq fastolfe fast_merl +fast_c scan_c blakem 479/s -- -24% -39% -44% -46% -47% + -97% -98% merlyn 629/s 31% -- -20% -27% -29% -30% + -96% -97% demq_scan 790/s 65% 26% -- -8% -11% -12% + -95% -97% demerphq 861/s 80% 37% 9% -- -3% -4% + -95% -96% fastolfe 887/s 85% 41% 12% 3% -- -1% + -94% -96% fast_merl 901/s 88% 43% 14% 5% 2% -- + -94% -96% fast_c 15708/s 3179% 2396% 1889% 1723% 1671% 1644% + -- -31% scan_c 22648/s 4627% 3499% 2767% 2529% 2453% 2415% + 44% -- # simple success case Rate blakem merlyn demerphq fastolfe fast_merl demq_scan +fast_c scan_c blakem 6244/s -- -14% -26% -32% -35% -43% + -90% -93% merlyn 7221/s 16% -- -14% -21% -24% -34% + -89% -92% demerphq 8429/s 35% 17% -- -8% -12% -23% + -87% -91% fastolfe 9181/s 47% 27% 9% -- -4% -16% + -86% -90% fast_merl 9563/s 53% 32% 13% 4% -- -12% + -85% -90% demq_scan 10908/s 75% 51% 29% 19% 14% -- + -83% -88% fast_c 65634/s 951% 809% 679% 615% 586% 502% + -- -28% scan_c 91428/s 1364% 1166% 985% 896% 856% 738% + 39% -- # simple failure case Rate blakem merlyn demerphq demq_scan fastolfe fast_merl + scan_c fast_c blakem 7759/s -- -39% -51% -60% -63% -63% + -94% -94% merlyn 12666/s 63% -- -20% -35% -39% -40% + -89% -91% demerphq 15783/s 103% 25% -- -19% -24% -26% + -87% -89% demq_scan 19581/s 152% 55% 24% -- -5% -8% + -84% -86% fastolfe 20720/s 167% 64% 31% 6% -- -2% + -83% -85% fast_merl 21209/s 173% 67% 34% 8% 2% -- + -82% -85% scan_c 119642/s 1442% 845% 658% 511% 477% 464% + -- -14% fast_c 139171/s 1694% 999% 782% 611% 572% 556% + 16% --
    Source: http://fastolfe.net/transient/2001/11/02/pm.string.difference.bench
      UPDATED to reflect Fastolfes reply.

      Heh. While I was setting up my benchmark there everyone else was as well. I came up with a different set of results. First jungleboy, runrig and yuckfoo failed outright on some of my test cases. Second both merlyn and fast_merl failed because the resulting set of letters are out of order, but when I put a sort clause in as was recommended they passed fine. Yours was the fastest. Only yours, mine and blakem succeded outright.

      # Updated: Removed unecessary debug code, minor tidy. use strict; use warnings; use Benchmark 'cmpthese'; #----------------------------------------------------------- our %subs = ( yuckfoo => sub { my ( $full, $part ) = @_; my $left = ""; my ($regx) = join ( '+.*', ( split ( '', $part ) ) ); if ( $full =~ m{$regx} ) { my (@fulls) = split ( '', $full ); my (@parts) = split ( '', $part ); while ( my $ch = shift (@fulls) ) { ( @parts && $ch eq $parts[0] ) ? shift (@parts) : ( $left .= $ch ); } } else { $left = "__undef__"; } return $left; }, jungleboy => sub { my ( $full, $part ) = @_; my ($ch); my (@parts) = split ( '', $part ); my ($regx) = join ( '+.*', @parts ); if ( $full =~ m{$regx} ) { foreach $ch (@parts) { $full =~ s/[$ch]{1}//; } } else { $full = "__undef__"; } return $full; }, demerphq => sub { my ( $from, $to ) = @_; my %ltrs; $ltrs{ substr( $from, $_, 1 ) }++ foreach 0 .. length($from) - + 1; --$ltrs{ substr( $to, $_, 1 ) } < 0 && return "__undef__" foreach 0 .. length($to) - 1; return join ( "", sort map { $_ x $ltrs{$_} } keys %ltrs ); }, demq_scan => sub { my ( $from, $to ) = @_; my $ret = ""; my ( $f, $t ) = ( 0, 0 ); while (1) { my ( $fc, $tc ) = ( substr( $from, $f, 1 ), substr( $to, $ +t, 1 ) ); if ( $fc eq $tc ) { $t++; $f++; if ( substr( $to, $t, 1 ) ne $tc ) { $f++, $ret .= $fc while substr( $from, $f, 1 ) eq +$fc; } last if $t == length($to); } elsif ( $fc lt $tc ) { $ret .= $fc; $f++; return "__undef__" if $f >= length $from; } else { return "__undef__"; } } return $ret . substr( $from, $f ); }, blakem => sub { my $string = shift; my $letters = shift; $string =~ s/$_// || return "__undef__" for split // => $lette +rs; return $string; }, fastolfe => sub { my $source = shift; my $chop = shift; local ($_); my %found; $found{$_}++ while ( $_ = chop($source) ) ne ''; while ( ( $_ = chop($chop) ) ne '' ) { return "__undef__" if --$found{$_} < 0; } my $result = ""; #fixed demerphq foreach ( sort keys %found ) { $result .= $_ while $found{$_}--; } $result; }, merlyn => sub { my ( $full, $part ) = @_; my %count; $count{$_}++ for split //, $full; for ( split //, $part ) { return "__undef__" if --$count{$_} < 0; } return join "", map { $_ x $count{$_} } sort keys %count; }, fast_merl => sub { my ( $source, $chop ) = @_; local ($_); my %found; $found{$_}++ while ( $_ = chop($source) ) ne ''; while ( ( $_ = chop($chop) ) ne '' ) { return "__undef__" if --$found{$_} < 0; } return join "", map { $_ x $found{$_} } sort keys %found; } # Not sure why it fails.. # runrig => sub { # my $str = shift; # my $letters = shift; # my ( %hash1, %hash2 ); # my @arr1 = split //, $str; # @hash1{@arr1} = @arr1; # my @arr2 = split //, $letters; # my @deleted = grep $_, delete @hash1{@arr2}; # @hash2{@arr2} = @arr2; # delete @hash2{@deleted}; # return "__undef__" if %hash2; # join '', keys %hash1; # }, ); sub test { my ($sub) = @_; foreach my $t ( # tests [0] - [1] = [2] [qw"ab a b"], [qw"acciimmnnnnoootu acimnotu cimnnnoo"], [qw"acciimmnnnnoootu acimnotuu __undef__"], [qw"ab ab", ""], [qw"aaaaaaabbbbbbbcccccccddddddddde e aaaaaaabbbbbbbcccccccddd +dddddd"], [qw"aaaaaaabbbbbbbcccccccddddddddde aaaaaaabbbbbbbcccccccddddd +dddd e"], [qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa +aabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxy z"], [qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa +bbbbbbcccccddddddddghijklmnopqrstuvwxy aabccdefz"], [qw"abbcccdde bccd abcde"], [qw"abbcccdde bccdf __undef__"], [qw"bbcccdde abccdf __undef__"], #uncomment me to kill the regex versions #[qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxz aaaaa +bbbbbbcccccddddddddghijklmnopqrstuvwxy __undef__"], ) { #warn "$sub @$t\n"; my $r = $subs{$sub}->( $t->[0], $t->[1] ); $r = ( !defined($r) ? "undef" : $r ); #warn "'$r'\n"; die "in $sub expected $t->[2] got $r from $t->[0] - $t->[1]" if $t->[2] ne $r; } } my %tests = map { $_ => "test($_)" } keys %subs; cmpthese( -10, \%tests ); __END__
      Benchmark: running blakem, demerphq, demerphq_scan, fast_merl, fastolfe, jungleboy, merlyn, yuckfoo, 
      each for at least 10 CPU seconds...
          blakem: 11 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 497.48/s (n=5231)
        demerphq: 11 wallclock secs (10.72 usr +  0.00 sys = 10.72 CPU) @ 755.88/s (n=8100)
       demq_scan: 11 wallclock secs (10.57 usr +  0.00 sys = 10.57 CPU) @ 731.63/s (n=7737)
       fast_merl: 10 wallclock secs (10.61 usr +  0.00 sys = 10.61 CPU) @ 768.34/s (n=8149)
        fastolfe: 10 wallclock secs (10.55 usr +  0.00 sys = 10.55 CPU) @ 779.23/s (n=8217)
       jungleboy: 10 wallclock secs (10.18 usr +  0.00 sys = 10.18 CPU) @ 20.03/s (n=204)
          merlyn: 10 wallclock secs (10.56 usr +  0.00 sys = 10.56 CPU) @ 595.40/s (n=6285)
         yuckfoo: 10 wallclock secs (10.50 usr +  0.00 sys = 10.50 CPU) @ 20.10/s (n=211)
                      Rate jungleboy yuckfoo blakem merlyn demerphq_scan demerphq fast_merl fastolfe
      jungleboy     20.0/s        --     -0%   -96%   -97%          -97%     -97%      -97%     -97%
      yuckfoo       20.1/s        0%      --   -96%   -97%          -97%     -97%      -97%     -97%
      blakem         497/s     2383%   2374%     --   -16%          -32%     -34%      -35%     -36%
      merlyn         595/s     2872%   2861%    20%     --          -19%     -21%      -23%     -24%
      demq_scan      732/s     3552%   3539%    47%    23%            --      -3%       -5%      -6%
      demerphq       756/s     3673%   3660%    52%    27%            3%       --       -2%      -3%
      fast_merl      768/s     3736%   3722%    54%    29%            5%       2%        --      -1%
      fastolfe       779/s     3790%   3776%    57%    31%            7%       3%        1%       --
      

      Yves / DeMerphq
      --
      Have you registered your Name Space?

        Yah I took out the sort in fast_merl because the requirements didn't state the returned string had to be sorted, just that the input was. The sort has a very slight performance penalty. It's pretty easy to put back.

        Good set of exhaustive tests.

      Thanks for the chop modification. Nice touch. It seems to be at least 10-20% faster than split, I'll take it! I also appreciate the extensive bench compliation. Very interesting, and very informative.
Re: Difference Of Two Strings
by JungleBoy (Scribe) on Nov 03, 2001 at 05:06 UTC

    I wanted to remove the split() entirely, but my knowledge of Perl isn't strong enought yet to come up with that solution. Here's a slight improvement that I came up with.

    #!/usr/bin/perl use strict; my ($full, $part) = @ARGV; my ($left) = leftover($full, $part); if (defined($left)) { print "$full - $part = $left\n"; } else { print "$full - $part = undef\n"; } #----------------------------------------------------------- sub leftover { my ($full, $part) = @_; my ($ch); my (@parts) = split('', $part); my ($regx) = join('+.*', @parts); if ($full =~ m{$regx}) { foreach $ch (@parts) { $full =~ s/[$ch]{1}//; } } else { $full = undef; }; return $full; }

    Here's pseudo code for what my original idea was (maybe someone else can make it work):

    sub leftover{ my ($full, $part) = @_; while ($full && $part) { #RexExp to substitute nothing for the first match #then remove the matched character from $part } }
      Seems anything with s// in a loop doesn't stand a chance.
      I played with @- and @+ from s//, but these too, are expensive.
      I think this is what you are talking about in the pseudocode?

      l_jungleboy: 28 wallclock secs (28.84 usr + 0.00 sys = 28.84 CPU) @ 3550.62/s (n=102400)
      leftover: 16 wallclock secs (15.87 usr + 0.00 sys = 15.87 CPU) @ 6452.43/s (n=102400)

Re: Difference Of Two Strings (in C)
by Fastolfe (Vicar) on Nov 04, 2001 at 00:38 UTC
    And lastly, since it's speed we're after, a C implementation can't hurt. This benchmarks orders of magnitude faster than anything seen yet. (Update: fixed memory leak)
    use Inline C => <<'__EOC__'; SV *fast_c (char *original, char *chopped) { int counts[256] = {0}; /* each potential character */ int ptr = 0; int buffer_size = 0; char *ret = NULL; int ret_ptr = 0; int error = 0; SV *retsv = &PL_sv_undef; while (original[ptr] != '\0') { counts[original[ptr++]]++; buffer_size++; } ptr = 0; while (!error && chopped[ptr] != '\0') { counts[chopped[ptr]]--; buffer_size--; if (counts[chopped[ptr++]] < 0) { error++; } } if (!error) { ret = malloc(buffer_size + 1); for (ptr = 0; ptr <= 255; ptr++) { while (counts[ptr]-- > 0) { ret[ret_ptr++] = ptr; } } ret[ret_ptr] = '\0'; retsv = newSVpvn(ret, strlen(ret)); free(ret); } return(retsv); } __EOC__
    And then here's a C implementation of demerphq's "scanning" method, which doesn't rely on counting up letters. This one's even faster:
    use Inline C => <<'__EOC__'; SV *scan_c (char *from, char *to) { int f = 0; int t = 0; int from_len = strlen(from); int to_len = strlen(to); int ret_ptr = 0; unsigned char fc, tc; int error = 0; SV *retsv; char *ret; if (!from_len || !to_len) return(&PL_sv_undef); ret = malloc(from_len > to_len ? from_len+1 : to_len+1); while(!error) { fc = from[f]; tc = to[t]; if (fc == tc) { f++; t++; if (to[t] && (to[t] != tc)) { while (from[f] == fc) { f++; ret[ret_ptr++] = fc; } } if (t == to_len) error = 1; } else if (!fc || (fc < tc)) { ret[ret_ptr++] = fc; f++; if (f >= from_len) error = 2; } else { error = 2; } } if (error < 2) { while(f <= from_len) { ret[ret_ptr++] = from[f++]; } retsv = newSVpvn(ret, strlen(ret)); } else { retsv = &PL_sv_undef; } free(ret); return retsv; } __EOC__
      Thats really cool Fastolfe its neat to see my code converted to C. Im really going to have to bone up my skills in that area.

      Thanks alot, it looks like you just provided me an excuse to learn something new...

      Yves / DeMerphq
      --
      Have you registered your Name Space?

        "converted to C"... poorly. :)

        I'm no XS or C wizard by any stretch, but it is fun to play with it in situations like this.

Re: Difference Of Two Strings
by Dr. Mu (Hermit) on Nov 05, 2001 at 09:29 UTC
    And now for something a little different...

    This method certainly won't win any speed prizes. And I'm not even sure why I did it -- except just to see it work. It relies on the fact that any composite number can be factored but one way into its constituent primes. We begin by assigning each letter of the alphabet a unique prime number. Each unordered string of letters -- a set, really -- can be represented as the product of the primes those letters represent. The difference between two strings, then, will be the quotient of the first product divided by the second, factored into its constituent primes and converted back to letters. This is assuming that the numbers evenly divide, i.e. that the remainder of the division is zero. If not, the string difference is undefined, because it means the second string has letters in it not contained in the first.

    The following program illustrates the technique:

    use Math::BigInt; @Primes = (2,3,5,7,11,13,17,19,23,29,31,37,41, 43,47,53,59,61,67,71,73,79,83,89,97,101); @AsciiPrimes = ((0) x 65, @Primes, (0) x 6, @Primes, (0) x 133); print StringDif('abcdgoldfish', 'flash'),"\n"; print StringDif('lmnogoldfish', 'dish'),"\n"; print StringDif('osar', 'oar'),"\n"; sub StringDif { my ($StrA, $StrB) = @_; my $A = Math::BigInt->new(1); my $B = Math::BigInt->new(1); my $Quot, $Rem, $NewQuot, $ReturnStr; foreach (unpack 'C*', $StrA) {$A *= $AsciiPrimes[$_]} foreach (unpack 'C*', $StrB) {$B *= $AsciiPrimes[$_]} if ($A eq '+0' or $B eq '+0') { warn "Non-letter in string"; return undef } ($Quot, $Rem) = Math::BigInt->new($A)->bdiv($B); return undef unless $Rem eq '+0'; $ReturnStr = ''; foreach('a'..'z') { do { ($NewQuot, $Rem) = Math::BigInt->new($Quot) ->bdiv($AsciiPrimes[ord $_]); if ($Rem eq '+0') { $ReturnStr .= $_; $Quot = $NewQuot } } while $Rem eq '+0' } return $ReturnStr }
    Note that, due to the use of Math::BigInt, strings can be as long as you like. Also, even though the input strings don't have to be sorted, the output string will be.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://122959]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (7)
As of 2024-04-23 10:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found