http://www.perlmonks.org?node_id=225333

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'm looking for suggestions to optimize the following task:
Given two words, find how many characters they share. For instance, "help" and "temp" share 2 characters, "monk" and "perl" share 0, etc...
First, I implemented the following simple version:
sub score {
    my ($word1, $word2) = @_;

    # return a special value (defined somewhere in the file)
    # if we have an exact match
    return $words_equal if ($word1 eq $word2);

    my @chars1 = split(//, $word1);
    my @chars2 = split(//, $word2);
    my $count = 0;
    foreach $a (@chars1) {
        foreach $b (@chars2) {
            if ($a eq $b) {
                $count++;
                last;
            }
        }
    }
    return $count;
}
Then, I tried optimizing it. The following is a regexp version:
sub score2 {
    my ($word1, $word2) = @_;
    return $words_equal if ($word1 eq $word2);

    my @chars1 = split(//, $word1);
    my $count = 0;
    foreach $a (@chars1) {
        if ($word2 =~ /$a/) {
            $count++;
            last;
        }
    }
    return $count;
}
Benchmarking proved that the second version is about 50% slower than the first & simple one.
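A minimal sketch of how such a comparison might be set up with the core Benchmark module (the word pairs here are just made up; score and score2 are the subs above):

use Benchmark qw(cmpthese);
my @pairs = ( ['help', 'temp'], ['monk', 'perl'], ['stress', 'super'] );
cmpthese(-2, {
    simple => sub { score($_->[0],  $_->[1]) for @pairs },
    regexp => sub { score2($_->[0], $_->[1]) for @pairs },
});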

Any ideas on how this can be made faster?

TIA, Spurperl

Replies are listed 'Best First'.
Re: Optimizing a string processing sub
by dragonchild (Archbishop) on Jan 08, 2003 at 19:49 UTC
    Try something like this:
    sub score {
        my ($x, $y) = @_;
        return $words_equal if $x eq $y;
        return 0 unless $x && $y;
        @{[$x =~ /[$y]/g]};
    }
    Basically, I'm treating $y as a character class to match on $x.

    Now - there is a major assumption in this code - neither word has a character twice. Thus, 'perl' and 'temp' is ok. But, 'perl' and 'etemp' isn't, and neither is 'eperl' and 'temp'. Depending on which is first, the answer will be different. (The reason is that each character in $y is treated once, but each character in $x is counted as many times as it appears.)
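    For illustration (my example call, using the first sub above in scalar context), the order dependence looks like this:

    print scalar score('perl', 'etemp'), "\n";   # 2: 'p' and 'e' of 'perl' fall in the class [etemp]
    print scalar score('etemp', 'perl'), "\n";   # 3: 'e', 'e' and 'p' of 'etemp' fall in the class [perl]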

    I justify this because you don't specify how to deal with repeated characters. If you are looking for just the number of unique characters they share, regardless of how many times each appears in either word, then do something like:

    sub score {
        # Same as above, until:
        my $cnt1 = @{[$x =~ /[$y]/g]};
        my $cnt2 = @{[$y =~ /[$x]/g]};
        $cnt1 < $cnt2 ? $cnt1 : $cnt2;
    }

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Unfortunately (unfortunate because it is clever and short) -- I believe your solution is flawed.

      Using the words 'aabcc' and 'abbbc', your code suggests that the words have 5 characters in common. The original poster was vague about the requirements; however, I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

      With this in mind, I suggest that your first example is correct if the goal is to determine the number of characters in one word that also appear in the other (probably not what the original poster wanted to see), and that the second does not do what you expect.

      Good luck tweaking your solution... :-)

        I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

        The case where the words are equal is a special one in the original code and results in $words_equal being returned. Given that fact, I think that 5 would be expected.

        -sauoq
        "My two cents aren't worth a dime.";
        
        You are right, 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5 letter words means that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5.

        Sorry for being vague about the requirements... They really are vague! Maybe examples can help:

        aabcc, abbbc -> 3 (once a, once b and once c, that's it)
        stress, super -> 3 (once s, e and r)
        abcde, caebd -> 5 (permutation)
        abcde, caebdxxy -> 5 (doesn't change things)

      I like your approach.

      Your first example does exactly what his does with one exception: it won't behave the same in list context.

      I suspect that your second example is really the desired behavior anyway (as it is commutative.)

      -sauoq
      "My two cents aren't worth a dime.";
      
      Many thanks, dragonchild, ++.

      Your code works correctly and is faster.
      Could you please explain how exactly it works? (the @{[ part)
        The @{[]} is a trick to force list context. List context is important because of what the regex operator returns: in list context, it returns a list of what was matched; in scalar context, it returns whether or not it matched. (In theory, that should be the number of things matched as well, but I couldn't get it to work.)

        So, by forcing list context, I get the list of things matched. Then, by converting the list to scalar context, I get the number of things in the list.

        I'm sure there's a more elegant way than creating two array references on the fly, but that's what I had in 60sec of imperfect memory.
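        A small illustration of the trick, in case it helps (the example words are mine):

        my ($x, $y) = ('perl', 'temp');
        my $count = @{[ $x =~ /[$y]/g ]};   # matches gathered in list context, then counted in scalar context
        print "$count\n";                   # prints 2 ('p' and 'e')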

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      I was thinking along similar lines, but using tr/// in its count-them mode:

      $cnt1 = eval "\$x =~ tr/$y//";

      Hmm, the tr doesn't do interpolation. So it trades the @{[ for an eval.
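      For example (my test values, assuming $y contains no tr/// metacharacters):

      my ($x, $y) = ('perl', 'temp');
      my $cnt1 = eval "\$x =~ tr/$y//";   # builds and runs: $x =~ tr/temp//
      print "$cnt1\n";                    # prints 2 ('p' and 'e')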

Re: Optimizing a string processing sub
by bart (Canon) on Jan 08, 2003 at 21:25 UTC
    Do you mean in the same position, or just the same characters? Your examples suggest the former, but your code the latter. Well, for the former, simple enough, this would be my idea:
    ($word1 ^ $word2) =~ tr/\0//
    Do a bitwise XOR of the words. Wherever they have the same letter, the resulting character will be a "\0". Next, count these occurrences.
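    A quick sketch of that idea on one of the original example pairs (the surrounding lines are mine; it assumes the words are the same length):

    my ($word1, $word2) = ('help', 'temp');
    my $same_pos = ($word1 ^ $word2) =~ tr/\0//;
    print "$same_pos\n";   # prints 2: 'e' and 'p' sit in the same positions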

    As for the latter, even though it looks simpler as a requirement, as an implementation, it is not. So I'll take a completely different approach.

    my %bag;
    my $total = 0;
    foreach (split '', $word1) {
        $bag{$_}++;
    }
    foreach (split '', $word2) {
        if ($bag{$_}) {
            $bag{$_}--;
            $total++;
        }
    }
    return $total;
    In English: cut the first word into characters and put them in a bag. For each character of the second word, take a matching character out of the bag, provided at least one is left. The result is the number of characters taken from the bag.
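    Packaged as a sub (the name score_bag is mine) and checked against the examples spurperl gives in his follow-up above:

    sub score_bag {
        my ($word1, $word2) = @_;
        my %bag;
        $bag{$_}++ for split '', $word1;
        my $total = 0;
        for (split '', $word2) {
            if ($bag{$_}) { $bag{$_}--; $total++ }
        }
        return $total;
    }
    print score_bag('aabcc', 'abbbc'), "\n";     # 3
    print score_bag('stress', 'super'), "\n";    # 3
    print score_bag('abcde', 'caebdxxy'), "\n";  # 5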
      I meant the latter, and your solution works well (see my 2nd reply to dragonchild above for some clarifications).
      It also feels "right". The tricky case is when some char X appears once in word2 but twice in word1... Using a hash as a bag is a clever solution to that problem. I've been trying to run your algorithm on some examples, in my head, and haven't broken it yet.
      It also runs about 50% faster than my original version.
Re: Optimizing a string processing sub
by sauoq (Abbot) on Jan 08, 2003 at 21:18 UTC

    Here's another way to do it:

    sub score {
        my ($x, $y) = @_;
        return $words_equal if ($x eq $y);
        my $c = 0;
        index($y, $_) + 1 && $c++ for split //, $x;
        return $c;
    }

    The real question is whether your code is actually doing what you expect. Your function is not commutative. For example, score('quuux', 'quack') does not return the same thing as score('quack', 'quuux') does. (The first is 4, the second is 2.)
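    For instance (my demonstration calls, using the sub as posted):

    print score('quuux', 'quack'), "\n";   # 4
    print score('quack', 'quuux'), "\n";   # 2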

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Optimizing a string processing sub
by blokhead (Monsignor) on Jan 08, 2003 at 19:56 UTC
    Not sure if this is faster, but it might be, as it doesn't use nested loops.
    sub score {
        my ($word1, $word2) = @_;
        my (%chars1, %chars2) = ();
        $chars1{$_}++ for split '', $word1;
        $chars2{$_}++ for split '', $word2;

        # the minimum of the two hashes is the number in common for each letter
        my $sum = 0;
        $sum += ($chars1{$_} < $chars2{$_} ? $chars1{$_} : $chars2{$_})
            for keys %chars1;
        return $sum;
    }

    while (<DATA>) {
        chomp;
        print "$_: " . score(split /\s+/) . " in common\n";
    }

    __DATA__
    perl monk
    help temp
    frood hoopy
    bilbo baggins
    jibber jabber
    This prints:
    perl monk: 0 in common
    help temp: 2 in common
    frood hoopy: 2 in common
    bilbo baggins: 2 in common
    jibber jabber: 5 in common

    I notice your algorithms give 3 matches for 'bilbo' and 'baggins'. I think this is because both 'b's in bilbo match inside baggins. I'm not sure if this is correct behavior by your specifications or not.

    Update: To speed up your score2 sub, consider using index($word2, $a) >= 0 instead of the regex match. Changing this alone made it approximately as fast as your initial score sub for me.

    blokhead

Re: Optimizing a string processing sub
by boo_radley (Parson) on Jan 08, 2003 at 20:57 UTC
    This is a variation on a FAQ -- see perlfaq4, "How can I count the number of occurrences of a substring within a string?".
    Using a little eval trickery:
    foreach (<DATA>) {
        my ($w1, $w2) = split / /, $_;
        my %common = ();
        foreach ( split //, $w1 ) {
            eval "\$foo=\$w2=~tr/$_/$_/" unless exists $common{$_};
            $common{$_} = 1 if $foo;
        }
        print "$w1 and $w2 have ", join (", ", sort keys %common), " in common\n";
    }
    __DATA__
    perl monk
    help temp
    frood hoopysdf
    bilbo baggins
    jibber jabber

    I wonder about your algorithm in score2, though. It seems to return at most 1 regardless of actual matches, because of the last. Perhaps you need a larger / more complex word set?
Re: Optimizing a string processing sub
by Fletch (Bishop) on Jan 08, 2003 at 19:32 UTC
    use Inline => C => 'EOT'; ...

    Update: Sarcasm aside, the problem statement is somewhat ambiguous. Do you mean have the same characters in the same positions, or just have the same characters somewhere in them (e.g. is 'foop' and 'poof' a 2 or a 4 (or a 3 if multiple occurrences of the same character only count once))?

      Here's your ...:

      use Inline C => <<EOT;
      UV score(char* a, char* b) {
          /* it doesn't like const... */
          char mark[256] = { 0 }, * i;
          UV sc = 0;
          for(i = a; *i; mark[*i++] = 1);
          for(i = b; *i; i++)
              if(mark[*i]) sc += mark[*i]--;
          return sc;
      }
      EOT

      Yep, you're right, it won't work for Unicode. Modification to do so is left as an exercise for the diligent student.
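      A hypothetical usage line or two, once the Inline section above has compiled (the example words are mine):

      print score('aabcc', 'abbbc'), "\n";   # prints 3 (a, b and c each counted once)
      print score('help', 'temp'), "\n";     # prints 2 (e and p)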

      czth

Re: Optimizing a string processing sub
by clairudjinn (Beadle) on Jan 09, 2003 at 06:21 UTC
    Untested. The idea was to extend the functionality to as many words as are passed, and to get a return value that is a bit more informative than just the number of shared characters. Is the idea OK, code aside?
    $class = '[a-z]';
    @commonChars = compareChars( $class );

    sub compareChars {
        $regex = shift;
        $requiredCount = @ARGV;
        for ( $i = 0; $i <= $#ARGV; $i++ ) {
            while ( $ARGV[$i] =~ m/($regex)/gi ) {
                $found{$1} = $i + 1;
            }
        }
        while ( ( $key, $value ) = each %found ) {
            push @answer, $key if $value == $requiredCount;
        }
        return @answer;
    }

      Your code matches the first occurrence of [a-z] in every element of @ARGV, and returns characters that occur at least as many times as there are elements in @ARGV. Sorry -- this is nothing like the original requirements.

      P.S. Helpful hint: Even if you are only determining whether the concept you are trying for is correct, any posted code that is not in the form of pseudo-code should be tested. You would have found that your code did not meet the requirements without having had to ask. Cheers.

      UPDATE: As per the followup by Anonymous Monk, I did miss the inner while(){}. The only line that appears to still be wrong is the line that reads "$found{$1} = $i + 1;". The effect appears to be that only characters that show up in the last element of @ARGV will be returned. A small adjustment that would perhaps allow this code to work would be to replace the faulty line with "$found{$1}++;". One good addition that this code suggests is a case-insensitive match. In order for this to work, however, the faulty line will need to read "$found{lc $1}++;" to ensure that the correct hash element is incremented.

        As far as I understand it, for a character to be common to all words being scanned, it has to occur at least once in each word. If @ARGV contains 3 words to be scanned, for example, then we are only interested in characters that occur a minimum of once per word for all three words, or three times. That's why, when a character is found in a given word for the first time, its "score" is incremented by 1 to bring the cumulative score to the word number + 1 (since arrays start at 0). Other occurrences of the same character in the same word are effectively ignored since we don't care. Only characters that score 3 are returned. I think this theory does satisfy the original requirements actually, even if the code itself is buggy...
      No idea if this is faster, but I just wanted to post a version of my first code that works, tested, for any number of arguments:
      #!/usr/bin/perl
      use strict;
      use warnings;

      my $charClass = '[a-z]'; # change as desired
      my @commonChars = compareChars();
      print "Num common chars: ", scalar @commonChars, "\n";

      ### sub assumes case insensitivity is appropriate ###
      sub compareChars {
          my $requiredCount = @ARGV;
          my %found;
          my @answer;
          for my $word ( @ARGV ) {
              my %nonredundantChars = map { $_ => 1 } split //, $word;
              $word = join '', keys %nonredundantChars;
              $found{lc $1}++ while ( $word =~ m/($charClass)/gi );
          }
          while ( ( my $char, my $count ) = each %found ) {
              push @answer, $char if $count == $requiredCount;
          }
          return @answer;
      }
Re: Optimizing a string processing sub
by hardburn (Abbot) on Jan 09, 2003 at 15:06 UTC

    Benchmarking proved that the second version is about 50% slower than the first & simple one.

    Interesting. I know regexes are expensive, but I expect the second version to scale better. If the number of chars in $word1 is N1, and the number of chars in $word2 is N2, then I think the first version would be O(N1*N2) -- pretty bad. The second one should be simply O(N1), though it depends on how the regex is implemented internally.

Re: Optimizing a string processing sub
by Dr. Mu (Hermit) on Jan 09, 2003 at 18:25 UTC
    I'm not sure this is any faster for the two-word case. It does seem simpler, though:
    use strict;

    while (<DATA>) {
        chomp;
        print "$_: ", compare(split / /, $_), " in common.\n"
    }

    sub compare {
        my ($word1, $word2) = @_;
        my $n = length($word2);
        $word2 =~ s/$_// foreach split //, $word1;
        return $n - length($word2)
    }

    __DATA__
    perl monk
    help temp
    frood hoopy
    bilbo baggins
    jibber jaber
    aabcc abbbc
    It works by cycling through $word1 one character at a time and removing the first occurrence of that character from $word2. When finished, $word2 will be shorter by the number of matches with $word1.