PerlMonks  

Optimizing a string processing sub

by spurperl (Priest)
on Jan 08, 2003 at 19:23 UTC ( #225333=perlquestion )
spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'm looking for suggestions to optimize the following task:
Given two words, find how many characters they share. For instance, "help" and "temp" share 2 characters, "monk" and "perl" share 0, etc...
First, I implemented the following simple version:
sub score {
    my ($word1, $word2) = @_;

    # return a special value (defined somewhere in the file)
    # if we have an exact match
    return $words_equal if ($word1 eq $word2);

    my @chars1 = split(//, $word1);
    my @chars2 = split(//, $word2);
    my $count = 0;

    foreach $a (@chars1) {
        foreach $b (@chars2) {
            if ($a eq $b) {
                $count++;
                last;
            }
        }
    }
    return $count;
}
Then, I tried optimizing it. The following is a regexp version:
sub score2 {
    my ($word1, $word2) = @_;
    return $words_equal if ($word1 eq $word2);

    my @chars1 = split(//, $word1);
    my $count = 0;

    foreach $a (@chars1) {
        if ($word2 =~ /$a/) {
            $count++;
            last;
        }
    }
    return $count;
}
Benchmarking showed that the second version is about 50% slower than the first, simple one.

Any ideas on how this can be made faster?

TIA, Spurperl

Re: Optimizing a string processing sub
by Fletch (Chancellor) on Jan 08, 2003 at 19:32 UTC
    use Inline => C => 'EOT'; ...

    Update: Sarcasm aside, the problem statement is somewhat ambiguous. Do you mean the words have the same characters in the same positions, or merely have the same characters somewhere in them (e.g. do 'foop' and 'poof' score 2 or 4 (or 3, if multiple occurrences of the same character only count once))?

      Here's your ...:

      use Inline C => <<EOT;
      UV score(char* a, char* b) {  /* it doesn't like const... */
          char mark[256] = { 0 }, * i;
          UV sc = 0;
          for (i = a; *i; mark[*i++] = 1);
          for (i = b; *i; i++)
              if (mark[*i])
                  sc += mark[*i]--;
          return sc;
      }
      EOT

      Yep, you're right, it won't work for Unicode. Modification to do so is left as an exercise for the diligent student.

      czth

Re: Optimizing a string processing sub
by dragonchild (Archbishop) on Jan 08, 2003 at 19:49 UTC
    Try something like this:
    sub score {
        my ($x, $y) = @_;
        return $words_equal if $x eq $y;
        return 0 unless $x && $y;
        @{[ $x =~ /[$y]/g ]};
    }
    Basically, I'm treating $y as a character class to match on $x.

    Now - there is a major assumption in this code: neither word contains the same character twice. Thus, 'perl' and 'temp' is ok. But 'perl' and 'etemp' isn't, and neither is 'eperl' and 'temp'. Depending on which is first, the answer will be different. (The reason is that each character in $y is treated once, as part of the character class, but each character in $x is counted as many times as it appears.)
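A quick sketch (mine, not dragonchild's) of that order dependence, using the /[$y]/ class trick on the 'etemp' example:

```perl
# Hypothetical helper: count the characters of $x that fall in the
# character class built from $y. Assumes both words are plain
# letters (no regex metacharacters in $y).
sub class_score {
    my ($x, $y) = @_;
    return scalar @{[ $x =~ /[$y]/g ]};
}

print class_score('perl', 'etemp'), "\n";   # 2
print class_score('etemp', 'perl'), "\n";   # 3 -- the repeated 'e' counts twice
```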

    I justify this because you don't specify how to deal with repeated characters. If you are looking for just the number of distinct characters the words share, regardless of how many times each appears in either word, then do something like:

    sub score {
        # Same as above, until:
        my $cnt1 = @{[ $x =~ /[$y]/g ]};
        my $cnt2 = @{[ $y =~ /[$x]/g ]};
        $cnt1 < $cnt2 ? $cnt1 : $cnt2;
    }

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Many thanks, dragonchild, ++.

      Your code works correctly and is faster.
      Could you please explain exactly how it works? (the @{[ part)
        The @{[ ]} is a trick to force list context. List context matters because of what the match operator returns: in list context, it returns the list of things matched; in scalar context, it returns whether or not it matched. (In theory, that should be the number of things matched as well, but I couldn't get it to work.)

        So, by forcing list context, I get the list of things matched. Then, by converting the list to scalar context, I get the number of things in the list.

        I'm sure there's a more elegant way than creating two array references on the fly, but that's what I had in 60sec of imperfect memory.
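The trick dragonchild describes can be seen in a minimal, self-contained form (my example, not his):

```perl
my $str = 'banana';

# @{[ ... ]} wraps the list of matches in an anonymous array and
# dereferences it; assigning that to a scalar gives the element count.
my $count = @{[ $str =~ /a/g ]};
print "$count\n";   # 3

# An equivalent common idiom: the empty-list assignment forces list
# context on the match, and the outer scalar assignment counts it.
my $count2 = () = $str =~ /a/g;
print "$count2\n";  # 3
```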

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      I like your approach.

      Your first example does exactly what his does with one exception: it won't behave the same in list context.

      I suspect that your second example is really the desired behavior anyway (as it is commutative.)

      -sauoq
      "My two cents aren't worth a dime.";
      

      Unfortunately (unfortunate because it is clever and short) -- I believe your solution is flawed.

      Using the words 'aabcc' and 'abbbc', your code suggests that the words have 5 characters in common. The original poster was vague about the requirements, however, I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

      With this in mind, I suggest that your first example is correct if the goal is to count the characters of one word that also appear somewhere in the other (probably not what the original poster wanted to see), and that the second does not do what you expect.

      Good luck tweaking your solution... :-)

        I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

        The case where the words are equal is a special one in the original code and results in $words_equal being returned. Given that fact, I think that 5 would be expected.

        -sauoq
        "My two cents aren't worth a dime.";
        
        You are right: 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5-letter words should mean that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5.

        Sorry for being vague about the requirements... They really are vague ! Maybe examples can help:

        aabcc, abbbc -> 3 (once a, once b and once c, that's it)
        stress, super -> 3 (once s, e and r)
        abcde, caebd -> 5 (permutation)
        abcde, caebdxxy -> 5 (doesn't change things)
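For what it's worth, these example scores can be reproduced by taking, for each character, the minimum of its occurrence counts in the two words -- essentially what blokhead and bart do below. A sketch (my code, not from the thread):

```perl
# Hedged reference implementation of the clarified requirements:
# each shared character counts once per pairing, so take the
# minimum of its counts in the two words and sum over characters.
sub shared_chars {
    my ($w1, $w2) = @_;
    my (%c1, %c2);
    $c1{$_}++ for split //, $w1;
    $c2{$_}++ for split //, $w2;
    my $sum = 0;
    for my $ch (keys %c1) {
        next unless exists $c2{$ch};
        $sum += $c1{$ch} < $c2{$ch} ? $c1{$ch} : $c2{$ch};
    }
    return $sum;
}

print shared_chars('aabcc', 'abbbc'), "\n";     # 3
print shared_chars('stress', 'super'), "\n";    # 3
print shared_chars('abcde', 'caebdxxy'), "\n";  # 5
```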
      I was thinking along similar lines, but using tr/// in its count-them mode: my $cnt1 = eval "\$x =~ tr/$y//";

      Hmm, the tr doesn't do interpolation. So it trades the @{[ for an eval.
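In runnable form, the eval-wrapped tr/// idea looks like this (my sketch; it assumes $y contains no tr/// metacharacters such as '-' or '/'):

```perl
my ($x, $y) = ('perl', 'temp');

# tr/// doesn't interpolate variables, so the character list from $y
# has to be baked into the code string before eval compiles it. In
# counting form (empty replacement list), tr returns how many
# characters of $x are in the set.
my $cnt = eval "\$x =~ tr/$y//";
print "$cnt\n";   # 2 ('e' and 'p' appear in 'temp')
```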

Re: Optimizing a string processing sub
by blokhead (Monsignor) on Jan 08, 2003 at 19:56 UTC
    Not sure if this is faster, but it might be, as it doesn't use nested loops.
    sub score {
        my ($word1, $word2) = @_;
        my (%chars1, %chars2) = ();
        $chars1{$_}++ for split '', $word1;
        $chars2{$_}++ for split '', $word2;

        # the minimum of the two hashes is the number in common for each letter
        my $sum = 0;
        $sum += ($chars1{$_} < $chars2{$_} ? $chars1{$_} : $chars2{$_})
            for keys %chars1;
        return $sum;
    }

    while (<DATA>) {
        chomp;
        print "$_: " . score(split /\s+/) . " in common\n";
    }

    __DATA__
    perl monk
    help temp
    frood hoopy
    bilbo baggins
    jibber jabber
    This prints:
    perl monk: 0 in common
    help temp: 2 in common
    frood hoopy: 2 in common
    bilbo baggins: 2 in common
    jibber jabber: 5 in common

    I notice your algorithms give 3 matches for 'bilbo' and 'baggins'. I think this is because both 'b's in bilbo match inside baggins. I'm not sure if this is correct behavior by your specifications or not.

    Update: To speed up your score2 sub, consider using index($word2, $a) >= 0 instead of the regex match (index returns -1 when the character is absent, and 0 for a match at the very first position). Changing this alone made it approximately as fast as your initial score sub for me.
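Applying the index suggestion to score2 gives something like this (my sketch, with the stray last dropped so every character of $word1 gets tested):

```perl
sub score2_index {
    my ($word1, $word2) = @_;
    my $count = 0;
    for my $ch (split //, $word1) {
        # index() gives the 0-based position of $ch in $word2,
        # or -1 if it is absent, so compare against 0.
        $count++ if index($word2, $ch) >= 0;
    }
    return $count;
}

print score2_index('help', 'temp'), "\n";  # 2
print score2_index('monk', 'perl'), "\n";  # 0
```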

    blokhead

Re: Optimizing a string processing sub
by boo_radley (Parson) on Jan 08, 2003 at 20:57 UTC
    This is a variation on a faq -- see perlfaq4, "How can I count the number of occurrences of a substring within a string?".
    Using a little eval trickery :
    foreach (<DATA>) {
        my ($w1, $w2) = split / /, $_;
        my %common = ();
        foreach ( split //, $w1 ) {
            eval "\$foo = \$w2 =~ tr/$_/$_/" unless exists $common{$_};
            $common{$_} = 1 if $foo;
        }
        print "$w1 and $w2 have ", join (", ", sort keys %common), " in common\n";
    }
    __DATA__
    perl monk
    help temp
    frood hoopysdf
    bilbo baggins
    jibber jabber

    I wonder about your algorithm in score2, though. It seems to return at most 1 regardless of actual matches, because the last exits the loop after the first matching character. Perhaps you need a larger/more complex word set?
Re: Optimizing a string processing sub
by sauoq (Abbot) on Jan 08, 2003 at 21:18 UTC

    Here's another way to do it:

    sub score {
        my ($x, $y) = @_;
        return $words_equal if ($x eq $y);
        my $c = 0;
        index($y, $_) + 1 && $c++ for split //, $x;
        return $c;
    }

    The real question is whether your code is actually doing what you expect. Your function is not commutative. For example, score('quuux', 'quack') does not return the same thing as score('quack', 'quuux') does. (The first is 4 the second is 2.)

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Optimizing a string processing sub
by bart (Canon) on Jan 08, 2003 at 21:25 UTC
    Do you mean in the same position, or just the same characters? Your examples suggest the former, but your code the latter. Well, for the former, simple enough, this would be my idea:
    ($word1 ^ $word2) =~ tr/\0//
    Do a bitwise XOR of the words. Wherever they have the same letter in the same position, the resulting character will be a "\0". Next, count these occurrences.
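A tiny demonstration of the XOR trick (my example). Note that it compares byte-for-byte, so it suits same-length words:

```perl
# Bytes that agree XOR to "\0"; tr/\0// with an empty replacement
# list just counts how many there are.
my ($word1, $word2) = ('help', 'hemp');
my $same = ($word1 ^ $word2) =~ tr/\0//;
print "$same\n";   # 3 (h, e and p line up)
```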

    As for the latter, even though it looks simpler as a requirement, as an implementation, it is not. So I'll take a completely different approach.

    my %bag;
    my $total = 0;
    foreach (split '', $word1) {
        $bag{$_}++;
    }
    foreach (split '', $word2) {
        if ($bag{$_}) {
            $bag{$_}--;
            $total++;
        }
    }
    return $total;
    In English: cut the first word into characters. Put them in a bag. For each character in the second word, pick it out of the bag, at least, if there's at least one of it left in the bag. Count the number of characters taken from the bag.
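bart's fragment, packaged as a runnable sub (my wrapping; same algorithm):

```perl
sub bag_score {
    my ($word1, $word2) = @_;
    my %bag;
    my $total = 0;
    # Put each character of the first word into the bag...
    $bag{$_}++ for split '', $word1;
    # ...then let each character of the second word take one out,
    # if any of that character are left.
    for (split '', $word2) {
        if ($bag{$_}) {
            $bag{$_}--;
            $total++;
        }
    }
    return $total;
}

print bag_score('aabcc', 'abbbc'), "\n";   # 3
print bag_score('stress', 'super'), "\n";  # 3
```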
      I meant the latter, and your solution works well (see my 2nd reply to dragonchild above for some clarifications).
      It also feels "right". The problem case is when some char X appears once in word2 but twice in word1... Using a hash is a clever solution to that. I've been trying to run your algorithm on some examples, and in my head, and haven't broken it yet.
      It also works about 50% faster than my original version.
Re: Optimizing a string processing sub
by clairudjinn (Beadle) on Jan 09, 2003 at 06:21 UTC
    Untested. The idea was to extend the functionality to as many words as are passed, and to get a return value that is a bit more informative than just the number of shared characters. Is the idea ok, code aside?
    $class = '[a-z]';
    @commonChars = compareChars( $class );

    sub compareChars {
        $regex = shift;
        $requiredCount = @ARGV;
        for ( $i = 0; $i <= $#ARGV; $i++ ) {
            while ( $ARGV[$i] =~ m/($regex)/gi ) {
                $found{$1} = $i + 1;
            }
        }
        while ( ( $key, $value ) = each %found ) {
            push @answer, $key if $value == $requiredCount;
        }
        return @answer;
    }

      Your code matches the first occurrence of [a-z] in every element of @ARGV, and returns the characters that occur at least as many times as there are elements in @ARGV. Sorry -- this is nothing like the original requirements.

      P.S. Helpful hint: Even if you are only determining whether the concept you are trying for is correct, any posted code that is not in the form of pseudo-code should be tested. You would have found that your code did not meet the requirements without having had to ask. Cheers.

      UPDATE: As per the followup by Anonymous Monk, I did miss the inner while(){}. The only line that still appears to be wrong is the one that reads "$found{$1} = $i + 1;". Its effect appears to be that only characters that show up in the last element of @ARGV will be returned. A small adjustment that would perhaps allow this code to work would be to replace the faulty line with "$found{$1}++;". One good addition that this code suggests is a case-insensitive match; for that to work, however, the faulty line will need to read "$found{lc $1}++;" to ensure that the correct hash element is incremented.

        As far as I understand it, for a character to be common to all words being scanned, it has to occur at least once in each word. If @ARGV contains 3 words to be scanned, for example, then we are only interested in characters that occur a minimum of once per word, i.e. three times in total. That's why, when a character is found in a given word for the first time, its "score" is set to the word number + 1 (since arrays start at 0). Other occurrences of the same character in the same word are effectively ignored, since we don't care. Only characters that score 3 are returned. I think this theory does satisfy the original requirements, actually, even if the code itself is buggy...
      No idea if this is faster, but I just wanted to post a version of my first code that works, tested, for any number of arguments:
      #!/usr/bin/perl
      use strict;
      use warnings;

      my $charClass = '[a-z]'; # change as desired

      my @commonChars = compareChars();
      print "Num common chars: ", scalar @commonChars, "\n";

      ### sub assumes case insensitivity is appropriate ###
      sub compareChars {
          my $requiredCount = @ARGV;
          my %found;
          my @answer;
          for my $word ( @ARGV ) {
              my %nonredundantChars = map { $_ => 1 } split //, $word;
              $word = join '', keys %nonredundantChars;
              $found{lc $1}++ while ( $word =~ m/($charClass)/gi );
          }
          while ( my ( $char, $count ) = each %found ) {
              push @answer, $char if $count == $requiredCount;
          }
          return @answer;
      }
Re: Optimizing a string processing sub
by hardburn (Abbot) on Jan 09, 2003 at 15:06 UTC

    Benchmarking proved that the second version is about 50% slower than the first & simple one.

    Interesting. I know regexes are expensive, but I expect the second version to scale better. If the number of chars in $word1 is N1, and the number of chars in $word2 is N2, then I think the first version would be O(N1*N2) -- pretty bad. The second one should be simply O(N1), though that depends on how the regex match is implemented internally.
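The comparison is easy to measure with the core Benchmark module. Here is a sketch (my code) pitting the nested-loop version against an index()-based one; the exact ratio will vary by perl version and hardware:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Nested-loop version, as in the original post.
sub score_nested {
    my ($w1, $w2) = @_;
    my $count = 0;
    for my $a (split //, $w1) {
        for my $b (split //, $w2) {
            if ($a eq $b) { $count++; last }
        }
    }
    return $count;
}

# index()-based version, per blokhead's suggestion.
sub score_index {
    my ($w1, $w2) = @_;
    my $count = 0;
    index($w2, $_) >= 0 and $count++ for split //, $w1;
    return $count;
}

# Print relative speeds of the two implementations.
cmpthese(100_000, {
    nested => sub { score_nested('help', 'temp') },
    index  => sub { score_index('help', 'temp') },
});
```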

Re: Optimizing a string processing sub
by Dr. Mu (Hermit) on Jan 09, 2003 at 18:25 UTC
    I'm not sure this is any faster for the two-word case. It does seem simpler, though:
    use strict;

    while (<DATA>) {
        chomp;
        print "$_: ", compare(split / /, $_), " in common.\n";
    }

    sub compare {
        my ($word1, $word2) = @_;
        my $n = length($word2);
        $word2 =~ s/$_// foreach split //, $word1;
        return $n - length($word2);
    }

    __DATA__
    perl monk
    help temp
    frood hoopy
    bilbo baggins
    jibber jaber
    aabcc abbbc
    It works by cycling through $word1 one character at a time and removing the first occurrence of that character from $word2. When finished, $word2 will be shorter by the number of matches with $word1.

Node Type: perlquestion [id://225333]
Approved by IlyaM
Front-paged by Enlil