Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

diff (different enough)

by jettero (Monsignor)
on Jul 21, 2000 at 01:34 UTC ( [id://23496]=perlquestion: print w/replies, xml ) Need Help??

jettero has asked for the wisdom of the Perl Monks concerning the following question:

I have to compare several million n-char strings. I have to make sure they're m-different (we call it different enough). I wrote this... but it's abysmally slow, _please_ help me.
sub diff {
    my ( $str1, $str2 )  = @_;
    my $val;
    ($str1, $str2) = ($str2, $str1) if length($str1) > length($str2);
    $val = length($str2) - length($str1);
    return 1 if ( $val > $min_diff );
    my @str1 = split //, $str1;
    my @str2 = split //, $str2;
    for my $i ( 0..$#str1 ) {
        if ( $str1[$i] ne $str2[$i] ) {
            $val++;
        }
        return 1 if ( $val > $min_diff );
    }
    return 0;
}

Replies are listed 'Best First'.
(chromatic) Re: diff (different enough)
by chromatic (Archbishop) on Jul 21, 2000 at 01:41 UTC
    Instead of splitting the two strings, why not use substr?
    for my $i (0 .. length($str1)) { if (substr($str1, $i, 1) ne substr($str2, $i, 1)) { $val++; } }
    That split is bound to be very expensive, both in terms of time and memory.

    You could optimize that further, by looking at five letters at a time:

    my $step = 5; my $bound = int (length($str1) / $step); for my $i (0 .. $bound) { if (substr($str1, ($i * $step), $step) ne (substr($str2, ($i * $st +ep), $step)) { $val++; } }

    Update: If you need to update $val for each character that's different, do something like this:

    for my $i (0 .. $bound) { my $substr1 = substr($str1, ($i * $step), $step); my $substr2 = substr($str2, ($i * $step), $step); if ($substr1 ne $substr2) { for my $j (0 .. ($step - 1)) { if (substr($substr1, $j, 1) ne substr($substr2, $j, 1)) { $val++; } } } }
    Uglier, yes, but still faster than split.
      I have verified that with your new additions the &diff is about 80% faster! Thanx a bunch. Course, it's still abysmally slow, but I'm startin' to think I might need C to do much better.
Re: diff (different enough)
by lhoward (Vicar) on Jul 21, 2000 at 03:08 UTC
    I think the place where you could really get some performance enhancements is in the outer loop. From the sounds of things right now you are comparing every element in your set to every other element in the set (O(n^2)). With a large dataset like you have that will be slow no matter how fast the difference algorithm is.

    If you could use some sort of hueristic to divide the data into subsets, and only compare the values in the subsets to each other. Maybe only comparing every string to the other strings in the set that are within N characters of its length. i.e. if your base string is 10 characters long and you have 2 characters of slack, then only compare your string to the 8 thru 12 character long strings in the original dataset.

    Is the data you are analyzing really arbitrary "n-char strings" or it is something more structured like english text? If it is english text you could use something like Text::Metaphone or Text::Soundex to start off and only comparing things that have simimlar soundex (or metaphone) values. If your data is some other type of data with a semi-regular structure (names, mailing addresses, e-mail addresses, etc..) there are other good hueristics you could use to prevent a BFI cartesian-product comparison as you are doing now.

    While I'm thinking about this, are you loading the whole dataset into memory and processing it at-once? If not and you have the memory to load the whole dataset into RAM then doing so will give you a significant performance increase.

Re: diff (different enough)
by japhy (Canon) on Jul 21, 2000 at 17:13 UTC
    I'd use XOR logic.
    # $m = 5 # $a = "ABCDEFGHIJKL" # $b = "ACHDEFIHAJLK" # $diff = 0 # $xor_ed = "\000\001\013\000\000\000\016\000\010\000\007\007" sub mdiff { my ($m,$a,$b) = @_; my ($la,$lb) = (length $a, length $b); my $xor_ed; my $diff = abs($lb - $la); return 1 if $diff >= $m; return 1 if ($diff + ($xor_ed = $a ^ $b) =~ tr/\0//c) >= $m; }
    Bitwise XOR in strings is a beautiful thing. The number of NULs in the XOR-ed string (\0) will be the number of same characters. The opposite is the number of different characters, so I tr/\0//c.
      that is pretty cool, but for both length()s you have to go through the whole string, and for the tr//c, you have to compare many of the chars again. I tried this one... it's about 3 times slower than my naive string comparison in C. But it's still a bunch faster than Chromatic's solution up there.
        Um, if you mean doing length($str) requires a call to strlen(), you're wrong. From 'perlguts':
        You can get and set the current length of the string stored in an SV with the following macros:
        SvCUR(SV*) SvCUR_set(SV*, I32 val)
        And checking with Devel::Peek:
        jeffp@hut [10:44am] ~ #104> perl -MDevel::Peek $str = "something"; Dump($str); SV = PV(0xa3424) at 0xb4fc0 REFCNT = 1 FLAGS = (POK,pPOK) PV = 0xb9600 "something"\0 CUR = 9 LEN = 10
        So Perl determines the string length when it's set (which would be when you created the string originally), and uses that value when you call length(). But you're right that you have to go through the entire string when you use tr///. I admit that sucks a bit.

        FYI, unlike C's <char *>, Perl has real strings that know how long they are. So <length> is as fast as getting the integer value of a scalar that already has its integer value cached.

Re: diff (different enough)
by btrott (Parson) on Jul 21, 2000 at 01:38 UTC
    Perhaps take a look at Algorithm::Diff? You could split up your strings, then pass those sequences to the diff function, which should return the diffs. Perhaps you can use that array returned to determine how different the original strings were?

    Here's some code:

    use Algorithm::Diff qw/diff/; my $ref1 = [ split //, "foo bar" ]; my $ref2 = [ split //, "foo baz" ]; my @diffs = diff $ref1, $ref2;
      I think we decided below that splits are very expensive. Which makes it very expensive to use that Differ. It also returns a great deal more info than I need. Which makes me think that it probably takes longer than the substr methods below.
Re: diff (different enough)
by fundflow (Chaplain) on Jul 21, 2000 at 05:18 UTC
    Here is a quick (=unchecked) one. It could be optimized a bit more but this probably gives the idea. I don't know if there's a better Perl way than using substr.
    $l = $#str1>$#str2 ? $#str2 : $#str1; $val = $l-$#str1 + $l-$#str2; for $i ( 1..$l ) { $val += substr($str1, $i, 1) ne substr($str2, $i, 1); last if ($val > $max_diff); } return ($val > $max_diff);
RE: diff (different enough)
by Yaakov (Novice) on Aug 09, 2000 at 16:32 UTC
    Edit:Sorry, I should have read the answers at the top first. This post doesn't tell anything new.

    The following one-line function counts the number of differences in two strings of equal length really quickly:

    sub cdiff {my($a,$b,$c)=@_;($c=$a^$b)=~tr/\0//c;}

    For strings of different length it produces the correct result unless the longer string contains "\0" characters. I am sure you can add the length comparison if you care about this possibility.

    How does it work?

    First of all, read the sections in perlman:perlop about the ^ and the tr/// operators (you can search the long page for "XORed" and "stars" and find the right lines).

    Now, you understand that $c=$a^$b XORs the two strings as long bit sequences: A bit in $c is 0 when the corresponding bits in $a and $b are equal. Otherwise, the bit is 1.

    A byte in $c has eight zeros (that is "\0") when the corresponding bytes in $a and $b are equal. Otherwise, the byte is something else.

    We just need to count the bytes in $c that are not "\0". The tr/\0//c operator does just this.

    Now, this operator really wants to change the string where it counts (it even thinks that this is his purpose). So, we need a dummy variable $c that can be changed.

    That's it.

Re: diff (different enough)
by fundflow (Chaplain) on Jul 21, 2000 at 17:36 UTC
    Very nice solution japhy,
    I was looking for a way to vectorize the checking instead of the for-loop. (otherwise equality has the same cost as xor, when measured in cpu cycles).
    How much faster will it be if perl used MMX for that operation?
RE: diff (different enough)
by Anonymous Monk on Jul 21, 2000 at 19:15 UTC
    I don't know if it would help any, but if you have to diff a million strings, why not split um up into subgroups and fork a process to diff each subgroup (if you aren't doing that already)... I mean if you're forced to brute force something, why not do it the right way, fork fork fork! :) my humble opinion :) - Magius_AR
      At some point, the divide and conquer approach has so much overhead that the simple solutions are better. fork() is especially expensive, as once you get a few processes out there, context switching between them will take more time than plodding along in a for loop (which will probably optimize to a handful of machine code expressions anyway).

      I really like the xor solution someone else pointed out, and you could probably do the 'x characters at a time' on it, too.

        We also considered splitting it up. We couldn't reason out the interprocess communication properly. In actual use, this algorithm get's run every time we insert a new string. So we couldn't figure out how to reliably make sure the kids were all working on mdifferent strings--I mean, we coulda ... we just thought it was a dead end. Unless there _a lot_ of kids (on lotsa computers), it just didn't sound useful.
Re: diff (different enough)
by Anonymous Monk on Jul 21, 2000 at 09:12 UTC
    I have come up with a final solution. I decided that the fastest way to solve this one is to not use perl. This is the very first problem I've encountered since I started with perl that is easier to code in C. ;) Usually it's the other way around. Well, here's my final solution: MDiff MDiff-0.8.tar.gz
      Oh I hate that ... that was me, but I forgot to log in first. :( O, well eh?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://23496]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2024-04-18 00:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found