(chromatic) Re: diff (different enough)
by chromatic (Archbishop) on Jul 21, 2000 at 01:41 UTC
|
Instead of splitting the two strings, why not use substr?
for my $i (0 .. length($str1)) {
if (substr($str1, $i, 1) ne substr($str2, $i, 1)) {
$val++;
}
}
That split is bound to be very expensive, both in terms of time and memory.
You could optimize that further, by looking at five letters at a time:
my $step = 5;
my $bound = int (length($str1) / $step);
for my $i (0 .. $bound) {
if (substr($str1, ($i * $step), $step) ne (substr($str2, ($i * $st
+ep), $step)) {
$val++;
}
}
Update: If you need to update $val for each character that's different, do something like this:
for my $i (0 .. $bound) {
my $substr1 = substr($str1, ($i * $step), $step);
my $substr2 = substr($str2, ($i * $step), $step);
if ($substr1 ne $substr2) {
for my $j (0 .. ($step - 1)) {
if (substr($substr1, $j, 1) ne
substr($substr2, $j, 1)) {
$val++;
}
}
}
}
Uglier, yes, but still faster than split. | [reply] [d/l] [select] |
|
I have verified that with your new additions the
&diff is about 80% faster! Thanx a bunch. Course, it's
still abysmally slow, but I'm startin' to think I might
need C to do much better.
| [reply] |
Re: diff (different enough)
by lhoward (Vicar) on Jul 21, 2000 at 03:08 UTC
|
I think the place where you could really get some
performance enhancements is in the outer loop.
From the sounds of things right now you are
comparing every element in your set to every other
element in the set (O(n^2)). With a large dataset like
you have that will be slow
no matter how fast the difference algorithm is.
If you
could use some sort of hueristic to divide the data into
subsets, and only compare the values in the subsets to
each other. Maybe only comparing every string to the other
strings in the set that are within N characters of its length.
i.e. if your base string is 10 characters long and you have
2 characters of slack, then only compare your string to the
8 thru 12 character long strings in the original dataset.
Is the data you are analyzing really arbitrary
"n-char strings" or it is something more structured like
english text? If it is english text you could use something like Text::Metaphone
or Text::Soundex to start off and only comparing things that have
simimlar soundex (or metaphone) values.
If your data is some other type of data with a semi-regular
structure (names, mailing addresses, e-mail addresses, etc..)
there are other good hueristics you could use to
prevent a BFI cartesian-product comparison as you are doing now.
While I'm thinking about this, are you loading the whole
dataset into memory and processing it at-once? If not
and you have the memory to load the whole dataset
into RAM then doing so will give you a
significant performance increase. | [reply] |
Re: diff (different enough)
by japhy (Canon) on Jul 21, 2000 at 17:13 UTC
|
# $m = 5
# $a = "ABCDEFGHIJKL"
# $b = "ACHDEFIHAJLK"
# $diff = 0
# $xor_ed = "\000\001\013\000\000\000\016\000\010\000\007\007"
sub mdiff {
my ($m,$a,$b) = @_;
my ($la,$lb) = (length $a, length $b);
my $xor_ed;
my $diff = abs($lb - $la);
return 1 if $diff >= $m;
return 1 if ($diff + ($xor_ed = $a ^ $b) =~ tr/\0//c) >= $m;
}
Bitwise XOR in strings is a beautiful thing. The number of NULs in the XOR-ed string (\0) will be the number of same characters. The opposite is the number of different characters, so I tr/\0//c. | [reply] [d/l] |
|
that is pretty cool, but for both length()s you have to go
through the whole string, and for the tr//c, you have to
compare many of the chars again. I tried this one...
it's about 3 times slower than my naive string comparison
in C. But it's still a bunch faster than Chromatic's
solution up there.
| [reply] |
|
jeffp@hut [10:44am] ~ #104> perl -MDevel::Peek
$str = "something";
Dump($str);
SV = PV(0xa3424) at 0xb4fc0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb9600 "something"\0
CUR = 9
LEN = 10
So Perl determines the string length when it's set (which would be when you created the string originally), and uses that value when you call length(). But you're right that you have to go through the entire string when you use tr///. I admit that sucks a bit. | [reply] [d/l] [select] |
|
| [reply] |
Re: diff (different enough)
by btrott (Parson) on Jul 21, 2000 at 01:38 UTC
|
Perhaps take a look at Algorithm::Diff? You could
split up your strings, then pass those sequences to the
diff function, which should return the diffs. Perhaps
you can use that array returned to determine how different
the original strings were?
Here's some code:
use Algorithm::Diff qw/diff/;
my $ref1 = [ split //, "foo bar" ];
my $ref2 = [ split //, "foo baz" ];
my @diffs = diff $ref1, $ref2;
| [reply] [d/l] |
|
I think we decided below that splits are very expensive.
Which makes it very expensive to use that Differ.
It also returns a great deal more info than I need.
Which makes me think that it probably takes longer than
the substr methods below.
| [reply] |
Re: diff (different enough)
by fundflow (Chaplain) on Jul 21, 2000 at 05:18 UTC
|
Here is a quick (=unchecked) one. It could be optimized a bit more but
this probably gives the idea. I don't know if there's a
better Perl way than using substr.
$l = $#str1>$#str2 ? $#str2 : $#str1;
$val = $l-$#str1 + $l-$#str2;
for $i ( 1..$l ) {
$val += substr($str1, $i, 1) ne substr($str2, $i, 1);
last if ($val > $max_diff);
}
return ($val > $max_diff);
| [reply] [d/l] |
RE: diff (different enough)
by Yaakov (Novice) on Aug 09, 2000 at 16:32 UTC
|
Edit:Sorry, I should have read the answers
at the top first. This post doesn't tell anything new.
The following one-line function counts the number of
differences in two strings of equal length really quickly:
sub cdiff {my($a,$b,$c)=@_;($c=$a^$b)=~tr/\0//c;}
For strings of different length it produces the correct
result unless the longer string contains "\0" characters.
I am sure you can add the length comparison if you care
about this possibility.
How does it work?
First of all, read the sections in perlman:perlop about
the ^ and the tr/// operators (you can search the long page
for "XORed" and "stars" and find the right lines).
Now, you understand that $c=$a^$b XORs the
two strings as long bit sequences: A bit in $c
is 0 when the corresponding bits in $a and
$b are equal. Otherwise, the bit is 1.
A byte in $c has eight zeros (that is
"\0") when the corresponding bytes in
$a and $b are equal. Otherwise,
the byte is something else.
We just need to count the bytes in $c that are
not "\0". The tr/\0//c operator
does just this.
Now, this operator really wants to change the string where
it counts (it even thinks that this is his purpose). So,
we need a dummy variable $c that can be
changed.
That's it. | [reply] [d/l] [select] |
Re: diff (different enough)
by fundflow (Chaplain) on Jul 21, 2000 at 17:36 UTC
|
Very nice solution japhy,
I was looking for a way to vectorize the checking instead of
the for-loop. (otherwise equality has the same cost as xor,
when measured in cpu cycles).
How much faster will it be if perl used MMX
for that operation? | [reply] |
RE: diff (different enough)
by Anonymous Monk on Jul 21, 2000 at 19:15 UTC
|
I don't know if it would help any, but if you have to diff a million strings, why not split um up into subgroups and fork a process to diff each subgroup (if you aren't doing that already)...
I mean if you're forced to brute force something, why not do it the right way, fork fork fork! :)
my humble opinion :)
- Magius_AR | [reply] |
|
| [reply] |
|
We also considered splitting it up. We couldn't reason
out the interprocess communication properly. In actual
use, this algorithm get's run every time we insert a new
string. So we couldn't figure out how to reliably make
sure the kids were all working on mdifferent strings--I mean,
we coulda ... we just thought it was a dead end.
Unless there _a lot_ of kids (on lotsa computers), it just
didn't sound useful.
| [reply] |
Re: diff (different enough)
by Anonymous Monk on Jul 21, 2000 at 09:12 UTC
|
I have come up with a final solution.
I decided that the fastest way to solve this one is
to not use perl. This is the very first problem I've
encountered since I started with perl that is easier
to code in C. ;) Usually it's the other way around.
Well, here's my final solution:
MDiff
MDiff-0.8.tar.gz | [reply] |
|
Oh I hate that ... that was me, but I forgot to
log in first. :( O, well eh?
| [reply] |