Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re: Comparison by position and value

by sgifford (Prior)
on Jan 04, 2005 at 08:39 UTC ( #419179=note: print w/replies, xml ) Need Help??


in reply to Comparison by position and value

Here's a solution that is very fast and manages to use string-wise boolean operations. It uses a data structure that's a little bit complicated, but since converting all the data structures is linear to the number of strings you have and you're doing the comparisons many more times than that, I think it will still offer a significant speedup.

The basic idea is this:

It's easy and fast to check whether there are different digits at the same position; you can use some kind of a mask to ignore the _'s, and then just compare the masked strings with eq. To compare two strings $a and $b, following aristotle's example we generate a mask with tr/\0/\377/c, then test:
($a & $b_mask) eq ($b & $a_mask)

It's checking whether a character appears elsewhere in the string that's slow; there's not a straightforward way to do it without looking at each character individually. But if we have a string telling at which position each character occurs, we can use a similar masking technique to see whether the same character occurs in different positions in two strings. We can get this by doing something similar to transposing a one-dimensional matrix.

Let me illustrate with an example.

If we have the string _8__3__19, we can "transpose" it into the string _7_4____18, meaning that 0, 2, 4, 5, 6, and 7 don't appear in the string at all; 1 appears at position 7, 3 appears at position 4, 8 appears at position 1, and 9 appears at position 8. Similarly, we can "transpose" 48____7__ to ____0__61_.

Now we can use the same mask-and-compare technique aristotle gave us to compare the "transposed" values, and get a very fast answer. This works because if any digit appears in only one string, it will be masked out; if it appears in both, it must be at the same position, so after masking away everything that appears in only one of the strings, what's left must be equal.

Since we can compare the strings and the "transposed" strings the same way, we can simply concatenate them together and store them with their masks for the data structure. I used an arrayref with [$orig_str, $str_and_xposed, $mask], with $str_and_xposed having underscores changed to \0.

With this data structure, the entire test becomes:

# a and mask[b] eq b and mask[a] ($_[0][1] & $_[1][2]) eq ($_[1][1] & $_[0][2]);

Here's how it benchmarks. simple3 is a simple substr-based implementation, csimple3 is an Inline::C implementation of simple3, clever2 is the above code, clever3 is the same code but doing the data structure transformations beforehand, cclever3 is an Inline::C implementation of clever3. I've actually benchmarked several other promising solutions in this thread, and this is by far the fastest.

Benchmark: timing 25000 iterations of cclever3, clever2, clever3, csim +ple3, simple3... cclever3: 1 wallclock secs ( 0.95 usr + 0.00 sys = 0.95 CPU) @ 26315.79/s (n=25000) clever2: 10 wallclock secs ( 8.73 usr + 0.02 sys = 8.75 CPU) @ 2857.14/s (n=25000) clever3: 1 wallclock secs ( 1.15 usr + 0.00 sys = 1.15 CPU) @ 21739.13/s (n=25000) csimple3: 1 wallclock secs ( 1.04 usr + -0.01 sys = 1.03 CPU) @ 24271.84/s (n=25000) simple3: 5 wallclock secs ( 3.97 usr + 0.01 sys = 3.98 CPU) @ 6281.41/s (n=25000)

Here's the code I ran:

#!/usr/bin/perl use Benchmark; # Remove the following paragraph and the benchmarks for csimple3 and # cclever3 if you don't have Inline::C. use Inline C => 'DATA', VERSION => 0.0, NAME => 'SimpleTest', OPTIMIZE => '-O3'; Inline->init; sub simple3 { my($a,$b)=@_; my(@seen); return undef if (length($a) != length($b)); foreach my $i (0..length($a)) { my($ac,$bc)=(substr($a,$i,1),substr($b,$i,1)); if ($ac eq $bc) { ; # Do nothing } elsif ($ac eq '_') { return undef if (++$seen[$bc] > 1); } elsif ($bc eq '_') { return undef if (++$seen[$ac] > 1); } else { return undef } } 1; } # Represent each string as two strings and two masks. sub clever2 { # Data transformations; could be done beforehand in linear time. # Underscores become \0 for easier masking and comparison (my $a = $_[0]) =~ tr/_/\0/; (my $b = $_[1]) =~ tr/_/\0/; # Compute a "transposed" string showing the position of each charact +er, # leaving \0 if it doesn't appear. my($a3,$b3)=("\0"x10,"\0"x10); foreach my $i (0..(length($a)-1)) { my $c = substr($a,$i,1); next if $c eq "\0"; substr($a3,$c,1)=$i; } foreach my $i (0..(length($b)-1)) { my $c = substr($b,$i,1); next if $c eq "\0"; substr($b3,$c,1)=$i; } # Concatenate together my $a_new = $a . $a3; my $b_new = $b . $b3; # Compute the mask (my $a_mask = $a_new) =~ tr/\0/\xff/c; (my $b_mask = $b_new) =~ tr/\0/\xff/c; # Comparisons; must be done for each comparison. return (($a_new & $b_mask) eq ($b_new & $a_mask)); } # Same as clever2, but with data structure precomputed. sub clever3 { # a and mask[b] eq b and mask[a] ($_[0][1] & $_[1][2]) eq ($_[1][1] & $_[0][2]); } # Precompute the data structure for clever3 sub clever3_xform { # Underscore to \0 for easier masking and comparison (my $a = $_[0]) =~ tr/_/\0/; # Compute a "transposed" string showing the position of each charact +er, # leaving \0 if it doesn't appear. my($a3)="\0"x10; foreach my $i (0..(length($a)-1)) { my $c = substr($a,$i,1); next if $c eq "\0"; substr($a3,$c,1)=$i; } # Concatenate my $a_new = $a . $a3; # Calculate the mask (my $a_mask = $a_new) =~ tr/\0/\xff/c; # Return an arrayref; this is the data structure we'll use in clever +3. return [$_[0],$a_new,$a_mask]; } my @tests = (qw/ _8__3__19 48____7__ _8__3__19 4_2___7__ _8__3__19 4_8___7__ __8_3__19 48____7__ __8_3__19 84____7__ _8__3__19 48_____7_ / ); sub run_tests { my($func,$verbose,@tests)=@_; my ($s1, $s2); while (defined($s1 = shift(@tests))) { $s2 = shift(@tests); my $result = $func->($s1, $s2); if ($verbose) { if (ref($s1) && ref($s2)) { $s1 = $s1->[0]; $s2 = $s2->[0]; } print "$s1\n$s2: ",$result?"compatible":"not compatible","\n"; } } } my @tests_clever3 = map { clever3_xform($_) } @tests; run_tests(\&clever3, 1, @tests_clever3); timethese(25_000, { simple3 => sub { run_tests(\&simple3, 0, @tests) }, csimple3 => sub { run_tests(\&csimple3, 0, @tests) }, clever2 => sub { run_tests(\&clever2, 0, @tests) }, clever3 => sub { run_tests(\&clever3, 0, @tests_cleve +r3) }, cclever3 => sub { run_tests(\&cclever3, 0, @tests_cle +ver3) }, }); __DATA__ __C__ int csimple3(const char *a, const char *b) { int i; int l; unsigned char seen[256]; memset(seen,0,256); if ((l=strlen(a)) != strlen(b)) return 0; for(i=0;i<l;i++) { if (a[i] == b[i]) { ; /* Do nothing */ } else if (a[i] == '_') { if (++seen[b[i]] > 1) return 0; } else if (b[i] == '_') { if (++seen[a[i]] > 1) return 0; } else return 0; } return 1; } int cclever3(SV *a, SV *b) { AV *a_arr, *b_arr; SV **tmp; char *a_val, *a_mask, *b_val, *b_mask; int i; /* First get the arrays from the references */ if (!SvROK(a) || !SvROK(b)) croak("a or b not arrayrefs!"); a_arr = (AV*)SvRV(a); b_arr = (AV*)SvRV(b); /* Now pull out the data */ if ( (tmp = av_fetch(a_arr, 1, 0)) == NULL) croak("a[1] is undef"); if ((a_val = SvPV(*tmp, PL_na)) == NULL) croak("a[1] contains NULL pointer?"); if ( (tmp = av_fetch(a_arr, 2, 0)) == NULL) croak("a[2] is undef"); if ((a_mask = SvPV(*tmp, PL_na)) == NULL) croak("a[2] contains NULL pointer?"); if ( (tmp = av_fetch(b_arr, 1, 0)) == NULL) croak("b[1] is undef"); if ((b_val = SvPV(*tmp, PL_na)) == NULL) croak("b[1] contains NULL pointer?"); if ( (tmp = av_fetch(b_arr, 2, 0)) == NULL) croak("b[2] is undef"); if ((b_mask = SvPV(*tmp, PL_na)) == NULL) croak("b[2] contains NULL pointer?"); /* OK, finally we have all of the data! */ for(i=0;i<20;i++) { if ((a_val[i] & b_mask[i]) != (b_val[i] & a_mask[i])) return 0; } return 1; }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://419179]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2020-10-31 18:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (291 votes). Check out past polls.

    Notices?