BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
Given two strings of equal length that contain digits and some placeholder value, I want a fast way to determine if the pair are 'compatible'.
Eg. Given (where underscore represents the placeholder but this could be substituted by any other non-digit value that helped the algorithm)
_8__3__19
48____7__
These two strings are compatible because no single digit appears in both strings except where it appears in the same position (8)
Whereas these two would be incompatible
_8__3__19
4_8___7__
because the digit 8 appears in both, but at a different position. And these two are incompatible
_8__3__19
48_____7_
because the second last digit in both strings contains a different value.
I know how to do this with a loop and substr and a hash, but it is rather slow and I have a set N of M sets of these strings and I want to produce a set P, where each element of P is a string that is combines one string from as many of the N sets of strings as are compatible. Given the combinatorial nature of the problem, I need to make it as quick as possible.
I keep thinking that this can be done using string-wise boolean operations, but I cannot see how?
If there is a better way of structuring the data (eg. arrays of char rather than strings) that makes the process quicker or easier that's good to.
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Re: Comparison by position and value
by ccn (Vicar) on Jan 02, 2005 at 11:29 UTC
|
#!/usr/bin/perl -wl
use strict;
sub is_compatible {
local $_ = $_[0] ^ $_[1];
return not (/[\001-\017]/ or /([\020-\031]).*?\1/s);
}
print is_compatible (' 8 3 19',
'48 7 ') ? 'yes' : 'no';
print is_compatible (' 8 3 19',
'4 8 7 ') ? 'yes' : 'no';
print is_compatible (' 8 3 19',
'48 7 ') ? 'yes' : 'no';
__END__
# output
yes
no
no
| [reply] [d/l] |
Re: Comparison by position and value
by gaal (Parson) on Jan 02, 2005 at 15:41 UTC
|
This is essentially the same as substr and a hash, but will run faster. Forgive the funky pseudocode.
sub is_compatible {
my T, B;
foreach i from 0 .. stringlen {
next if top[i] eq bottom[i]
if top[i] is a digit
return NOT COMPATIBLE if bottom[i] is a digit
T |= top[i]
if bottom[i] is a digit
B |= bottom[i]
}
return T&B ? NOT COMPATIBLE : COMPATIBLE
}
Assumptions: your "digits" are weakly fewer in number than your integer1 width. If by digit you really do mean 0 .. 9, I think this obtains on all the machines Perl runs on. In which case you can also optimize "is a digit" with low-level ASCII checks. Also, you should probably only split each string once instead of seeking into it. If this isn't fast enough, this is a good candidate for inlining in c.
1 More precisely, whatever does efficient bitwise arithmetic. | [reply] [d/l] [select] |
Re: Comparison by position and value
by aquarium (Curate) on Jan 02, 2005 at 10:23 UTC
|
| [reply] |
Re: Comparison by position and value
by ambrus (Abbot) on Jan 02, 2005 at 11:08 UTC
|
sub compatible {
my ($s1, $s2) = @_;
y/_/\0/c for $s1, $s2;
!(($s1 | $s2)=~y/\0//);
}
print(compatible("_8__3__19", "48_____7_") ? "true\n" : "false\n");
Update:
BrowserUK has pointed out that this is wrong. Indeed, it does not work if there is
the same digit in both positions of the same string.
Update: Corrected (I hope) code
sub compatible {
y/_/\477/c, y/_/\0/ for my @m = @_;
!((($_[0] ^ $_[1]) & $m[0] & $m[1]) =~ y/\0//c);
}
printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0
for [ qw[ _8__3__19 48____7__ ] ], # compat
[ qw[ _8__3__19 4_8___7__ ] ], # compat
[ qw[ _8__3__19 48_____7_ ] ]; # clash
Update: Oh, I've got it wrong again. forget this.
| [reply] [d/l] [select] |
|
#! perl -slw
use strict;
sub compatible {
my ($s1, $s2) = @_;
y/_/\0/c for $s1, $s2;
!(($s1 | $s2)=~y/\0//);
}
printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0
for [ qw[ _8__3__19 48____7__ ] ], # good
[ qw[ _8__3__19 4_8___7__ ] ], # bad
[ qw[ _8__3__19 48_____7_ ] ]; # bad
__END__
[11:18:29.65] P:\test>test
_8__3__19 v 48____7__ ? 0
_8__3__19 v 4_8___7__ ? 1
_8__3__19 v 48_____7_ ? 0
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
| [reply] [d/l] |
|
_8__3__19 v
48____7__ ? 0
^
_8__3__19 v
4_8___7__ ? 1
_8__3__19 v
48_____7_ ? 0
^ ^
For these strings at least, the result is 0 iff there is a clash of two digits somewhere.
| [reply] [d/l] |
|
Re: Comparison by position and value
by jbrugger (Parson) on Jan 02, 2005 at 15:36 UTC
|
The 'subcode' indeed would work, it looks something like:
#!/usr/bin/perl
use strict;
sub compare ($$) {
my ($a,$b) = @_;
my $ok = 0;
for (my $i=0; $i < scalar(@{$a}); $i++) {
if ( @{$a}[$i] eq @{$b}[$i] && @{$a}[$i] ne "_" ) {
#compatible...
$ok=1;
} elsif (@{$a}[$i] ne @{$b}[$i] && (@{$a}[$i] ne "_" &&
@{$b}[$i] ne "_") ) {
# totally not compatible
return 0;
}
}
$ok;
}
my @a1 = qw(_ 8 _ _ 3 _ _ 1 9);
my @a2 = qw(2 _ 8 _ _ _ 7 _ _);
my @a3 = qw(4 8 _ _ _ _ _ 7 _);
my @a4 = qw(4 8 _ _ _ _ 7 _ _);
print ( "1: " . compare(\@a1, \@a2) ."\n" );
print ( "2: " . compare(\@a1, \@a3) ."\n" );
print ( "3: " . compare(\@a1, \@a4) ."\n" );
| [reply] [d/l] |
Re: Comparison by position and value
by ambrus (Abbot) on Jan 02, 2005 at 11:21 UTC
|
sub compatible {
my ($s1, $s2) = @_;
$s1=~/\G_/gcs and $s2=~/./gcs or
$s1=~/./gcs and $s2=~/\G_/gcs or
return 0 until
$s1=~/\G\z/ or $s2=~/\G\z/;
1; }
print(compatible("_8__3__19", "4_8___7__") ? "true\n" : "false\n");
Update: this is wrong too, for the same reason as my other reply.
Wait, I'll try to post a correction.
Update:
Maybe this would work:
sub compatible {
my ($s1, $s2) = @_;
my $t;
$s1=~/\G_/gcs and $s2=~/./gcs or
$s1=~/(.)/gcs and $t = $1, $s2=~/\G_/gcs or $s2=~/(.)/gcs and
+$t eq $1 or
return 0 until
$s1=~/\G\z/ or $s2=~/\G\z/;
1;
}
printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0
for [ qw[ _8__3__19 48____7__ ] ], # compat
[ qw[ _8__3__19 4_8___7__ ] ], # compat
[ qw[ _8__3__19 48_____7_ ] ]; # clash
Update: this is wrong too.
Update:
sub compatible {
my ($s1, $s2) = @_;
my($c1, $c2, $d);
{
$s1 =~ /(.)/gs or return 1;
$c1 = $1;
$s2 =~ /(.)/gs or return 1;
$c2 = $1;
if ($c1 eq "_") {
if ($c2 eq "_")
{ }
else
{ vec($d, ord $c2, 1)++ and return; }
} else {
if ($c2 eq "_")
{ vec($d, ord $c1, 1)++ and return; }
else
{ $c1 ne $c2 and return; vec($d, ord $c2, 1)++
+ and return; }
}
redo;
}
}
printf "%s v %s ? %s\n", @$_, compatible( @$_ ) || 0
for [ qw[ _8__3__19 48____7__ ] ], # compat
[ qw[ _8__3__19 4_2___7__ ] ], # compat
[ qw[ _8__3__19 4_8___7__ ] ], # clash
[ qw[ __8_3__19 48____7__ ] ], # clash
[ qw[ __8_3__19 84____7__ ] ], # clash
[ qw[ _8__3__19 48_____7_ ] ]; # clash
| [reply] [d/l] [select] |
Re: Comparison by position and value
by ambrus (Abbot) on Jan 02, 2005 at 12:03 UTC
|
sub compatible {
my ($s1, $s2) = @_;
my $m = length($s1) - 1;
($s1 . $s2) !~ /^.{0,$m}?([^_]).{$m}(?!\1)[^_]/;
}
printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0
for [ qw[ _8__3__19 48____7__ ] ], # compat
[ qw[ _8__3__19 4_8___7__ ] ], # compat
[ qw[ _8__3__19 48_____7_ ] ]; # clash
Update:And this is wrong too.
Update:
sub compatible {
my ($s1, $s2) = @_;
my $m = length($s1) - 1;
my($n, $p) = ($m - 1, $m + 1);
($s1 . $s2) !~ /^.{0,$m}?([^_]).{$m}(?!\1)[^_]/ and
($s1 . $s2) !~ /^.{0,$m}?([^_])(?:.{0,$n}|.{$p,})\1/;
}
printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0
for [ qw[ _8__3__19 48____7__ ] ], # compat
[ qw[ _8__3__19 4_8___7__ ] ], # clash
[ qw[ __8_3__19 48____7__ ] ], # clash
[ qw[ __8_3__19 84____7__ ] ], # clash
[ qw[ _8__3__19 48_____7_ ] ]; # clash
| [reply] [d/l] [select] |
Re: Comparison by position and value
by Aristotle (Chancellor) on Jan 02, 2005 at 23:40 UTC
|
sub compatible {
my( $l, $r ) = @_;
# underscores are insignificant
tr/_/\0/ for $l, $r;
# cancel out identical values
my $xor = $l ^ $r;
# convert to bitmasks
tr/\0/\377/c for $l, $r;
my $mask = $l & $r;
# masked chars must be identical
return !1 if ( $xor & $mask ) =~ tr/\0//c;
# and there may not be dupes of non-identical characters
return 0 == grep {
my $char = substr( $xor, $_, 1 );
$char ne "\0" and index( $xor, $char, $_ + 1 ) != -1
} 0 .. length( $xor ) - 1;
}
Test suite in the readmore.
Makeshifts last the longest.
| [reply] [d/l] [select] |
|
Your code appears to fail on '__8_3__19' and '84____7__', passing this as compatible when it should not be as the digit 8 appears in both strings in different positions.
It could be that in moving your function into my test script, I have broken it, but as your post doesn't contain a stand-alone testcase that I can run, I cannot verify this conjecture.
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
| [reply] |
|
| [reply] |
|
Re: Comparison by position and value
by steves (Curate) on Jan 02, 2005 at 16:54 UTC
|
Update: I didn't read closely enough. This doesn't
account for the case where the same digit appears in
different positions in each sequence -- i.e., I'm only
comparing each individual position.
Update 2: Here's a more complicated version that
accounts for what I missed, the assumption still being
that a given digit can only appear once in a sequence.
sub compatible
{
my ($s1, $s2) = @_;
my %hashed;
my $diff = 0;
my ($one, $two);
my @counts;
my $n = 10;
#
# Hash so that each $s1 value is a key with its corresponding
# $s2 value as the hashed value.
#
@hashed{map {($_ eq '_') ? $n++ : $_} split("", $s1)} =
map {($_ eq '_') ? $n++ : $_} split("", $s2);
#
# Compare each $s1->$s2 pair
#
while (($one, $two) = each %hashed)
{
next if ($one == $two);
last if (($diff = (++$counts[$one] - 1)) > 0);
last if (($diff = (++$counts[$two] - 1)) > 0);
$diff = 0;
next if ( ($one >= 10) || ($two >= 10) );
last if (($diff = ($one - $two)) != 0);
}
print "$s1\n$s2\n", ($diff == 0) ? "compatible" : "incompatible",
+"\n\n";
}
Revised output:
_8__3__19
48____7__
compatible
_8__3__19
4_2___7__
compatible
_8__3__19
4_8___7__
incompatible
__8_3__19
48____7__
incompatible
__8_3__19
84____7__
incompatible
_8__3__19
48_____7_
incompatible
I'm not sure how fast this is compared to the other methods,
but here's a way to do it using a hash and a short circuited
loop comparison of each hashed key/value:
use strict;
sub compatible
{
my ($s1, $s2) = @_;
my %hashed;
my $diff = 0;
my ($one, $two);
#
# Hash so that each $s1 value is a key with its corresponding
# $s2 value as the hashed value.
#
@hashed{split("", $s1)} = split("", $s2);
#
# Compare each $s1->$s2 pair
#
while (($one, $two) = each %hashed)
{
next if ( ($one eq '_') || ($two eq '_') );
last if (($diff = ($one - $two)) != 0);
}
print "$s1\n$s2\n", ($diff == 0) ? "compatible" : "incompatible",
+"\n\n";
}
my @tests =
(qw/
_8__3__19
48____7__
_8__3__19
4_2___7__
_8__3__19
4_8___7__
__8_3__19
48____7__
__8_3__19
84____7__
_8__3__19
48_____7_
/
);
my ($s1, $s2);
while (defined($s1 = shift(@tests)))
{
$s2 = shift(@tests);
compatible($s1, $s2);
}
Output:
_8__3__19
48____7__
compatible
_8__3__19
4_2___7__
compatible
_8__3__19
4_8___7__
compatible
__8_3__19
48____7__
compatible
__8_3__19
84____7__
compatible
_8__3__19
48_____7_
incompatible
| [reply] [d/l] [select] |
|
I thought the following are supposed to be incompatible becuase the digit "8" occurs in both strings but at different locations:
_8__3__19
4_8___7__
compatible
__8_3__19
48____7__
compatible
__8_3__19
84____7__
compatible
| [reply] [d/l] |
Re: Comparison by position and value
by holli (Abbot) on Jan 02, 2005 at 19:56 UTC
|
that was too tough for me, so i asked a friend of mine (who is a real perl wizard but not very talkative).
my own solution was hash-based too.
he gave me this:
use strict;
use warnings;
print &compatible ("_8__3__19", "48____7__"); #c
print &compatible ("_8__3__19", "4_8___7__"); #i
print &compatible ("_8__3__19", "48_____7_"); #i
sub compatible
{
my @s=($_[0], $_[1]);
my @d;
for( 0,1 )
{
for( $d[$_]=$s[$_] )
{
tr/0-9/\0/c; #assuming the placeholder is not \0
tr/\0/\377/c;
}
}
my $m="$d[0]" & "$d[1]";
for( 0,1 )
{
$d[$_] = "$m" & "$s[$_]";
}
my $compatible;
if( $compatible = $d[0] eq $d[1] )
{
for( 0,1 )
{
$d[$_] = {};
${$d[$_]}{$1} .= "$-[0] " while $s[$_] =~ /(\d)/g;
}
$compatible &= !grep$d[1]->{$_} && $d[1]->{$_} ne $d[0]->{$_},
+keys %{$d[0]};
}
return $compatible ? 1 : 0;;
}
i canīt say i understand it fully but normally you can rely on ozo | [reply] [d/l] |
Re: Comparison by position and value
by sgifford (Prior) on Jan 02, 2005 at 20:35 UTC
|
It's the need to check the same digit appearing in different places in
the two strings that makes this hard to do with bitwise string
operations. But the only reason that bitwise string
operations are so fast is because you can avoid making all of those
jumps into the Perl interpreter; you can hand it both the strings and
an operation, let it groove along in C for awhile, then come back with
an answer.
You can get the same effect with Inline::C.
With a simple substr-based implementation, a C version
is nearly 4 times faster than a nearly identical Perl version:
Benchmark: timing 100000 iterations of csimple3, simple3...
csimple3: 5 wallclock secs ( 4.43 usr + 0.00 sys = 4.43 CPU)
@ 22573.36/s (n=100000)
simple3: 19 wallclock secs (16.90 usr + 0.01 sys = 16.91 CPU)
@ 5913.66/s (n=100000)
Here's the code I used:
| [reply] [d/l] [select] |
Re: Comparison by position and value
by jdalbec (Deacon) on Jan 03, 2005 at 00:56 UTC
|
sub appendinverse {
my $string = shift;
my @revarray;
for my $i (0..8) {
$revarray[0+substr($string, $i, 1)] = $i;
}
delete $revarray[0];
for my $i (1..9) {
if(exists $revarray[$i]) {
$string .= $revarray[$i];
} else {
$string .= "_";
}
}
return $string;
}
# test harness stolen from steves
my @tests =
(qw/
_8__3__19
48____7__
_8__3__19
4_2___7__
_8__3__19
4_8___7__
__8_3__19
48____7__
__8_3__19
84____7__
_8__3__19
48_____7_
/
);
# Schwartzian transform
for my $i (@tests) {
$i = appendinverse $i;
# print "$i\n";
}
sub compatible {
my $a = shift;
my $b = shift;
# modified as suggested by steves
# print "\n$a\n$b\n";
print "\n",substr($a,0,9),"\n",substr($b,0,9),"\n";
if (($a^$b)=~/[\001-\017]/) {
print "incompatible\n";
} else {
print "compatible\n";
}
}
# test harness stolen from steves
my ($s1, $s2);
while (defined($s1 = shift(@tests)))
{
$s2 = shift(@tests);
compatible($s1, $s2);
}
which outputs:
_8__3__197_4____18
48____7_____0__61_
compatible
_8__3__197_4____18
4_2___7___2_0__6__
compatible
_8__3__197_4____18
4_8___7_____0__62_
incompatible
__8_3__197_4____28
48____7_____0__61_
incompatible
__8_3__197_4____28
84____7_____1__60_
incompatible
_8__3__197_4____18
48_____7____0__71_
incompatible
_8__3__19
48____7__
compatible
_8__3__19
4_2___7__
compatible
_8__3__19
4_8___7__
incompatible
__8_3__19
48____7__
incompatible
__8_3__19
84____7__
incompatible
_8__3__19
48_____7_
incompatible
| [reply] [d/l] [select] |
|
| [reply] |
Re: Comparison by position and value
by sgifford (Prior) on Jan 04, 2005 at 08:39 UTC
|
Here's a solution that is very fast and manages to use
string-wise boolean operations. It uses a data structure that's a
little bit complicated, but since converting all the data structures
is linear to the number of strings you have and you're doing the
comparisons many more times than that, I think it will still offer a
significant speedup.
The basic idea is this:
It's checking whether a character appears elsewhere in the string
that's slow; there's not a straightforward way to do it without
looking at each character individually. But if we have a
string telling at which position each character occurs, we can use a
similar masking technique to see whether the same character occurs in
different positions in two strings. We can get this by doing
something similar to transposing a one-dimensional matrix.
Let me illustrate with an example.
Since we can compare the strings and the "transposed" strings the
same way, we can simply concatenate them together and store them with
their masks for the data structure. I used an arrayref with
[$orig_str, $str_and_xposed, $mask], with
$str_and_xposed having underscores changed to
\0.
With this data structure, the entire test becomes:
# a and mask[b] eq b and mask[a]
($_[0][1] & $_[1][2]) eq ($_[1][1] & $_[0][2]);
Here's how it benchmarks. simple3 is a simple
substr-based implementation, csimple3 is an
Inline::C implementation of simple3, clever2 is the above
code, clever3 is the same code but doing the data structure
transformations beforehand, cclever3 is an Inline::C
implementation of clever3. I've actually benchmarked several other
promising solutions in this thread, and this is by far the fastest.
Benchmark: timing 25000 iterations of cclever3, clever2, clever3, csim
+ple3, simple3...
cclever3: 1 wallclock secs ( 0.95 usr + 0.00 sys = 0.95 CPU)
@ 26315.79/s (n=25000)
clever2: 10 wallclock secs ( 8.73 usr + 0.02 sys = 8.75 CPU)
@ 2857.14/s (n=25000)
clever3: 1 wallclock secs ( 1.15 usr + 0.00 sys = 1.15 CPU)
@ 21739.13/s (n=25000)
csimple3: 1 wallclock secs ( 1.04 usr + -0.01 sys = 1.03 CPU)
@ 24271.84/s (n=25000)
simple3: 5 wallclock secs ( 3.97 usr + 0.01 sys = 3.98 CPU)
@ 6281.41/s (n=25000)
Here's the code I ran:
| [reply] [d/l] [select] |
Re: Comparison by position and value
by !1 (Hermit) on Jan 02, 2005 at 21:42 UTC
|
sub compat {
my ($f,$s) = @_;
my $neg = "[^".join("",grep $_ ne "_", split//, $f)."]";
$f =~ s/((.)\1*)/$1 eq "_" ? "$neg\{".length($2)."}":"[$1_]"/ge;
return $s =~ $f;
}
| [reply] [d/l] |
Re: Comparison by position and value
by sgifford (Prior) on Jan 04, 2005 at 09:02 UTC
|
For some reason I'm really enjoying this problem.
I've benchmarked all of the promising-looking solutions here. If you don't see yours and you think it's a contender, let me know, and ideally post your benchmark code and results as a followup to this (or stick it in a scratchpad and /msg me and I'll add it).
Some of mine use Inline::C; if you don't have it and don't want it, just comment out the multi-line use statement and the Init statement, and remove the benchmarks for sgifford_csimple3 and sgifford_cclever3.
Results (slightly doctored* for better display):
ambrus: 3s ( 3.19 usr + 0.01 sys = 3.20 CPU) @ 7812.50/s
aristotle: 7s ( 5.71 usr + 0.02 sys = 5.73 CPU) @ 4363.00/s
aristotle2: 5s ( 4.56 usr + 0.00 sys = 4.56 CPU) @ 5482.46/s
ccn: 2s ( 1.74 usr + 0.00 sys = 1.74 CPU) @ 14367.82/s
sgifford_cclever3: 1s ( 0.96 usr + 0.00 sys = 0.96 CPU) @ 26041.67/s
sgifford_clever2: 9s ( 8.56 usr + 0.05 sys = 8.61 CPU) @ 2903.60/s
sgifford_clever3: 2s ( 1.19 usr + 0.01 sys = 1.20 CPU) @ 20833.33/s
sgifford_csimple3: 1s ( 1.04 usr + 0.00 sys = 1.04 CPU) @ 24038.46/s
simple3: 5s ( 4.23 usr + 0.01 sys = 4.24 CPU) @ 5896.23/s
Code follows.
*Benchmark Doctoring Code: | [reply] [d/l] [select] |
|
|