Pathologically Eclectic Rubbish Lister PerlMonks

### comment on

 Need Help??

Here's a solution that is very fast and manages to use string-wise boolean operations. It uses a data structure that's a little bit complicated, but since converting all the data structures is linear to the number of strings you have and you're doing the comparisons many more times than that, I think it will still offer a significant speedup.

The basic idea is this:

It's easy and fast to check whether there are different digits at the same position; you can use some kind of a mask to ignore the _'s, and then just compare the masked strings with eq. To compare two strings \$a and \$b, following aristotle's example we generate a mask with tr/\0/\377/c, then test:
```(\$a & \$b_mask) eq (\$b & \$a_mask)

It's checking whether a character appears elsewhere in the string that's slow; there's not a straightforward way to do it without looking at each character individually. But if we have a string telling at which position each character occurs, we can use a similar masking technique to see whether the same character occurs in different positions in two strings. We can get this by doing something similar to transposing a one-dimensional matrix.

Let me illustrate with an example.

If we have the string _8__3__19, we can "transpose" it into the string _7_4____18, meaning that 0, 2, 4, 5, 6, and 7 don't appear in the string at all; 1 appears at position 7, 3 appears at position 4, 8 appears at position 1, and 9 appears at position 8. Similarly, we can "transpose" 48____7__ to ____0__61_.

Now we can use the same mask-and-compare technique aristotle gave us to compare the "transposed" values, and get a very fast answer. This works because if any digit appears in only one string, it will be masked out; if it appears in both, it must be at the same position, so after masking away everything that appears in only one of the strings, what's left must be equal.

Since we can compare the strings and the "transposed" strings the same way, we can simply concatenate them together and store them with their masks for the data structure. I used an arrayref with [\$orig_str, \$str_and_xposed, \$mask], with \$str_and_xposed having underscores changed to \0.

With this data structure, the entire test becomes:

```  #    a   and mask[b]  eq      b   and mask[a]
(\$_[0][1] & \$_[1][2]) eq (\$_[1][1] & \$_[0][2]);

Here's how it benchmarks. simple3 is a simple substr-based implementation, csimple3 is an Inline::C implementation of simple3, clever2 is the above code, clever3 is the same code but doing the data structure transformations beforehand, cclever3 is an Inline::C implementation of clever3. I've actually benchmarked several other promising solutions in this thread, and this is by far the fastest.

```Benchmark: timing 25000 iterations of cclever3, clever2, clever3, csim
+ple3, simple3...
cclever3:  1 wallclock secs ( 0.95 usr +  0.00 sys =  0.95 CPU)
@ 26315.79/s (n=25000)
clever2: 10 wallclock secs ( 8.73 usr +  0.02 sys =  8.75 CPU)
@  2857.14/s (n=25000)
clever3:  1 wallclock secs ( 1.15 usr +  0.00 sys =  1.15 CPU)
@ 21739.13/s (n=25000)
csimple3:  1 wallclock secs ( 1.04 usr + -0.01 sys =  1.03 CPU)
@ 24271.84/s (n=25000)
simple3:  5 wallclock secs ( 3.97 usr +  0.01 sys =  3.98 CPU)
@ 6281.41/s (n=25000)

Here's the code I ran:

```#!/usr/bin/perl

use Benchmark;

# Remove the following paragraph and the benchmarks for csimple3 and
# cclever3 if you don't have Inline::C.
use Inline C => 'DATA',
VERSION => 0.0,
NAME => 'SimpleTest',
OPTIMIZE => '-O3';
Inline->init;

sub simple3
{
my(\$a,\$b)=@_;
my(@seen);

return undef if (length(\$a) != length(\$b));
foreach my \$i (0..length(\$a))
{
my(\$ac,\$bc)=(substr(\$a,\$i,1),substr(\$b,\$i,1));
if (\$ac eq \$bc)
{
; # Do nothing
}
elsif (\$ac eq '_')
{
return undef if (++\$seen[\$bc] > 1);
}
elsif (\$bc eq '_')
{
return undef if (++\$seen[\$ac] > 1);
}
else { return undef }
}
1;
}

# Represent each string as two strings and two masks.
sub clever2
{
# Data transformations; could be done beforehand in linear time.
# Underscores become \0 for easier masking and comparison
(my \$a = \$_[0]) =~ tr/_/\0/;
(my \$b = \$_[1]) =~ tr/_/\0/;

# Compute a "transposed" string showing the position of each charact
+er,
# leaving \0 if it doesn't appear.
my(\$a3,\$b3)=("\0"x10,"\0"x10);
foreach my \$i (0..(length(\$a)-1))
{
my \$c = substr(\$a,\$i,1);
next if \$c eq "\0";
substr(\$a3,\$c,1)=\$i;
}
foreach my \$i (0..(length(\$b)-1))
{
my \$c = substr(\$b,\$i,1);
next if \$c eq "\0";
substr(\$b3,\$c,1)=\$i;
}

# Concatenate together
my \$a_new = \$a . \$a3;
my \$b_new = \$b . \$b3;

(my \$a_mask = \$a_new)  =~ tr/\0/\xff/c;
(my \$b_mask = \$b_new)  =~ tr/\0/\xff/c;

# Comparisons; must be done for each comparison.
}

# Same as clever2, but with data structure precomputed.
sub clever3
{
(\$_[0][1] & \$_[1][2]) eq (\$_[1][1] & \$_[0][2]);
}

# Precompute the data structure for clever3
sub clever3_xform
{
# Underscore to \0 for easier masking and comparison
(my \$a = \$_[0]) =~ tr/_/\0/;

# Compute a "transposed" string showing the position of each charact
+er,
# leaving \0 if it doesn't appear.
my(\$a3)="\0"x10;
foreach my \$i (0..(length(\$a)-1))
{
my \$c = substr(\$a,\$i,1);
next if \$c eq "\0";
substr(\$a3,\$c,1)=\$i;
}

# Concatenate
my \$a_new = \$a . \$a3;

(my \$a_mask = \$a_new)  =~ tr/\0/\xff/c;

# Return an arrayref; this is the data structure we'll use in clever
+3.
}

my @tests =
(qw/
_8__3__19
48____7__

_8__3__19
4_2___7__

_8__3__19
4_8___7__

__8_3__19
48____7__

__8_3__19
84____7__

_8__3__19
48_____7_
/
);

sub run_tests
{
my(\$func,\$verbose,@tests)=@_;
my (\$s1, \$s2);
while (defined(\$s1 = shift(@tests)))
{
\$s2 = shift(@tests);
my \$result = \$func->(\$s1, \$s2);
if (\$verbose)
{
if (ref(\$s1) && ref(\$s2))
{
\$s1 = \$s1->[0];
\$s2 = \$s2->[0];
}
print "\$s1\n\$s2: ",\$result?"compatible":"not compatible","\n";
}
}
}

my @tests_clever3 = map { clever3_xform(\$_) } @tests;
run_tests(\&clever3, 1, @tests_clever3);

timethese(25_000, {
simple3 => sub { run_tests(\&simple3, 0, @tests) },
csimple3 => sub { run_tests(\&csimple3, 0, @tests) },
clever2 => sub { run_tests(\&clever2, 0, @tests) },
clever3 => sub { run_tests(\&clever3, 0, @tests_cleve
+r3) },
cclever3 => sub { run_tests(\&cclever3, 0, @tests_cle
+ver3) },
});

__DATA__
__C__
int csimple3(const char *a, const char *b)
{
int i;
int l;
unsigned char seen[256];

memset(seen,0,256);

if ((l=strlen(a)) != strlen(b))
return 0;
for(i=0;i<l;i++)
{
if (a[i] == b[i])
{
; /* Do nothing */
}
else if (a[i] == '_')
{
if (++seen[b[i]] > 1)
return 0;
}
else if (b[i] == '_')
{
if (++seen[a[i]] > 1)
return 0;
}
else
return 0;
}
return 1;
}

int cclever3(SV *a, SV *b)
{
AV *a_arr, *b_arr;
SV **tmp;
int i;

/* First  get the arrays from the references */
if (!SvROK(a) || !SvROK(b))
croak("a or b not arrayrefs!");
a_arr = (AV*)SvRV(a);
b_arr = (AV*)SvRV(b);

/* Now pull out the data */

if ( (tmp = av_fetch(a_arr, 1, 0)) == NULL)
croak("a[1] is undef");
if ((a_val = SvPV(*tmp, PL_na)) == NULL)
croak("a[1] contains NULL pointer?");

if ( (tmp = av_fetch(a_arr, 2, 0)) == NULL)
croak("a[2] is undef");
if ((a_mask = SvPV(*tmp, PL_na)) == NULL)
croak("a[2] contains NULL pointer?");

if ( (tmp = av_fetch(b_arr, 1, 0)) == NULL)
croak("b[1] is undef");
if ((b_val = SvPV(*tmp, PL_na)) == NULL)
croak("b[1] contains NULL pointer?");

if ( (tmp = av_fetch(b_arr, 2, 0)) == NULL)
croak("b[2] is undef");
if ((b_mask = SvPV(*tmp, PL_na)) == NULL)
croak("b[2] contains NULL pointer?");

/* OK, finally we have all of the data! */
for(i=0;i<20;i++)
{
return 0;
}
return 1;
}

In reply to Re: Comparison by position and value by sgifford
in thread Comparison by position and value by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?

Create A New User
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2020-10-21 03:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
My favourite web site is:

Results (212 votes). Check out past polls.

Notices?