Yzzyx has asked for the
wisdom of the Perl Monks concerning the following question:
I'm looking for a faster way to determine if all the digits in a 10 character string are unique. Running this once is fast enough, but my program does this thousands of times. Each time the string is different, of course. One slow way: #!/usr/bin/perl w
use strict;
while ( <> ) {
chomp;
die unless /^\d{10}$/;
&uniq ( $_ ) == 1 ? print "1\n" : print "0\n";
}
sub uniq {
for ( 0 .. 9 ) {
return 0 if $_[0] !~ /$_/;
}
return 1;
}
Another slow way: #!/usr/bin/perl w
use strict;
while ( <> ) {
chomp;
die unless /^\d{10}$/;
&uniq ( $_ ) == 1 ? print "1\n" : print "0\n";
}
sub uniq {
my $x = $_[0];
my @x = sort split //, $x;
$x = join "", @x;
$x =~ tr/09//s;
length $x == 10 ? return 1 : return 0;
}
Thanks!
Re: Determining uniqueness in a string.
by BrowserUk (Pope) on Oct 01, 2005 at 01:01 UTC

sub uniq{ $_[0] !~ m[(.).*\1] }
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.  Rule 1 has a caveat!  Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
 [reply] [d/l] 

Ah crap, you beat me to it. :)
Update with a fourth sub based on BrowserUK's suggestion. Gets another 25% over mine.
Oops, fixed error in u2 that made it look about 10% faster than it was. New timings too.
use Benchmark qw( cmpthese ) ;
my $digits = 2491306578;
cmpthese( 5, {
u1 => sub{
for ( 0 .. 9 ) { return 0 if $digits !~ /$_/;} return 1
+;
},
u2 => sub{
my @x = sort split //, $digits;
my $x = join "", @x;
$x =~ tr/09//s;
length $x == 10 ? return 1 : return 0;
},
u3 => sub {
return 0 if $digits =~ /(\d).*\1/; return 1;
},
u4 => sub {
return $digits !~ /(.).*\1/;
},
} );
__END__
All unique:
Rate u1 u2 u3 u4
u1 15504/s  75% 95% 95%
u2 62910/s 306%  78% 78%
u3 289343/s 1766% 360%  1%
u4 291910/s 1783% 364% 1% 
Not Unique:
Rate u1 u2 u3 u4
u1 17396/s  72% 98% 98%
u2 62887/s 262%  92% 92%
u3 748300/s 4202% 1090%  5%
u4 783739/s 4405% 1146% 5% 
 [reply] [d/l] [select] 

use Benchmark qw( cmpthese ) ;
my $digits = 2491306578;
my @regexes = map { qr/$_/ } 0 .. 9;
cmpthese( 5, {
u1 => sub{
for ( 0 .. 9 ) {
return 0 if $digits !~ /$_/;
}
return 1;
},
u2 => sub{
my @x = sort split //, $digits;
my $x = join "", @x;
$x =~ tr/09//s;
length $x == 10 ? return 1 : return 0;
},
u3 => sub {
return 0 if $digits =~ /(\d).*\1/; return 1;
},
u4 => sub {
return $digits !~ /(.).*\1/;
},
u5 => sub {
for (@regexes) {
return 0 if $digits !~ $_;
}
return 1;
}
Rate u1 u2 u5 u3 u4
u1 10555/s  71% 90% 93% 93%
u2 35955/s 241%  65% 76% 76%
u5 102215/s 868% 184%  32% 32%
u3 149240/s 1314% 315% 46%  1%
u4 150091/s 1322% 317% 47% 1% 
 [reply] [d/l] [select] 


Thanks for the benchmark numbers... I plugged each into my program and the speed improvement was pretty impressive. Now I will spend a while understanding what those code fragments do. :)
 [reply] 



uniq ( '192345810' ) == 1 ? print "1\n" : print "0\n";
sub uniq {
($result) = ($_[0] =~ m[((.).*\2)]);
print "result: $result; captured: $1\n";
}
 [reply] [d/l] 
Re: Determining uniqueness in a string.
by davido (Archbishop) on Oct 01, 2005 at 00:49 UTC

sub uniq {
my %hash;
@hash{ split //, $_[0] } = ();
return length $_[0] == keys %hash
}
 [reply] [d/l] 
Re: Determining uniqueness in a string.
by davido (Archbishop) on Oct 01, 2005 at 06:18 UTC

I couldn't leave well enough alone on this...
The regexp solution benefits from the very efficient regexp engine. But it is a solution that is built upon a bigO polynomial algorithm. If we expand the problem to finding uniqueness in strings consisting of threecharacterwide groups of alphabetical characters, that gives us a lot of room for dataset growth while maintaining a string of unique groups. The hash solution grows at O(n) since each hash insert occurs at an average of O(1). I can't quite figure out how bad the regular expression approach gets as the string grows, but it's probably something like O(n^2) or worse.
For short test strings the raw speed of the regexp engine wins over the complexity of the hashing algorithm. But for longer strings, there's literally no comparison. Here's some test code:
use strict;
use warnings;
use Benchmark qw( cmpthese ) ;
use vars qw/$tuplets $template/;
$tuplets = join '', ( 'aaa' .. 'caa' );
$template = join '', 'a3' x ( length( $tuplets ) / 3 );
print "Test string contains ",
length( $tuplets ) / 3,
" groups.\n\n";
cmpthese(
10,
{
regexp => sub {
return $tuplets !~ /^(?:.{3})*(.{3})(?:.{3})*\1/;
},
hash => sub {
my %hash;
@hash{ unpack $template, $tuplets } = ();
return(
length( $tuplets ) / 3 == keys( %hash )
);
}
}
);
And the results on my slow PentiumII laptop:
Test string contains 1353 groups.
s/iter regexp hash
regexp 1.15  98%
hash 1.84e002 6123% 
At first I thought my eyes were decieving me. 1.84e002 iterations per second? That's horrible. But then I realized that the regexp solution was so slow that Benchmark switched to showing seconds per iteration. So it takes 1.15 seconds per iteration for the regexp approach in my test example, and a blink of an eye (1.84e002) for the hash approach with a test string of 1353 groups. Try testing 'aaa' .. 'faa'. You'll have to increase the testing time about a minute to even get reliable results out of Benchmark at that point because the regexp approach becomes so sluggish.
Of course this is a contrived example, but aren't they all? ;) And I did have to modify the RE a little so that it would maintain proper framing. But the discussion caught my attention and I just had to prove to myself what I already suspected.
 [reply] [d/l] [select] 
Re: Determining uniqueness in a string. (2x)
by tye (Sage) on Oct 01, 2005 at 06:00 UTC

The fastest one I found (the only one I tried that was faster than the /(.).*\1/ variations) was certainly not the shortest:
$digits =~ /^(?=.*?0)(?=.*?1)(?=.*?2)(?=.*?3)(?=.*?4)(?=.*?5)(?=.*?6)(
+?=.*?7)(?=.*?8)(?=.*?9)/;
But it was about twice as fast.
Update: Added the '^' to make failure cases fast.
The (?=.*?0) asserts that $digits contains a '0'. If that fails, then the regex fails. If a '0' is found, then (since that was a lookahead assertion), we start over at the beginning of $digits and (?=.*?1) asserts that $digits contains a '1'. etc.
 [reply] [d/l] [select] 
Re: Determining uniqueness in a string.
by ioannis (Prior) on Oct 01, 2005 at 01:01 UTC

To speed things up, one easy step is to precompile the
(static) pattern using the qr// operator, like qr/^\d{10}$/, or using the m//o flag, as in
die unless /^\d{10}$/o;
Also, the print statement might prove too expensive; perhaps,
syswrite() is a better choice  if you must write to output.  [reply] [d/l] 

 [reply] 
Re: Determining uniqueness in a string.
by rtst (Initiate) on Oct 02, 2005 at 04:38 UTC

Sometimes verbose code is faster. This beat out the above regexp examples in all the cases I threw at it. (NOTE: the "Note unique" example is a best case scenario for this example, but the worst case of "All unique" still performs better").
use Benchmark qw( cmpthese ) ;
my $digits = 2491306578;
cmpthese( 5, {
u1 => sub{
for ( 0 .. 9 ) { return 0 if $digits !~ /$_/;} return 1;
},
u2 => sub{
my @x = sort split //, $digits;
my $x = join "", @x;
$x =~ tr/09//s;
length $x == 10 ? return 1 : return 0;
},
u3 => sub {
return 0 if $digits =~ /(\d).*\1/; return 1;
},
u4 => sub {
return $digits !~ /(.).*\1/;
},
u9 => sub {
return 0 if index($digits, substr($digits,0,1), 1) != 1;
return 0 if index($digits, substr($digits,1,1), 2) != 1;
return 0 if index($digits, substr($digits,2,1), 3) != 1;
return 0 if index($digits, substr($digits,3,1), 4) != 1;
return 0 if index($digits, substr($digits,4,1), 5) != 1;
return 0 if index($digits, substr($digits,5,1), 6) != 1;
return 0 if index($digits, substr($digits,6,1), 7) != 1;
return 0 if index($digits, substr($digits,7,1), 8) != 1;
return 0 if index($digits, substr($digits,8,1), 9) != 1;
return 1;
},
} );
__END__
All unique (2491306578):
Rate u1 u2 u3 u4 u9
u1 14387/s  71% 91% 91% 93%
u2 50142/s 249%  67% 67% 74%
u3 153182/s 965% 205%  0% 22%
u4 153395/s 966% 206% 0%  22%
u9 195811/s 1261% 291% 28% 28% 
Not unique (1102345678):
Rate u1 u2 u3 u4 u9
u1 14381/s  71% 97% 97% 99%
u2 50429/s 251%  91% 91% 97%
u3 531431/s 3595% 954%  4% 65%
u4 556425/s 3769% 1003% 5%  64%
u9 1539566/s 10606% 2953% 190% 177% 
 [reply] [d/l] [select] 

You have a point, though if I were going the verbose route, I would use the transliterate function rather than an indexsubstr. The back reference method come out slightly faster on my computer than indexsubstr with transliterate sightly ahead of both. It is moot however, tye's solution above (u5) pretty much spanks all of them.
use Benchmark qw( cmpthese ) ;
my $digits = 2491306578;
cmpthese( 5, {
u1 => sub{
for ( 0 .. 9 ) { return 0 if $digits !~ /$_/;} return 1;
},
u2 => sub{
my @x = sort split //, $digits;
my $x = join "", @x;
$x =~ tr/09//s;
length $x == 10 ? return 1 : return 0;
},
u3 => sub {
return 0 if $digits =~ /(\d).*\1/; return 1;
},
u4 => sub {
return $digits !~ /(.).*\1/;
},
u5 => sub {
return $digits =~ /^(?=.*?0)(?=.*?1)(?=.*?2)(?=.*?3)(?=.*?4)
+(?=.*?5)(?=.*?6)(?=.*?7)(?=.*?8)(?=.*?9)/;
},
u6 => sub {
return 0 unless $digits =~ y/0/0/;
return 0 unless $digits =~ y/1/1/;
return 0 unless $digits =~ y/2/2/;
return 0 unless $digits =~ y/3/3/;
return 0 unless $digits =~ y/4/4/;
return 0 unless $digits =~ y/5/5/;
return 0 unless $digits =~ y/6/6/;
return 0 unless $digits =~ y/7/7/;
return 0 unless $digits =~ y/8/8/;
return 0 unless $digits =~ y/9/9/;
return 1;
},
u9 => sub {
return 0 if index($digits, substr($digits,0,1), 1) != 1;
return 0 if index($digits, substr($digits,1,1), 2) != 1;
return 0 if index($digits, substr($digits,2,1), 3) != 1;
return 0 if index($digits, substr($digits,3,1), 4) != 1;
return 0 if index($digits, substr($digits,4,1), 5) != 1;
return 0 if index($digits, substr($digits,5,1), 6) != 1;
return 0 if index($digits, substr($digits,6,1), 7) != 1;
return 0 if index($digits, substr($digits,7,1), 8) != 1;
return 0 if index($digits, substr($digits,8,1), 9) != 1;
return 1;
},
} );
__END__
Rate u1 u2 u9 u4 u3 u6 u5
u1 10982/s  72% 95% 95% 95% 96% 97%
u2 39808/s 262%  83% 83% 83% 85% 90%
u9 229374/s 1989% 476%  4% 4% 12% 42%
u4 238244/s 2069% 498% 4%  1% 9% 40%
u3 240061/s 2086% 503% 5% 1%  8% 39%
u6 261581/s 2282% 557% 14% 10% 9%  34%
u5 394564/s 3493% 891% 72% 66% 64% 51% 
 [reply] [d/l] [select] 

You're measuring the speed of a positive match. How about the speed of determining failure? That's an important factor too.
 [reply] 
Re: Determining uniqueness in a string
by LucaPette (Friar) on Oct 02, 2005 at 09:28 UTC

Hi,
Unfortunately at this moment i can't test the speed of my little idea about your problem. Well, i think that a simple solution is to add all digits and test if the result is 45, infact the sum of all digits in a ten character string is 45 only where every digit from the set 09 occurs exactly once, according to the Gauss' formula (N*(N*1)/2). I hope that, in spite of my ugly english, this little idea can help someone.  [reply] 

9 + 8 + 7 + 6 + 5 + 4 + 1 + 1 + 2 + 2 = 45
 [reply] 

errare humanum est... your reply is an incitation for me to being more careful and above all testing my little ideas.
 [reply] 
Re: Determining uniqueness in a string.
by Moron (Curate) on Oct 03, 2005 at 13:25 UTC

A combination of all ten digits uniquely is a permutation which can be generated using Algorithm::FastPermute.
So given that the code has to execute for thousands of candidate combinations, it seems it might be better to store all the permutations as the keys of a hash (with value 1 throughout) only once and start accepting candidates to check just by looking up in the hash. If using Linux, an additional optimisation could be to store the precalculated set of permutations in a Storable on the /shm (persistent shared memory) device, so that the hash of valid permutations doesn't even need to be calculated between runs of the program and can be loaded directly from persistent shared memory at the beginning of any given run.
 [reply] 

 [reply] [d/l] 

There will be 10! entries in the the hash. Extrapolating the size it takes to store 10!/10, 10!/5 and 10!/2 10 character strings in a hash, I estimate the needed size to be around 170Mb. Trying to stuff 10! of those strings in a hash cause all the machines I tried it on to swap  with top showing a memory usage of around half a Gb of the running program.
 [reply] 

I had thought of that  but 170 Mb would be okay for all my machines  one compromise might be to store say 5 digits (10P5) with a simple array of the remaining 5 being referenced as the value, e.g.:
$hash{ '01234' } = \( 5..9 );
This would reduce the storage by a factor of more than a hundred, while requiring that the last half of the digits would then need to be compared with the array being also prestored at each outer node of the hash, but with the same 120x performance for those last digits as compared with processing 10 without prestored results  total performance improvement should be around 80x in this case (but PLUS the fixed overheads if you don't have linux shared memory to prestore the solution)
 [reply] [d/l] 
Re: Determining uniqueness in a string.
by Anonymous Monk on Oct 01, 2005 at 21:53 UTC

(pseudocode because my perl has become rusty and i would probably get the syntax wrong)
use an array $d[0] to d9, set all elements to 0;
loop over digits in string:
if ( $d$_ ) return 0;
$d$_ = 1; # should be faster then adding 1
return 1;  [reply] 

sub uni{
my ($s) = shift;
my @a;
for $i (0..9) {
$j = int(substr($s,$i,1));
if ($a[$j]) {return( 0 )};
$a[$j] = 1;}
return 1; }
print uni("1902356784");
 [reply] [d/l] 

Regex seems to be faster.
use Inline C;
use Benchmark qw( cmpthese ) ;
my $digits = '1234567891';
my $bob = {
u1 => sub {
return uniq($digits);
},
u2 => sub {
my @l = ();
for my $x (split '', $digits) {
if ($l[$x]++) {
return 1;
}
}
return 0;
},
u3 => sub {
my @l = ();
my $max = length($digits);
for (my $i=0; $i < $max; $i++) {
$x = substr($digits,$i,1);
if ($l[$x]++) {
return 1;
}
}
return 0;
},
u4 => sub {
return $digits !~ /(.).*\1/ ? 0:1;
},
};
print "u1: ", $bob>{u1}>(), "\n";
print "u2: ", $bob>{u2}>(), "\n";
print "u3: ", $bob>{u3}>(), "\n";
print "u4: ", $bob>{u4}>(), "\n";
cmpthese( 5, $bob);
__END__
Rate u2 u3 u4 u1
u2 19725/s  10% 90% 95%
u3 21847/s 11%  89% 94%
u4 204817/s 938% 838%  44%
u1 367282/s 1762% 1581% 79% 
__C__
int uniq(char* name) {
int bob[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
int max = strlen(name);
int x;
for (x = 0; x < max; x++) {
if (bob[name[x]'0']++) {
return 1;
}
}
return 0;
}
 gam3
A picture is worth a thousand words, but takes 200K.
 [reply] [d/l] 
Re: Determining uniqueness in a string.
by creamygoodness (Curate) on Oct 03, 2005 at 23:49 UTC

If you're gonna get this nuts about speed, you may as well fire up Inline C or XS. If you know XS, this solution is actually easier to grok than some of the more esoteric solutions above, as well as faster:
!/usr/bin/perl
use strict;
use warnings;
# fast_uniq.plx  10 unique digits in a 10char string
use Benchmark qw( cmpthese );
use Inline C => <<'END_C';
/* works for ASCII, EBCDIC, UTF8 */
int offset;
SV* digits_sv;
void prime_variables (SV* input_sv, SV* zero_sv) {
digits_sv = newSVsv(input_sv);
char* zero_str = SvPV_nolen(zero_sv);
offset = *zero_str;
}
int test_uniq_digits () {
STRLEN digits_len = SvCUR( digits_sv );
char* string = SvPV( digits_sv, digits_len );
char test_buf[10] = { 0,0,0,0,0, 0,0,0,0,0 };
int i, index;
for (i = 0; i < 10; i++) {
index = *string  offset;
if (index < 0  index > 9)
croak("illegal character: '%c'", *string);
if (test_buf[index] == 1)
return 0;
test_buf[index] = 1;
string++;
}
return 1;
}
END_C
run_test("0123456789", "UNIQUE:\n");
run_test("1123456789", "NOT_UNIQUE:\n");
sub run_test {
my ($digits, $message) = @_;
prime_variables($digits, "0");
print $message;
cmpthese( 5, {
u1 => sub {
for ( 0 .. 9 ) {
return 0 if $digits !~ /$_/;
}
return 1;
},
u1000 => sub { test_uniq_digits },
});
print "\n\n";
}
Here's the output:
UNIQUE:
Rate u1 u1000
u1 10619/s  98%
u1000 692527/s 6421% 
NOT_UNIQUE:
Rate u1 u1000
u1 414417/s  42%
u1000 712019/s 72% 
 [reply] [d/l] [select] 

__DATA__
__C__
int uniq( char* s ) {
int i;
int t = 0;
for( i=0; i<10; i++ ) {
t = 1 << ( s[i]  48 );
}
return t == 1023;
}
which doesn't seem likely, then the big hash lookup will still be faster,
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.  Rule 1 has a caveat!  Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
 [reply] [d/l] 

 [reply] 



Re: Determining uniqueness in a string.
by QM (Parson) on Oct 03, 2005 at 20:58 UTC

 [reply] 

