Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Determining uniqueness in a string.

by Yzzyx (Beadle)
on Oct 01, 2005 at 00:37 UTC ( #496566=perlquestion: print w/ replies, xml ) Need Help??
Yzzyx has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a faster way to determine if all the digits in a 10 character string are unique.

Running this once is fast enough, but my program does this thousands of times. Each time the string is different, of course.

One slow way:

#!/usr/bin/perl -w use strict; while ( <> ) { chomp; die unless /^\d{10}$/; &uniq ( $_ ) == 1 ? print "1\n" : print "0\n"; } sub uniq { for ( 0 .. 9 ) { return 0 if $_[0] !~ /$_/; } return 1; }

Another slow way:

#!/usr/bin/perl -w use strict; while ( <> ) { chomp; die unless /^\d{10}$/; &uniq ( $_ ) == 1 ? print "1\n" : print "0\n"; } sub uniq { my $x = $_[0]; my @x = sort split //, $x; $x = join "", @x; $x =~ tr/0-9//s; length $x == 10 ? return 1 : return 0; }

Thanks!

Comment on Determining uniqueness in a string.
Select or Download Code
Re: Determining uniqueness in a string.
by davido (Archbishop) on Oct 01, 2005 at 00:49 UTC

    Hashes are good at determining uniqueness. Here's an example:

    sub uniq { my %hash; @hash{ split //, $_[0] } = (); return length $_[0] == keys %hash }

    Dave

Re: Determining uniqueness in a string.
by BrowserUk (Pope) on Oct 01, 2005 at 01:01 UTC

    This might prove to be a little quicker

    sub uniq{ $_[0] !~ m[(.).*\1] }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

      Ah crap, you beat me to it. :)

      Update with a fourth sub based on BrowserUK's suggestion. Gets another 2-5% over mine.

      Oops, fixed error in u2 that made it look about 10% faster than it was. New timings too.

      use Benchmark qw( cmpthese ) ; my $digits = 2491306578; cmpthese( -5, { u1 => sub{ for ( 0 .. 9 ) { return 0 if $digits !~ /$_/;} return 1 +; }, u2 => sub{ my @x = sort split //, $digits; my $x = join "", @x; $x =~ tr/0-9//s; length $x == 10 ? return 1 : return 0; }, u3 => sub { return 0 if $digits =~ /(\d).*\1/; return 1; }, u4 => sub { return $digits !~ /(.).*\1/; }, } ); __END__
      All unique: Rate u1 u2 u3 u4 u1 15504/s -- -75% -95% -95% u2 62910/s 306% -- -78% -78% u3 289343/s 1766% 360% -- -1% u4 291910/s 1783% 364% 1% -- Not Unique: Rate u1 u2 u3 u4 u1 17396/s -- -72% -98% -98% u2 62887/s 262% -- -92% -92% u3 748300/s 4202% 1090% -- -5% u4 783739/s 4405% 1146% 5% --
        Thanks for the benchmark numbers... I plugged each into my program and the speed improvement was pretty impressive. Now I will spend a while understanding what those code fragments do. :)
        Precompiling regexes can make a dramatic difference. Compare the stats below for u1 vs u5 -- there's an 868% speedup on my box!

        Nevertheless, the backreference is just a better algo, and it still wins.

        use Benchmark qw( cmpthese ) ; my $digits = 2491306578; my @regexes = map { qr/$_/ } 0 .. 9; cmpthese( -5, { u1 => sub{ for ( 0 .. 9 ) { return 0 if $digits !~ /$_/; } return 1; }, u2 => sub{ my @x = sort split //, $digits; my $x = join "", @x; $x =~ tr/0-9//s; length $x == 10 ? return 1 : return 0; }, u3 => sub { return 0 if $digits =~ /(\d).*\1/; return 1; }, u4 => sub { return $digits !~ /(.).*\1/; }, u5 => sub { for (@regexes) { return 0 if $digits !~ $_; } return 1; }

        Rate u1 u2 u5 u3 u4 u1 10555/s -- -71% -90% -93% -93% u2 35955/s 241% -- -65% -76% -76% u5 102215/s 868% 184% -- -32% -32% u3 149240/s 1314% 315% 46% -- -1% u4 150091/s 1322% 317% 47% 1% --
        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
      Wow.

      Not quite sure how it works. Is it doing something like the code below?:

      uniq ( '192345810' ) == 1 ? print "1\n" : print "0\n"; sub uniq { ($result) = ($_[0] =~ m[((.).*\2)]); print "result: $result; captured: $1\n"; }
Re: Determining uniqueness in a string.
by ioannis (Priest) on Oct 01, 2005 at 01:01 UTC
    To speed things up, one easy step is to precompile the (static) pattern using the qr// operator, like qr/^\d{10}$/, or using the m//o flag, as in
    die unless /^\d{10}$/o;
    Also, the print statement might prove too expensive; perhaps, syswrite() is a better choice -- if you must write to output.

      No, using /o here isn't buying anything. And using syswrite() instead of print under the pretense of improving performance is probably misguided.

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: Determining uniqueness in a string. (2x)
by tye (Cardinal) on Oct 01, 2005 at 06:00 UTC

    The fastest one I found (the only one I tried that was faster than the /(.).*\1/ variations) was certainly not the shortest:

    $digits =~ /^(?=.*?0)(?=.*?1)(?=.*?2)(?=.*?3)(?=.*?4)(?=.*?5)(?=.*?6)( +?=.*?7)(?=.*?8)(?=.*?9)/;

    But it was about twice as fast.

    Update: Added the '^' to make failure cases fast.

    The (?=.*?0) asserts that $digits contains a '0'. If that fails, then the regex fails. If a '0' is found, then (since that was a look-ahead assertion), we start over at the beginning of $digits and (?=.*?1) asserts that $digits contains a '1'. etc.

    - tye        

Re: Determining uniqueness in a string.
by davido (Archbishop) on Oct 01, 2005 at 06:18 UTC

    I couldn't leave well enough alone on this...

    The regexp solution benefits from the very efficient regexp engine. But it is a solution that is built upon a big-O polynomial algorithm. If we expand the problem to finding uniqueness in strings consisting of three-character-wide groups of alphabetical characters, that gives us a lot of room for dataset growth while maintaining a string of unique groups. The hash solution grows at O(n) since each hash insert occurs at an average of O(1). I can't quite figure out how bad the regular expression approach gets as the string grows, but it's probably something like O(n^2) or worse.

    For short test strings the raw speed of the regexp engine wins over the complexity of the hashing algorithm. But for longer strings, there's literally no comparison. Here's some test code:

    use strict; use warnings; use Benchmark qw( cmpthese ) ; use vars qw/$tuplets $template/; $tuplets = join '', ( 'aaa' .. 'caa' ); $template = join '', 'a3' x ( length( $tuplets ) / 3 ); print "Test string contains ", length( $tuplets ) / 3, " groups.\n\n"; cmpthese( -10, { regexp => sub { return $tuplets !~ /^(?:.{3})*(.{3})(?:.{3})*\1/; }, hash => sub { my %hash; @hash{ unpack $template, $tuplets } = (); return( length( $tuplets ) / 3 == keys( %hash ) ); } } );

    And the results on my slow Pentium-II laptop:

    Test string contains 1353 groups. s/iter regexp hash regexp 1.15 -- -98% hash 1.84e-002 6123% --

    At first I thought my eyes were decieving me. 1.84e-002 iterations per second? That's horrible. But then I realized that the regexp solution was so slow that Benchmark switched to showing seconds per iteration. So it takes 1.15 seconds per iteration for the regexp approach in my test example, and a blink of an eye (1.84e-002) for the hash approach with a test string of 1353 groups. Try testing 'aaa' .. 'faa'. You'll have to increase the testing time about a minute to even get reliable results out of Benchmark at that point because the regexp approach becomes so sluggish.

    Of course this is a contrived example, but aren't they all? ;) And I did have to modify the RE a little so that it would maintain proper framing. But the discussion caught my attention and I just had to prove to myself what I already suspected.


    Dave

Re: Determining uniqueness in a string.
by Anonymous Monk on Oct 01, 2005 at 21:53 UTC
    (pseudocode because my perl has become rusty and i would probably get the syntax wrong) use an array $d[0] to d9, set all elements to 0; loop over digits in string: if ( $d$_ ) return 0; $d$_ = 1; # should be faster then adding 1 return 1;
      Oops, formatting necesse est!

      Its only now that I saw the other solutions. I didn't time it, but regex should be to much overhead. Is that correct?

      sub uni{ my ($s) = shift; my @a; for $i (0..9) { $j = int(substr($s,$i,1)); if ($a[$j]) {return( 0 )}; $a[$j] = 1;} return 1; } print uni("1902356784");
        Regex seems to be faster.
        use Inline C; use Benchmark qw( cmpthese ) ; my $digits = '1234567891'; my $bob = { u1 => sub { return uniq($digits); }, u2 => sub { my @l = (); for my $x (split '', $digits) { if ($l[$x]++) { return 1; } } return 0; }, u3 => sub { my @l = (); my $max = length($digits); for (my $i=0; $i < $max; $i++) { $x = substr($digits,$i,1); if ($l[$x]++) { return 1; } } return 0; }, u4 => sub { return $digits !~ /(.).*\1/ ? 0:1; }, }; print "u1: ", $bob->{u1}->(), "\n"; print "u2: ", $bob->{u2}->(), "\n"; print "u3: ", $bob->{u3}->(), "\n"; print "u4: ", $bob->{u4}->(), "\n"; cmpthese( -5, $bob); __END__ Rate u2 u3 u4 u1 u2 19725/s -- -10% -90% -95% u3 21847/s 11% -- -89% -94% u4 204817/s 938% 838% -- -44% u1 367282/s 1762% 1581% 79% -- __C__ int uniq(char* name) { int bob[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; int max = strlen(name); int x; for (x = 0; x < max; x++) { if (bob[name[x]-'0']++) { return 1; } } return 0; }
        -- gam3
        A picture is worth a thousand words, but takes 200K.
Re: Determining uniqueness in a string.
by rtst (Initiate) on Oct 02, 2005 at 04:38 UTC
    Sometimes verbose code is faster. This beat out the above regexp examples in all the cases I threw at it. (NOTE: the "Note unique" example is a best case scenario for this example, but the worst case of "All unique" still performs better").
    use Benchmark qw( cmpthese ) ; my $digits = 2491306578; cmpthese( -5, { u1 => sub{ for ( 0 .. 9 ) { return 0 if $digits !~ /$_/;} return 1; }, u2 => sub{ my @x = sort split //, $digits; my $x = join "", @x; $x =~ tr/0-9//s; length $x == 10 ? return 1 : return 0; }, u3 => sub { return 0 if $digits =~ /(\d).*\1/; return 1; }, u4 => sub { return $digits !~ /(.).*\1/; }, u9 => sub { return 0 if index($digits, substr($digits,0,1), 1) != -1; return 0 if index($digits, substr($digits,1,1), 2) != -1; return 0 if index($digits, substr($digits,2,1), 3) != -1; return 0 if index($digits, substr($digits,3,1), 4) != -1; return 0 if index($digits, substr($digits,4,1), 5) != -1; return 0 if index($digits, substr($digits,5,1), 6) != -1; return 0 if index($digits, substr($digits,6,1), 7) != -1; return 0 if index($digits, substr($digits,7,1), 8) != -1; return 0 if index($digits, substr($digits,8,1), 9) != -1; return 1; }, } ); __END__
    All unique (2491306578): Rate u1 u2 u3 u4 u9 u1 14387/s -- -71% -91% -91% -93% u2 50142/s 249% -- -67% -67% -74% u3 153182/s 965% 205% -- -0% -22% u4 153395/s 966% 206% 0% -- -22% u9 195811/s 1261% 291% 28% 28% -- Not unique (1102345678): Rate u1 u2 u3 u4 u9 u1 14381/s -- -71% -97% -97% -99% u2 50429/s 251% -- -91% -91% -97% u3 531431/s 3595% 954% -- -4% -65% u4 556425/s 3769% 1003% 5% -- -64% u9 1539566/s 10606% 2953% 190% 177% --

      You have a point, though if I were going the verbose route, I would use the transliterate function rather than an index-substr. The back reference method come out slightly faster on my computer than index-substr with transliterate sightly ahead of both. It is moot however, tye's solution above (u5) pretty much spanks all of them.

      use Benchmark qw( cmpthese ) ; my $digits = 2491306578; cmpthese( -5, { u1 => sub{ for ( 0 .. 9 ) { return 0 if $digits !~ /$_/;} return 1; }, u2 => sub{ my @x = sort split //, $digits; my $x = join "", @x; $x =~ tr/0-9//s; length $x == 10 ? return 1 : return 0; }, u3 => sub { return 0 if $digits =~ /(\d).*\1/; return 1; }, u4 => sub { return $digits !~ /(.).*\1/; }, u5 => sub { return $digits =~ /^(?=.*?0)(?=.*?1)(?=.*?2)(?=.*?3)(?=.*?4) +(?=.*?5)(?=.*?6)(?=.*?7)(?=.*?8)(?=.*?9)/; }, u6 => sub { return 0 unless $digits =~ y/0/0/; return 0 unless $digits =~ y/1/1/; return 0 unless $digits =~ y/2/2/; return 0 unless $digits =~ y/3/3/; return 0 unless $digits =~ y/4/4/; return 0 unless $digits =~ y/5/5/; return 0 unless $digits =~ y/6/6/; return 0 unless $digits =~ y/7/7/; return 0 unless $digits =~ y/8/8/; return 0 unless $digits =~ y/9/9/; return 1; }, u9 => sub { return 0 if index($digits, substr($digits,0,1), 1) != -1; return 0 if index($digits, substr($digits,1,1), 2) != -1; return 0 if index($digits, substr($digits,2,1), 3) != -1; return 0 if index($digits, substr($digits,3,1), 4) != -1; return 0 if index($digits, substr($digits,4,1), 5) != -1; return 0 if index($digits, substr($digits,5,1), 6) != -1; return 0 if index($digits, substr($digits,6,1), 7) != -1; return 0 if index($digits, substr($digits,7,1), 8) != -1; return 0 if index($digits, substr($digits,8,1), 9) != -1; return 1; }, } ); __END__
      Rate u1 u2 u9 u4 u3 u6 u5 u1 10982/s -- -72% -95% -95% -95% -96% -97% u2 39808/s 262% -- -83% -83% -83% -85% -90% u9 229374/s 1989% 476% -- -4% -4% -12% -42% u4 238244/s 2069% 498% 4% -- -1% -9% -40% u3 240061/s 2086% 503% 5% 1% -- -8% -39% u6 261581/s 2282% 557% 14% 10% 9% -- -34% u5 394564/s 3493% 891% 72% 66% 64% 51% --
        You're measuring the speed of a positive match. How about the speed of determining failure? That's an important factor too.

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Determining uniqueness in a string
by LucaPette (Friar) on Oct 02, 2005 at 09:28 UTC
    Hi, Unfortunately at this moment i can't test the speed of my little idea about your problem. Well, i think that a simple solution is to add all digits and test if the result is 45, infact the sum of all digits in a ten character string is 45 only where every digit from the set 0-9 occurs exactly once, according to the Gauss' formula (N*(N*1)/2). I hope that, in spite of my ugly english, this little idea can help someone.
      9 + 8 + 7 + 6 + 5 + 4 + 1 + 1 + 2 + 2 = 45
        errare humanum est... your reply is an incitation for me to being more careful and above all testing my little ideas.
Re: Determining uniqueness in a string.
by Moron (Curate) on Oct 03, 2005 at 13:25 UTC
    A combination of all ten digits uniquely is a permutation which can be generated using Algorithm::FastPermute.

    So given that the code has to execute for thousands of candidate combinations, it seems it might be better to store all the permutations as the keys of a hash (with value 1 throughout) only once and start accepting candidates to check just by looking up in the hash. If using Linux, an additional optimisation could be to store the pre-calculated set of permutations in a Storable on the /shm (persistent shared memory) device, so that the hash of valid permutations doesn't even need to be calculated between runs of the program and can be loaded directly from persistent shared memory at the beginning of any given run.

    -M

    Free your mind

      There will be 10! entries in the the hash. Extrapolating the size it takes to store 10!/10, 10!/5 and 10!/2 10 character strings in a hash, I estimate the needed size to be around 170Mb. Trying to stuff 10! of those strings in a hash cause all the machines I tried it on to swap - with top showing a memory usage of around half a Gb of the running program.
      Perl --((8:>*
        I had thought of that - but 170 Mb would be okay for all my machines - one compromise might be to store say 5 digits (10P5) with a simple array of the remaining 5 being referenced as the value, e.g.:
        $hash{ '01234' } = \( 5..9 );
        This would reduce the storage by a factor of more than a hundred, while requiring that the last half of the digits would then need to be compared with the array being also pre-stored at each outer node of the hash, but with the same 120x performance for those last digits as compared with processing 10 without prestored results -- total performance improvement should be around 80x in this case (but PLUS the fixed overheads if you don't have linux shared memory to pre-store the solution)

        -M

        Free your mind

      If you have 350 MB available for storing the hash, this works out to be about 60% faster than tye's regex (excluding generation and build time).


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Determining uniqueness in a string.
by QM (Vicar) on Oct 03, 2005 at 20:58 UTC
    This reminded me of this, which makes me ask:
    How do you compute the de Bruijn sequence for a given alphabet?"

    Would it be useful for solving this problem (i.e., would there be any non-member substrings)?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Determining uniqueness in a string.
by creamygoodness (Curate) on Oct 03, 2005 at 23:49 UTC
    If you're gonna get this nuts about speed, you may as well fire up Inline C or XS. If you know XS, this solution is actually easier to grok than some of the more esoteric solutions above, as well as faster:

    !/usr/bin/perl use strict; use warnings; # fast_uniq.plx -- 10 unique digits in a 10-char string use Benchmark qw( cmpthese ); use Inline C => <<'END_C'; /* works for ASCII, EBCDIC, UTF-8 */ int offset; SV* digits_sv; void prime_variables (SV* input_sv, SV* zero_sv) { digits_sv = newSVsv(input_sv); char* zero_str = SvPV_nolen(zero_sv); offset = *zero_str; } int test_uniq_digits () { STRLEN digits_len = SvCUR( digits_sv ); char* string = SvPV( digits_sv, digits_len ); char test_buf[10] = { 0,0,0,0,0, 0,0,0,0,0 }; int i, index; for (i = 0; i < 10; i++) { index = *string - offset; if (index < 0 || index > 9) croak("illegal character: '%c'", *string); if (test_buf[index] == 1) return 0; test_buf[index] = 1; string++; } return 1; } END_C run_test("0123456789", "UNIQUE:\n"); run_test("1123456789", "NOT_UNIQUE:\n"); sub run_test { my ($digits, $message) = @_; prime_variables($digits, "0"); print $message; cmpthese( -5, { u1 => sub { for ( 0 .. 9 ) { return 0 if $digits !~ /$_/; } return 1; }, u1000 => sub { test_uniq_digits }, }); print "\n\n"; }

    Here's the output:

    UNIQUE: Rate u1 u1000 u1 10619/s -- -98% u1000 692527/s 6421% -- NOT_UNIQUE: Rate u1 u1000 u1 414417/s -- -42% u1000 712019/s 72% --
    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com

      Unless your XS version is quicker than my Inline C algorithm:

      __DATA__ __C__ int uniq( char* s ) { int i; int t = 0; for( i=0; i<10; i++ ) { t |= 1 << ( s[i] - 48 ); } return t == 1023; }

      which doesn't seem likely, then the big hash lookup will still be faster,


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
        With the caveat that your algo needs a variable offset or it will fail on EBCDIC machines, I like it better than mine. Your first and your last solutions are the best two on the page, IMHO.

        WRT the giant hash lookup: "That's nice, honey. Did you remember to feed the dog?" ;)

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://496566]
Approved by sk
Front-paged by monkfan
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (9)
As of 2014-12-26 01:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (163 votes), past polls