Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

number of unique characters in a string

by Anonymous Monk
on Mar 01, 2003 at 03:35 UTC ( #239632=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

OK, I have a nice, easy question that will let you show off your cleverness (I could do this easily by brute force, but I want some efficiency, as I will have to do it thousands of times).

I have some random strings (actually hex strings, but I would like to see a more general case). I want to find the number of unique characters in the string.

f(0101010101) should return 2
f(1234567812) should return 8

Comment on number of unique characters in a string
Re: number of unique characters in a string
by Limbic~Region (Chancellor) on Mar 01, 2003 at 03:54 UTC
    Here is OWTDI - but I bet by the time I finish this post, some other monk will come up with a faster, more Perlish way.

    #!/usr/bin/perl -w use strict; print UniqueCount("1234567812") . "\n"; sub UniqueCount { my $string = shift; if ($string) { my %unique; return grep { !$unique{$_}++ } split //, $string; } else { return 0; } }

    You will want to modify if you do not want to count spaces or if you want to be case insensitive

    Hope this helps - happy hacking - L~R

    Update:Dropped un-needed @array in grep

Re: number of unique characters in a string
by OM_Zen (Scribe) on Mar 01, 2003 at 04:00 UTC
    Hi ,

    The Hash can hold only unique keys and hence you can store the key as the split string .

    my $txt1 = "0101010101"; my $count; my %hashofuniq; my @txt1 = split ('',$txt1); @hashofuniq{@txt1} = 1; foreach (keys %hashofuniq){ $count++; } print "[$count]\n";


      You're on the right track, but I think this could be made more perl-ish again ...

      my $text = '0101010101'; my %hash; @hash{ split '', $text } = 1; print scalar keys %hash, "\n";

       

      perl -le 'print+unpack("N",pack("B32","00000000000000000000001000110111"))'

        ++ for teaching me a new hash syntax tidbit: I never knew about the

        @hash{@list_of_keys} = @list_of_values;

        syntax. Thanks!

        my $text = '0101010101'; my %hash; @hash{ split '', $text } = 1; print scalar keys %hash, "\n";
        Although this code above will work correctly, it will not be doing so the way that you expected which could lead you to trouble.
        @hash{@array} = $scalar
        only sets $scalar as the value for the first key in %hash. To set all the keys, you'd want to do something like:
        @hash{@array} = ($scalar) x scalar @array;
        In the actual code above, since you don't check values anyway, you can just write:
        @hash{@array} = ();
Re: number of unique characters in a string
by pfaut (Priest) on Mar 01, 2003 at 04:10 UTC
    sub f { my $l = ''; my $u; for (sort split(//,$_[0])) { $u++,$l=$_ if ($_ ne $l); } return $u; }

    Update: I posted this because it was faster than a hash based solution but OM_Zen's idea with rob_au's adjustment is even faster. I'd bench merlyn's Inline::C version, but I don't have Inline::C installed...

    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
•Re: number of unique characters in a string
by merlyn (Sage) on Mar 01, 2003 at 04:14 UTC
    Well, since you said you're going to do this thousands of times...
    use Inline C =>; print UniqueCount("0101010101"), "\n"; print UniqueCount("1234567812"), "\n"; __END__ __C__ int UniqueCount(unsigned char *str) { char counts[256]; int i; int result; /* clear the array */ for (i = 0; i <= 255; i++) { counts[i] = (char) 0; } /* notice the characters */ while (*str) { counts[*str++] = 1; } /* now count the results */ result = 0; for (i = 0; i <= 255; i++) { result += counts[i]; } return result; }
    Apologies for my rusty C... I've been coding in Perl too long (however, it worked on the first compile!).

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: number of unique characters in a string
by Coruscate (Sexton) on Mar 01, 2003 at 04:16 UTC

    How about something like this?

    my $txt = "abcdecfd"; my %same = map { $_, 1 } split //, $txt; print scalar keys %same, ' unique: [', keys %same, "]\n";


    If the above content is missing any vital points or you feel that any of the information is misleading, incorrect or irrelevant, please feel free to downvote the post. At the same time, reply to this node or /msg me to tell me what is wrong with the post, so that I may update the node to the best of my ability. If you do not inform me as to why the post deserved a downvote, your vote does not have any significance and will be disregarded.

Re: number of unique characters in a string
by tall_man (Parson) on Mar 01, 2003 at 04:38 UTC
    No one has tried with regular expressions yet, so:
    sub f { my $txt = join('', sort(split //,shift)); $txt =~ s/(.)\1+/$1/g; return length($txt); }

      You could replace the s/// with tr///s

      sub f { my $txt = join('', sort(split //,shift)); $txt =~ tr/\x00-\xff//s; return length($txt); }

      Examine what is said, not who speaks.
      1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
      2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
      3) Any sufficiently advanced technology is indistinguishable from magic.
      Arthur C. Clarke.

      This thread is a bit old, but here is my take on the problem using only a regex. I needed to check a string that was entered as a new password for several format restrictions including having at least five different characters. I was adding this to existing code that expected a compiled regex to test the string so I didn't have the option of using additional commands. The matching regex just had to succeed or fail.

      use Data::Dumper; my $unqchar5_regex = qr/^(?{%_=()})(?:(.)(?{$_{$1}++}))+ (??{(scalar(keys %_)<5)?~$1:''})$/x; my $pswd = "abcdef"; print (($pswd =~ $unqchar5_regex)?"Pass\n":"Fail\n"); print Dumper \%_; my $pswd = "abcdbd"; print (($pswd =~ $unqchar5_regex)?"Pass\n":"Fail\n"); print Dumper \%_;

      The value of %_ is just kind of a bonus because rather than just setting the char as a hash element I increment it so you end up with a character count. I use %_ because it is defined globally by default. If you are afraid of collisions with its use then you can give the hash a different name but you'll have to define that variable somewhere in the code if 'use strict vars' is on.

      Here's what is going on:

      qr/^                             # Beginning of string.
          (?{%_=()})                   # Clear the counting hash (zero width op).
          (?:                          # Group the matching of the character with the setting 
                                       #   of the count hash, but don't collect the value.
             (.)                       # Match just one char and collect it.
             (?{$_{$1}++})             # Use the char as the key in the hash and count it (zero width op).
          )+                           # Do the collect and count for as many chars as we have.
          (??{                         # Eval this code and use its val as a pattern (zero width op).
              (scalar(keys %_)<5)      # Perform test for number of unique characters.
                                 ?~$1  # If not what we want, fowl the pattern to make the match fail.
                                 :''   # Otherwise don't change the pattern so match succeeds.
             })
         $/x;                          # Anchor end of line (x to break up the pattern).
      

      Really any boolean test can be performed on the hash. Just have it evaluate to blank ('') if the regex should succeed, ~$1 if it should fail. The tilde (~) on ~$1 is the bitwise compliment operator, therefor whatever character (byte value) is in $1, ~$1 is guaranteed to NOT match. I needed to do this because my list of valid characters was ALL characters.

      Here is a version with some debugging output:

      my $unqchar5_regex = qr/^(?{%_=()})(?:(.)(?{$_{$1}++}))+ (?{warn Dumper \%_}) (??{warn scalar(keys %_)." $1\n"; (scalar(keys %_)<5)?~$1:'' })$/x;

      If you run this you will see that the single char collection steps through and matches the entire string. Then if the boolean test doesn't add to the pattern, it's done, all matched, success. Otherwise if the test expression adds the value of ~$1 to the pattern, we are out of characters so the match is failing, but the regex engine backs up the string to be sure. Since it can't make a match (because $1 != ~$1) it fails.

      In the end, we ended up not using this code at all because the password cracking tester did it. :P Hopefully this will be of use to someone.

Re: number of unique characters in a string
by BrowserUk (Pope) on Mar 01, 2003 at 04:49 UTC

    Updated To correct error introduced whilst sanitising for public consumption.

    It would be interesting to see how this would compare with merlyn's Inline::C version for performance. Unfortunately, I still don't have that ability (yet).

    One point to note, this version should also work for unicode strings. If you don't need that, then swapping the unpack template from 'U*' to 'C*' should do the trick and might be a tad quicker.

    #! perl -slw use strict; sub nUniqC{ my @uniq; ## Updated. @uniq[unpack 'U*',$_[0]] = (1)x length $_[0]; scalar (grep defined $_, @uniq); } print nUniqC '1010101010'; print nUniqC '1234567812';

    For speed in perl and assuming no utf-8, this seems a bit quicker.

    sub nUniqC{ my @uniq; scalar grep{ ++$uniq[$_] == 1 } unpack('C*',$_[0]); }

    Update Trimmed another 15% of fat. No I didn't. I misread bad data, the version above is quickest.

    sub nUniqC{ scalar grep{ ++$_[$_] == 1 } unpack('C*',$_[0]); }

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

        Limbic~Region++ for catching my stupidity.

        I updated. Try it now if you have the time. I know Merlyn's will scream relatively, it would be interesting to see just how much quicker it is? Not having had the chance to make this sort of comparison means I have no handle on the differences.


        ..and remember there are a lot of things monks are supposed to be but lazy is not one of them

        Examine what is said, not who speaks.
        1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
        2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
        3) Any sufficiently advanced technology is indistinguishable from magic.
        Arthur C. Clarke.
Re: number of unique characters in a string
by Limbic~Region (Chancellor) on Mar 01, 2003 at 05:32 UTC
    I intend to benchark all of these 1 Mar 03, and update this node with the results - I found this quite interesting and will include all unique methods in the bench.

    Cheers - L~R

    UPDATE:

    I decided not to bench the results of the methods I found on the Internet here, because the results were already not fitting on the screen correctly. Since merlyn is the obvious winner, I didn't feel it would add that much value, though you are welcome to do your own bench.

    The first thing I did was create a data file consisting of 1000 lines of random ASCII strings between 10 and 250 characters long (also randomly determined). The code I used to do this isn't perfect, but it can be found here:

    The next thing I did was to benchmark 150 iterations, which would actually yields 150,000 attempts at various string lengths. I had to modify the methods just a little bit to work since I would be reading from a file instead of passing a parameter to a sub - if you disagree with how I modified your entry, let me know and I will update it. You can see the bench.pl script here:

    And, what you have all been waiting for - the results:

    I am just glad mine wasn't the slowest since I posted first.
    Cheers - L~R

    Update 2:
    Per BrowserUk's suggestion, I have removed the @_ = $_; assignments in all but Br_UK-5. I could not figure out how to make this work the first time around and originally decided to make them uniform. It doesn't appear that this has affected his results at all. As far as his comments concerning version 3 being faster than version 4, I can only postulate that it is because of the variance in the data samples. I do not feel that testing the same string repeatedly is a fair test since one could memoize the function making it much faster (maybe even fast enough to compete with merlyn *grin*). I will forward anyone a copy of my data file that would like to test my results on their platform. The most current code and results are posted. L~R

      Nice benchmark.

      I really surprised that you measured Buk-3 as quicker tha Buk-4, that contradicts mine own testing and Jenda's, which is strange.

      Also, you are unfairly penalising those routines where you assign, @_=$_; better to substitute $_ for shift or $_[0] as appropriate.

      All inconsequential in the light of the speed of the Inline::C version, unless you haven't got Inline::C available to you:)


      Examine what is said, not who speaks.
      1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
      2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
      3) Any sufficiently advanced technology is indistinguishable from magic.
      Arthur C. Clarke.
Re: number of unique characters in a string
by hgolan30 (Initiate) on Mar 01, 2003 at 10:33 UTC
    use Data::Dumper; my $number=1234567812; my @tmp=split('',$number); my %count; foreach(@tmp){ $count{$_}=$count{$_}+1; } print Dumper \%count;
BENCHMARK Re: number of unique characters in a string
by Jenda (Abbot) on Mar 01, 2003 at 18:37 UTC

    I tried to come up with my own solutions and benchmark them against the solutions proposed by others. I did expect the Inline::C version to be much quicker, but the difference is even bigger than I expected.

      I expected 5x or maybe 10x faster for the C version but 22...

      One thing to note. The original version of nUniqC() that uses the local array is quicker than the one (re)-using @_ despite my claims for the latter. Turned out it was an artifact of my own testing that gave me false readings. Not that it is going to get even close to merlyn's.

      It's good to get a feel for the difference, and will make me renew my efforts to build my own version of Perl so that I will have that facility available to me.


      Examine what is said, not who speaks.
      1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
      2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
      3) Any sufficiently advanced technology is indistinguishable from magic.
      Arthur C. Clarke.

        I should have tried that. I did change one of mine from using a lexical array to @_ and noticed a slowdown, but did not think about trying to modify nUniqC().

        Shame Inline::C doesn't play well with PerlApp (yet?). I could use it as an "inline assebler" as I once did in Pascal if this worked. (Is there anyone who did not design and implement his/her own menu&window system for DOS/character terminal? :)

        Actually ... does Inline::C work with PAR? I mean does PAR take the compiled C code and include the compiled library in the archive?

        Jenda

Re: number of unique characters in a string
by perlguy (Deacon) on Mar 04, 2003 at 16:12 UTC
    similar to some of the above requests, but how about a quick and easy:
    print number_of_uniques('ppppppperlguy') . "\n"; sub number_of_uniques { my %s; return scalar (grep !$s{$_}++, split '', $_[0]); }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://239632]
Approved by Limbic~Region
Front-paged by tall_man
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (4)
As of 2014-09-20 19:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (160 votes), past polls