Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re: Generating Unique numbers from Unique strings

by basiliscos (Pilgrim)
on Apr 03, 2016 at 13:28 UTC ( [id://1159428]=note: print w/replies, xml ) Need Help??


in reply to Generating Unique numbers from Unique strings

Indeed, you can use hashing for your string, and then compare by hash-value. If the hash result does not match, that means that stings does not match too; but it hash matches, the full comparison for strings should be performed

You don't have to use cryptography hashes (i.e. MD, SHA), I can recommend you to use Digest::MurmurHash

WBR, basiliscos.

Replies are listed 'Best First'.
Re^2: Generating Unique numbers from Unique strings
by Marshall (Canon) on Apr 03, 2016 at 14:25 UTC
    If the hash table is like: $hash{"String"}=1;, then if there is a hash collision, Perl compares the strings. All you need to know is if the string's key exists or is defined. Leave the "1" value out of the test, just look at the key. Different strings can indeed hash to the same hash value (which means hash "bucket number"). So a bucket can have more than one string/value pair.

    At one time, I knew the hash doubling algorithm, but I've forgotten and it also periodically changes. But when "too many" strings hash to the same bucket, Perl will increase the hash size to make more buckets and spread things out more. This all happens automatically and is fast.

      You are completely right: all the hash complexities are handled by perl. I've mention that to let the author of the original question got an impression, how it works, and, implement his own logic, if the standard hashes aren't enough for him.

      BTW, does perl caches the string hash value? I.e. is the following approach:

      my $index = $hash->{$string}; ... my $value1 = $array1->[$index]; my $value2 = $array2->[$index]; ... my $valueN = $arrayN->[$index];

      comparing to the following approach:

      my $value1 = $hash1->{$string}; my $value2 = $hash2->{$string}; ... my $valueN = $hashN->{$string};

      I.e. I know that number of keys is fixed and isn't changed at runtime, so, I sort them, and store their order in hash, and then in future I always use arrays to store/retrieve values associated with string.

      Is the first approach faster than the second?

      WBR, basiliscos.
        You asked: "BTW, does perl caches the string hash value?". No, I don't think so. And there is no need to do so because the C equation that calculates the integer hash value runs like a rocket! I've got C code that uses and benchmarks 2 of the recent Perl hash functions, but the computer that it is on, is dead right now! Bummer. To calculate the hash function is very, very fast. I experimented with different C formulations of the equation and found that gcc even on lowest optimize setting coded them essentially the same. On a modern Intel processor even a int*2 is as fast as a left shift 1. I was amazed, but decades of development have gone into that thing and literally millions of transistors!

        Before writing extra code that you think will make a performance difference, benchmark the code and see where you are at. Start with the most straight forward HLL code that implements a reasonable algorithm. Use the features of the language because they have been highly optimized and are likely to produce good results with a clear algorithm.

        An array access by index will be faster than a hash table lookup, but not by all that much unless you do this a bazillion times. BenchMark your code and see for yourself.

        Update: in my experience the 80/20 or even the 90/10 rule applies. 10% of the code does 90% of the "real work". Forget about optimizing the 90%, you must find the 10% where it really matters.

        > Is the first approach faster than the second?

        What prevents you from writing a short script using Benchmark and finding out yourself?

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thanks Bascillios and Marshall. I will check the string as hash key approach

      Honestly speaking i did not think of keying my incoming strings in hashes as i was not sure of performance hit when doing an if(exists $storage{"String"}), as if we can generate a unique number from String and then   if(exists $storage{uniq number}) should be obviously faster the searching a big string as hash key. But anyways i will try it out

        You will find that this makes no difference at all. A 200,000 entry hash is actually "not big" in the scheme of things. I have often worked with hashes that big. The hash algorithm calculates a number based upon the string. In Perl, the hash table itself is generated in powers of 2. So the hash "bucket" is just a matter of masking off X bits from this calculated hash integer and that essentially is used as an array index. If there are multiple strings that hashed to the same "bucket", then there are some string comparisons done. All of this code is written in C and runs really fast. Because of the way that the hash function is chosen, even very similar looking strings will have very different hash values. The calculation of this hashing integer is lighting fast, far faster than what you would need to do to get an absolutely unique value.

        Do some performance testing with your actual data and report back. I think you will be impressed and the code will be short.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1159428]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (6)
As of 2024-03-28 22:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found