Re: Numeric representation of strings
by BrowserUk (Pope)
on Apr 24, 2014 at 14:47 UTC
How can I create unique integers to represent each string?
It's easy. The character representation of a string already is a unique numeric representation of that string.
Of course, the problem is that using the largest integers easily available to us -- 64-bit -- we can only accommodate the first 8 bytes of the string. After that, the values of the extra characters no longer fit into the 64-bit representation, so words like aardvark and aardvarks are indistinguishable.
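To make the 8-byte limit concrete, here is a minimal sketch in Python (not the code from the original post). The helper name `first8_as_int` is hypothetical, and it assumes ASCII input, big-endian byte order and NUL padding for short strings:

```python
import struct

def first8_as_int(s: str) -> int:
    """Read the first 8 bytes of s as one unsigned 64-bit big-endian integer."""
    # Assumption: ASCII input; longer strings are truncated, shorter ones NUL-padded.
    raw = s.encode("ascii")[:8].ljust(8, b"\0")
    return struct.unpack(">Q", raw)[0]

# The 9th character of "aardvarks" never reaches the integer.
print(first8_as_int("aardvark") == first8_as_int("aardvarks"))
# Strings that differ within the first 8 bytes still get different integers.
print(first8_as_int("aardvark") == first8_as_int("aardwolf"))
```

The first comparison is true and the second is false: everything past byte 8 is simply discarded.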
There are modules on CPAN that allow us to use larger integers -- salva has 128 & 256-bit integer modules -- which would double and quadruple the length of strings we can uniquely accommodate, but then the second problem becomes obvious. The numeric representations take up just as much space as the strings they represent. So nothing is gained.
Another possibility arises if the alphabet used for the strings is smaller than the character encoding they are stored in. I.e., if the strings only use (say) lower-case alpha characters and digits, then each of those 36 characters can be represented in 6 bits, so it becomes possible to pack the characters into the number so that a 64-bit integer can represent 10-byte strings rather than 8-byte strings; but as you can see, the gains are minimal.
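A rough Python sketch of that 6-bit packing follows. The names `pack6` and `unpack6` are hypothetical. It assumes the alphabet is exactly a-z plus 0-9, and it reserves code 0 for "no character" so that strings of different lengths cannot map to the same integer:

```python
import string

ALPHABET = string.ascii_lowercase + string.digits      # 36 symbols
# Codes 1..36 fit in 6 bits; 0 is reserved so "a" and "aa" pack differently.
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 64 // 6                                       # 10 characters per 64-bit integer

def pack6(s: str) -> int:
    """Pack up to 10 characters from ALPHABET into one integer, 6 bits each."""
    if len(s) > MAX_LEN:
        raise ValueError("at most 10 characters fit in 64 bits")
    n = 0
    for ch in s:
        n = (n << 6) | CODE[ch]   # KeyError if ch is outside the alphabet
    return n

def unpack6(n: int) -> str:
    """Invert pack6 by peeling off 6 bits at a time from the low end."""
    chars = []
    while n:
        chars.append(ALPHABET[(n & 63) - 1])
        n >>= 6
    return "".join(reversed(chars))

print(pack6("aardvark") != pack6("aardvarks"))   # 9 characters now fit
print(unpack6(pack6("aardvarks9")))
```

Ten characters use 60 of the 64 bits, so the mapping is reversible up to that length and no further.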
And finally, there is the possibility of a hashing algorithm to create a digest of the string. For example, the MD5 digest algorithm will 'reduce' strings of any length to a single 128-bit number.
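As an illustration only, this is roughly how an MD5 digest can be turned into one 128-bit integer using Python's standard `hashlib`. The function name `md5_as_int` is made up for this sketch, and UTF-8 encoding of the input is an assumption:

```python
import hashlib

def md5_as_int(s: str) -> int:
    """Return the 16-byte MD5 digest of s as a single 128-bit integer."""
    digest = hashlib.md5(s.encode("utf-8")).digest()   # always 16 bytes
    return int.from_bytes(digest, "big")

# The result has the same width whatever the input length.
for text in ("aardvark", "aardvarks", "a" * 10_000):
    print(f"{md5_as_int(text):032x}")
```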
But, just as the pack-based integerisation described above maps multiple strings to each number, the same is true of every digest/hashing algorithm in existence. So, any algorithm or logic based around using numerical digests to discover or ensure string uniqueness must be designed to accommodate the possibility of duplicate digests from disparate inputs.
Duplicates are a simple fact of life for all digest algorithms.
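One way to accommodate duplicate digests is to treat the digest only as a bucket key and always confirm a match against the original strings. The sketch below is an assumption-laden illustration, not a recommendation of a particular design: `tiny_digest` and `DigestIndex` are hypothetical names, and the digest is deliberately cut down to 8 bits so that collisions are guaranteed to appear.

```python
import hashlib
from collections import defaultdict

def tiny_digest(s: str) -> int:
    """Deliberately weak 8-bit digest (first byte of MD5) to force collisions."""
    return hashlib.md5(s.encode("utf-8")).digest()[0]

class DigestIndex:
    """Set of strings keyed by digest; collisions are resolved by comparing strings."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, s: str) -> bool:
        """Store s; return False if this exact string was already present."""
        bucket = self._buckets[tiny_digest(s)]
        if s in bucket:              # same digest AND same string
            return False
        bucket.append(s)             # same digest, different string: keep both
        return True

    def __contains__(self, s: str) -> bool:
        return s in self._buckets.get(tiny_digest(s), ())

index = DigestIndex()
words = [f"word{i}" for i in range(1000)]
for w in words:
    index.add(w)

# 1000 strings into at most 256 digests: some buckets must hold several strings.
print(max(len(b) for b in index._buckets.values()) > 1)
print("word42" in index, "missing" in index)
```

With a full 128-bit digest the buckets would almost always hold a single string, but the comparison step is what keeps the logic correct when they do not.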