Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Re: Numeric representation of strings

by BrowserUk (Pope)
on Apr 24, 2014 at 14:47 UTC ( #1083619=note: print w/replies, xml ) Need Help??

in reply to Numeric representation of strings

How can I create unique integers to represent each string?

It's easy. The character representation of a string already is a unique numeric representation of that string.

C:\test>perl -nle"printf qq[%-20s : %u\n], $_, unpack 'Q', pack 'Z8', +$_" words.txt aa : 24929 aah : 6840673 aahed : 431198069089 aahing : 113723912511841 aahs : 1936220513 aal : 7102817 aalii : 452740276577 aaliis : 126896577470817 aals : 1936482657 aardvark : 32195308464267617 aardvarks : 32195308464267617 aardwolf : 30521856061759841 aardwolves : 30521856061759841 aargh : 448412148065 aarrgh : 114793511018849 aarrghh : 29388191088927073 aas : 7561569 aasvogel : 28542701074080097 aasvogels : 28542701074080097 ab : 25185 aba : 6382177 abaca : 418279154273 abacas : 126862116348513 abaci : 452638892641 aback : 461228827233 abacterial : 32199697902953057 abacus : 126948015694433 abacuses : 28555920663470689 abaft : 499933864545 abaka : 418413372001 abakas : 126862250566241 abalone : 28550397486522977 abalones : 28550397486522977 ...

Of course, the problem is that using the largest integers easily available to us -- 64-bit -- we can only accommodate the first 8 bytes of the string. After that, the value of the extra characters no longer fit into the 64-bit representation and so words like aardvark and aardvarks are indistinguishable.

There are modules on cpan that allow us to use larger integers -- salva has 128 & 256-bit integer modules -- which would double and quadruple the length of strings we can uniquely accommodate, but then the second problem becomes obvious. The numeric representations take up just as much space as the strings they represent. So nothing is gained.

Another possibility arises if the alphabet used for the strings is smaller than the character code representation it uses. Ie. If the string only use (say) lower-case alpha characters and digits, then those 36 characters can be represented in 6-bits, so it becomes possible to pack the characters into the number so that a 64-bit integer can represent 10 byte strings rather than 8 byte strings; but as you can see the gains are minimal.

And finally, there is the possibility of a hashing allgorithm to create a digest of the string. For example, the MD5 digest algorithm will 'reduce' strings of any length to a single, 128-bit number.

But, just as the pack based integerisation shown above mapped multiple strings to each number, the same is true of every digest/hashing algorithm in existence. So, any algorithm or logic based around using numerical digests to discover or ensure string uniqueness must be designed to accommodate the possibility of duplicate digests from disparate inputs.

Duplicates are a simple fact of life for all digest algorithms.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1083619]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (7)
As of 2017-08-23 08:37 GMT
Find Nodes?
    Voting Booth?
    Who is your favorite scientist and why?

    Results (350 votes). Check out past polls.