How can I get a numeric representation of a utf8 encoded string?
Essentially, you cannot for strings of any usable length.
Unicode strings stored in computers are essentially very big numbers stored in base 131072 (2^17), if you stick to just the very basic multilingual subset.
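To make the "very big number" idea concrete, here is a minimal sketch that treats each code point as one digit in base 131072. The names `BASE`, `string_to_int` and `int_to_string` are made up for illustration, and it only works because Python integers are arbitrary precision; a fixed 64-bit integer would overflow almost immediately.

```python
BASE = 131072  # 2**17, the per-character base used in the text (assumption of this sketch)

def string_to_int(s: str) -> int:
    """Treat each code point of s as one 'digit' in base BASE."""
    n = 0
    for ch in s:
        cp = ord(ch)
        if cp >= BASE:
            raise ValueError(f"code point U+{cp:X} does not fit in base {BASE}")
        n = n * BASE + cp
    return n

def int_to_string(n: int, length: int) -> str:
    """Inverse of string_to_int; the length is needed to restore leading NULs."""
    chars = []
    for _ in range(length):
        n, cp = divmod(n, BASE)
        chars.append(chr(cp))
    return "".join(reversed(chars))

if __name__ == "__main__":
    for text in ("a", "perl", "perl!"):
        value = string_to_int(text)
        print(repr(text), value, value.bit_length(), "bits")
```

Even a five-character ASCII string such as 'perl!' already needs more than 64 bits under this scheme, so the value is exact but not something you could keep in an ordinary integer column.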
With decimal numbers encoded as a string (e.g. '12345'), each digit can be 0-9, so the range of representable numbers grows by a factor of 10 for each extra digit. So by the time you've got 20 digits, you've exhausted the capacity of 64-bit integers. And if you move to floating point, you start losing accuracy after just 15 digits.
For hex numbers encoded as a string, each extra digit adds a factor of 16, so you exhaust 64-bit ints with only 16 digits.
With Unicode you have 128k possible values for each digit, so by the time you've got a 4-character string you've exceeded the capacity of a 64-bit int by a factor of 16 (2^68 against 2^64).
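The digit counts above are easy to check for yourself. This is just a sketch of the arithmetic; `digits_to_exhaust` is a hypothetical helper name, and 131072 is the per-character base assumed in the text.

```python
INT64_STATES = 2 ** 64  # number of distinct values a 64-bit integer can hold

def digits_to_exhaust(base: int, limit: int = INT64_STATES) -> int:
    """Smallest number of base-`base` digits whose possible values reach `limit`."""
    digits = 0
    combos = 1
    while combos < limit:
        combos *= base
        digits += 1
    return digits

if __name__ == "__main__":
    for base in (10, 16, 131072):
        print(base, digits_to_exhaust(base))
    # How far four base-131072 'digits' overshoot 64 bits
    print(131072 ** 4 // INT64_STATES)
    # Doubles cannot tell these two integers apart
    print(float(10 ** 16 + 1) == float(10 ** 16))
```

It reports 20 digits for decimal, 16 for hex and 4 for base 131072, with the 4-character case overshooting 64 bits by a factor of 16. The last line shows the floating-point problem: both integers collapse to the same double.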
So basically, you can give up on the idea of an accurate representation of a string by a number.
Then you move into the realm of 'lossy' representations. And there is a whole science (and a lot of bunkum) that attempts to produce 'comparative' numerical values from documents: numbers derived from text that will, when sorted numerically, tend to group the documents by similarity. These are used for applications such as plagiarism detection. A simple starting point which may lead you in many directions.
Another approach might be to use a 'running checksum' to detect sections of similarity. For that approach, the rsync algorithm is a useful starting point.
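For a feel of what a running checksum looks like, here is a rough sketch in the style of rsync's weak rolling checksum: two 16-bit sums that can be updated in constant time as the window slides one byte. The function names and the modulus `M` are choices made for this example, not taken from any particular implementation, and a real tool would confirm every candidate match with a strong hash.

```python
M = 1 << 16  # modulus for both running sums (assumption of this sketch)

def weak_checksum(block: bytes):
    """Compute the (a, b) pair for one block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, n: int):
    """Slide an n-byte window one position: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def rolling_checksums(data: bytes, n: int):
    """Yield (offset, a, b) for every n-byte window of data."""
    if n <= 0 or len(data) < n:
        return
    a, b = weak_checksum(data[:n])
    yield 0, a, b
    for i in range(n, len(data)):
        a, b = roll(a, b, data[i - n], data[i], n)
        yield i - n + 1, a, b

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog".encode("utf-8")
    for offset, a, b in rolling_checksums(text, 8):
        print(offset, a, b)
```

Two documents that share a run of text will produce the same (a, b) pair for the shared windows, which is what lets you find candidate sections of similarity cheaply.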
For the most part, if your goal is simply to save space in your db, you'd probably be better off using a simple compression algorithm.
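As a minimal sketch of the compression route, assuming the column can hold binary data: the example below uses Python's standard zlib module, and `pack` and `unpack` are hypothetical helper names.

```python
import zlib

def pack(text: str) -> bytes:
    """Encode as UTF-8 and compress at the highest zlib level."""
    return zlib.compress(text.encode("utf-8"), 9)

def unpack(blob: bytes) -> str:
    """Reverse of pack: decompress and decode back to a string."""
    return zlib.decompress(blob).decode("utf-8")

if __name__ == "__main__":
    text = "comparing sets of phrases stored in a database " * 50
    blob = pack(text)
    print(len(text.encode("utf-8")), "bytes before,", len(blob), "bytes after")
    assert unpack(blob) == text
```

Unlike a numeric fingerprint this is lossless, but the savings depend on the data: very short strings can come out larger than they went in, because of the fixed header and checksum overhead.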
Update: for completeness, you might find Re^3: Comparing sets of phrases stored in a database? enlightening.
