http://www.perlmonks.org?node_id=881423


in reply to Re^4: How to reverse a (Unicode) string
in thread How to reverse a (Unicode) string

You seem to be confusing a "1-to-1" mapping with the identity function. While the identity function is a trivial "1-to-1" mapping, it's not true that every "1-to-1" mapping is the identity function.

However, even side-stepping that, Juerd doesn't mean that byte values map 1-to-1. The mapping applies after decoding. For instance, the UTF-8 byte sequence 0xC3 0x82 decodes to C2, which does indeed map to the Unicode code point U+00C2.
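
To make the "mapping is after decoding" point concrete, here is a minimal sketch. It is written in Python rather than Perl purely for illustration, and the variable names are my own. It decodes the two UTF-8 bytes for U+00C2 and checks which code point comes out; it also shows that the same two bytes in the opposite order are not valid UTF-8 at all.

```python
# UTF-8 encodes U+00C2 (LATIN CAPITAL LETTER A WITH CIRCUMFLEX) as C3 82.
raw = bytes([0xC3, 0x82])

# Decoding turns the two bytes into a single character.
decoded = raw.decode("utf-8")
code_points = [ord(ch) for ch in decoded]

# The same bytes in the opposite order start with a continuation byte,
# so a strict UTF-8 decoder rejects them.
try:
    bytes([0x82, 0xC3]).decode("utf-8")
    swapped_is_valid = True
except UnicodeDecodeError:
    swapped_is_valid = False

print([hex(cp) for cp in code_points], swapped_is_valid)
```

The byte values C3 and 82 never show up as code points; only the decoded value C2 does.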

Re^6: How to reverse a (Unicode) string
by ikegami (Patriarch) on Jan 10, 2011 at 15:50 UTC

    In that case, we're back to the original question. Are there any encodings that aren't "Unicode encodings"?

    (Strictly speaking, the mapping isn't 1-to-1. U+2660 can't be encoded in iso-8859-1. You could also say that both U+00E9 and U+0065 U+0301 encode to E9 in iso-8859-1, although Encode's encode doesn't handle that.)

      Strictly speaking, the mapping isn't 1-to-1. U+2660 can't be encoded in iso-8859-1
      The claim is that iso-8859-1 maps 1-to-1 to Unicode, not that Unicode maps 1-to-1 to iso-8859-1. A 1-to-1 mapping is also known as an injection. The claim wasn't that it's a bijection (aka 1-to-1 correspondence).
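
The injection-versus-bijection distinction in this exchange can be checked mechanically. The following is a sketch in Python (not Perl's Encode; the names are illustrative assumptions). It verifies that decoding iso-8859-1 never maps two byte values to the same code point, that U+2660 has no iso-8859-1 encoding, and that the decomposed form U+0065 U+0301 only reaches E9 if it is composed first, here with NFC normalization.

```python
import unicodedata

# Decoding direction: each of the 256 byte values yields a distinct character,
# so iso-8859-1 -> Unicode is an injection.
decoded = [bytes([b]).decode("latin-1") for b in range(256)]
is_injective = len(set(decoded)) == 256

# Encoding direction: U+2660 (BLACK SPADE SUIT) has no iso-8859-1 byte,
# so the mapping is not a bijection.
try:
    "\u2660".encode("latin-1")
    spade_encodable = True
except UnicodeEncodeError:
    spade_encodable = False

# The decomposed sequence e + COMBINING ACUTE ACCENT is not encodable as is;
# composing it to U+00E9 first makes it encode to the single byte E9.
composed = unicodedata.normalize("NFC", "e\u0301")
composed_bytes = composed.encode("latin-1")

print(is_injective, spade_encodable, composed_bytes)
```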
Re^6: How to reverse a (Unicode) string
by ikegami (Patriarch) on Jan 10, 2011 at 16:15 UTC
    No, actually, I'm not confused. When the term was introduced, it was given as the reason iso-8859-1 works without being decoded, so he indeed meant an identity mapping.
      You have to always decode. Note that Unicode is a list of integers with a meaning. iso-8859-1 is an encoding (of a subset of Unicode). UTF-8 is also an encoding. UTF-16 is another. It just happens that for the first 128 code points, the encodings in iso-8859-1 and UTF-8 are identical. But that wasn't part of Juerd's claim.
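
The overlap for the first 128 code points is easy to confirm. A small Python sketch, with names chosen only for this illustration: it compares the iso-8859-1 and UTF-8 encodings of every code point below 256.

```python
# Code points 0..127 encode to the same single byte in both encodings.
low_range_identical = all(
    chr(cp).encode("latin-1") == chr(cp).encode("utf-8") for cp in range(128)
)

# Code points 128..255 take one byte in iso-8859-1 but two bytes in UTF-8.
high_range_differs = all(
    chr(cp).encode("latin-1") != chr(cp).encode("utf-8") for cp in range(128, 256)
)

print(low_range_identical, high_range_differs)
```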

        You have to always decode.

        No, you don't have to with US-ASCII and iso-8859-1.

        But that wasn't part of Juerd's claim.

        I agree. He didn't mention any relation between the first 128 characters of iso-8859-1 and UTF-8. No idea why you bring this up.

        iso-8859-1 is an encoding (of a subset of Unicode)

        Unicode is a character set, not an encoding, so that sentence is broken.

        iso-8859-1 is both a character set and an encoding. The iso-8859-1 character set is a subset of the Unicode character set, but this property does NOT explain why iso-8859-1 works without being decoded.
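
What sets iso-8859-1 apart is that its decoding is the identity on numeric values: byte N becomes code point N. A hedged Python sketch of that property follows (Python is used only for illustration, and my reading is that this is what lets undecoded iso-8859-1 bytes appear to work when each byte value is treated as a code point). UTF-8 lacks the property, which the second half demonstrates.

```python
# For iso-8859-1, the decoded code point equals the byte value for all 256 bytes.
latin1_is_identity = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)

# For UTF-8 this fails: U+00E9 is encoded as two bytes, and reading those
# bytes as if each were a code point gives two different characters.
utf8_bytes = "\u00e9".encode("utf-8")
misread = [ord(ch) for ch in utf8_bytes.decode("latin-1")]

print(latin1_is_identity, [hex(b) for b in utf8_bytes], [hex(cp) for cp in misread])
```

Being a subset of Unicode is not enough for this to hold; the numbering also has to coincide.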