Re^8: How to reverse a (Unicode) string

in reply to Re^7: How to reverse a (Unicode) string
in thread How to reverse a (Unicode) string

The problem isn't one of characters versus bytes. The problem is the definition of character in the context of Unicode text. The scalar reverse function and other built-in string functions operate on Unicode text using a naïve and inadequate definition of character. Pointing this out and offering a workaround is the raison d'être of moritz's 2008 tutorial.

The issue of what reverse does when fed, say, the bytes of a JPEG image is utterly irrelevant to this discussion, which is about Unicode text. I don't understand ikegami's insistence on folding unrelated contexts into this discussion. Your reply dramatizes how ikegami's contrarian non sequitur needlessly confused the simple and self-evident conclusion I drew in my post.

Here's what I wrote:

The documentation of Perl's reverse function states: "In scalar context, [the reverse function] ... returns a string value with all characters in the opposite order." But it doesn't, at least not for a sufficiently modern, multilingual, Unicode-conformant definition of "character." It reverses Unicode code points, not characters in the usual, well-understood sense of the word.

One or the other is wrong: the behavior of the reverse function or the reverse function's documentation.

If I understand the design principles of Perl correctly, the reverse function should properly reverse extended grapheme clusters when the thing being reversed is Unicode text (and Perl understands it is Unicode text), and it should reverse bytes otherwise.
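To make the distinction concrete, here is a minimal sketch (not the tutorial's code) that contrasts the built-in scalar reverse with a reversal over extended grapheme clusters matched by `\X`. The sample string and the helper `code_points` are illustrative assumptions, not anything from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample: "q", "e", then U+0301 COMBINING ACUTE ACCENT.
# Three code points, but two user-perceived characters.
my $text = "qe\x{301}";

# Built-in scalar reverse works on code points, so the accent ends up
# first, detached from the "e" it belonged to.
my $by_code_point = scalar reverse $text;

# \X matches one extended grapheme cluster; reversing the list of
# matches keeps each base character together with its combining marks.
my $by_grapheme = join '', reverse $text =~ /\X/g;

# Illustrative helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord $_ } split //, $s;
}

print 'code points: ', code_points($by_code_point), "\n";
print 'graphemes:   ', code_points($by_grapheme),   "\n";
```

The first line of output starts with U+0301, so the accent no longer follows its base letter. The second keeps U+0065 U+0301 together ahead of U+0071.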


Replies are listed 'Best First'.

Re^9: Repurposing reverse
by ikegami (Patriarch) on Jan 31, 2011 at 19:01 UTC

The problem is the definition of character in the context of Unicode text.

No, I fully agree with you on the definition of character in the context of Unicode text.

At issue is that reverse cannot recognise the presence of Unicode text. How do you think reverse can tell the difference between chr(113).chr(101).chr(769) and "qe\N{COMBINING ACUTE ACCENT}"?

It can either always treat the string as Unicode text, or never. Currently, it never does. To change that is backwards incompatible, so you'd have to demonstrate a bug in order to change that behaviour.
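A minimal sketch of the point about the two constructions, assuming nothing beyond core Perl: both expressions produce the same string value, so a function that receives either one has nothing to distinguish them by. The variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';    # lets \N{...} names work on older perls too

my $from_chr   = chr(113) . chr(101) . chr(769);
my $from_named = "qe\N{COMBINING ACUTE ACCENT}";

# Same code points and same length: the value carries no record of how
# it was built or whether its author thinks of it as text.
my $same_string = $from_chr eq $from_named ? 1 : 0;
my $same_length = length($from_chr) == length($from_named) ? 1 : 0;

# Consequently scalar reverse must return the same result for both.
my $same_reverse =
    scalar(reverse($from_chr)) eq scalar(reverse($from_named)) ? 1 : 0;

print "same string:  $same_string\n";
print "same length:  $same_length\n";
print "same reverse: $same_reverse\n";
```

All three checks print 1, which is why any change in behaviour would have to apply to every string or to none.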


         samples  text
         -------  -------
current  ok       not ok
string   ok       not ok
unicode  not ok   ok

Your whole argument for the presence of a bug is that reverse's use of "character" could be confused with Unicode's definition of the word.

One or the other is wrong: the behavior of the reverse function or the reverse function's documentation.

Those are the only two options if and only if reverse's documentation uses the same definition of "character" as the Unicode standard.

Update: Added code.


In Section Tutorials