PerlMonks  

Re^7: How to reverse a (Unicode) string

by Anonymous Monk
on Jan 31, 2011 at 15:49 UTC (#885298)


in reply to Re^6: Repurposing reverse
in thread How to reverse a (Unicode) string

So we're in agreement, the documentation for reverse needs to be updated to clarify what it does, right?

perlunicode:
    "And finally, scalar reverse() reverses by character rather than by byte."
perldoc -f reverse:
    "In scalar context, concatenates the elements of LIST and returns a string value with all characters in the opposite order."
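To make the "by character rather than by byte" wording concrete, here is a minimal sketch. It assumes a decoded character string on one side and its UTF-8 encoding on the other; the variable names are only illustrative. Note that "character" here means a single element of the string (a code point once the string is decoded), which is the point under dispute in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "café" as a decoded character string: 4 elements, the last is U+00E9.
my $chars = "caf\x{E9}";

# The same text as UTF-8 octets: 5 elements (c, a, f, 0xC3, 0xA9).
my $bytes = encode('UTF-8', $chars);

# Scalar reverse on the character string reverses its 4 elements.
my $rev_chars = reverse $chars;

# Scalar reverse on the octet string reverses its 5 elements, so the
# two-byte sequence for U+00E9 comes out as 0xA9 0xC3 (invalid UTF-8).
my $rev_bytes = reverse $bytes;

printf "%-10s %s\n", 'characters', join ' ', map { sprintf '%04X', ord($_) } split //, $rev_chars;
printf "%-10s %s\n", 'bytes',      join ' ', map { sprintf '%02X', ord($_) } split //, $rev_bytes;
```

In both cases reverse does the same thing: it reverses whatever elements the string holds. Whether those elements are code points or bytes depends on whether the input was decoded first.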


Re^8: Repurposing reverse
by ikegami (Pope) on Jan 31, 2011 at 16:10 UTC

    Second one first: I've already asked Jim what he thinks would be a clearer name for an element of a string. But,

    The first one is a counterargument to what I said. With that, one could declare reverse to be buggy because it should be reversing text, and make it so it reverses sequences of graphemes, which is a much more common use case.

    Update: Didn't realize I wasn't replying to Jim. Fixed.
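For readers wondering what "reverses sequences of graphemes" would look like in practice, the following is a minimal sketch of one common approach, not a proposed change to the builtin. The helper name reverse_graphemes is hypothetical; it relies on the regex escape \X, which matches one grapheme cluster.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse a decoded character string one grapheme
# cluster at a time. /\X/g in list context returns every cluster, and
# list-context reverse flips their order before they are joined again.
sub reverse_graphemes {
    my ($text) = @_;
    return join '', reverse $text =~ /\X/g;
}

my $text = "qe\x{301}";    # "q", "e", COMBINING ACUTE ACCENT

# Builtin scalar reverse: the accent ends up first, detached from "e".
my $by_code_point = reverse $text;

# Grapheme-wise reverse: "e" keeps its accent and "q" moves to the end.
my $by_grapheme = reverse_graphemes($text);
```

The builtin returns the code points U+0301, U+0065, U+0071, while the helper returns U+0065, U+0301, U+0071.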

      I'm an Anonymous Monk who suggested someone submit a perlbug report. I don't really follow everything you two are saying.

        I intend to submit a perlbug report. I've never used perlbug before, so I want to take the time to do it right and not botch it. I get the sense the Perl maintainers don't suffer fools lightly, so I don't want to be foolish.

        The only thing I am saying in this discussion is that several of Perl's built-in string functions (e.g., reverse) are not sufficiently Unicode-conforming. Today, a modern scripting language whose traditional strength is text processing needs to be fully capable of handling the complex richness of Unicode with aplomb. An important example of something a modern scripting language needs to be able to do is this: When it reverses Unicode text, it must do it correctly by grapheme clusters, not incorrectly by code points.

Re^8: How to reverse a (Unicode) string
by Jim (Curate) on Jan 31, 2011 at 18:32 UTC

    The problem isn't one of characters versus bytes. The problem is the definition of character in the context of Unicode text. The scalar reverse function and other built-in string functions operate on Unicode text using a naïve and inadequate definition of character. Pointing this out and offering a workaround is the raison d'être of moritz's 2008 tutorial.

    The issue of what reverse does when fed, say, the bytes of a JPEG image is utterly irrelevant to this discussion, which is about Unicode text. I don't understand ikegami's insistence on folding unrelated contexts into this discussion. Your reply dramatizes how ikegami's contrarian non sequitur needlessly confused the simple and self-evident conclusion I made in my post.

    Here's what I wrote:

    The documentation of Perl's reverse function states: "In scalar context, [the reverse function] ... returns a string value with all characters in the opposite order." But it doesn't, at least not for a sufficiently modern, multilingual, Unicode-conformant definition of "character." It reverses Unicode code points, not characters in the usual, well-understood sense of the word.
    One or the other is wrong: the behavior of the reverse function or the reverse function's documentation.
    If I understand the design principles of Perl correctly, the reverse function should properly reverse extended grapheme clusters when the thing being reversed is Unicode text (and Perl understands it is Unicode text), and it should reverse bytes otherwise.

      The problem is the definition of character in the context of Unicode text.

      No, I fully agree with you on the definition of character in the context of Unicode text.

      At issue is that reverse cannot recognise the presence of Unicode text. How do you think reverse can tell the difference between chr(113).chr(101).chr(769) and "qe\N{COMBINING ACUTE ACCENT}"?
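The two expressions in that question can be compared directly. This is a small sketch with illustrative variable names; it shows only that both expressions produce the same string value, so nothing in the value itself records whether the author meant "three numbers" or "text".

```perl
use strict;
use warnings;
use charnames ':full';    # enables \N{...} names on older perls

my $from_numbers = chr(113) . chr(101) . chr(769);
my $from_text    = "qe\N{COMBINING ACUTE ACCENT}";

# Both strings hold the same three elements: 113, 101, 769.
my $same        = $from_numbers eq $from_text ? 1 : 0;
my @code_points = map { ord($_) } split //, $from_text;

# Consequently reverse returns the same result for both.
my $same_reversed = reverse($from_numbers) eq reverse($from_text) ? 1 : 0;

print "identical: $same, elements: @code_points, identical reversed: $same_reversed\n";
```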

      It can either always treat the string as Unicode text, or never. Currently, it never does. To change that is backwards incompatible, so you'd have to demonstrate a bug in order to change that behaviour.

                  samples   text
                  -------   -------
      current     ok        not ok
      string      ok        not ok
      unicode     not ok    ok
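The trade-off in that comparison can be sketched in code. This reading is an assumption: it takes the "samples" column to mean a string used as a container of arbitrary numeric elements rather than as text, and it stands in for a Unicode-aware reverse with a \X-based reversal. Neither detail is confirmed by the post.

```perl
use strict;
use warnings;

# Assumption: a string holding numeric samples, not text.
my @samples = (113, 101, 769);
my $packed  = join '', map { chr($_) } @samples;

# Current behaviour: the elements come back in exactly the opposite
# order, which is what a caller storing samples expects.
my @current = map { ord($_) } split //, scalar reverse $packed;

# Hypothetical grapheme-aware reverse: 101 and 769 form one cluster
# ("e" plus a combining accent), so they stay together and the result
# is no longer the element-wise reversal of the samples.
my @grapheme = map { ord($_) } split //, join('', reverse $packed =~ /\X/g);

print "current:  @current\n";
print "grapheme: @grapheme\n";
```

Under these assumptions the element-wise reversal is right for the samples and wrong for the text "qé", while the grapheme-wise reversal is the other way around.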

      Your whole argument for the presence of a bug is that reverse's use of "character" could be confused with Unicode's definition of the word.

      One or the other is wrong: the behavior of the reverse function or the reverse function's documentation.

      Those are the only two options if and only if reverse's documentation uses the same definition of "character" as the Unicode standard.

      Update: Added code.
