Re^2: Repurposing reverse

by ikegami (Pope)
on Jan 24, 2011 at 06:50 UTC (#883864)


in reply to Re: How to reverse a (Unicode) string
in thread How to reverse a (Unicode) string

It reverses Unicode code points

No. It doesn't know anything about Unicode and there's no requirement for the string to be Unicode text.

not characters in the usual, well-understood sense of the word.

In my experience, "character" is the constituent element of a string, and never a grapheme except by happenstance. Let's just say there is no such consensus. Regardless, that's the definition used here.

If I understand the design principles of Perl correctly, the reverse function should properly reverse extended grapheme clusters

No. reverse provides a vital string operation. It should not assume the string is Unicode text. Reversing text is also a useful function, but it is not provided by reverse.

Same goes with substr and length.
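A minimal sketch of the point above, assuming a decomposed spelling of "Café" (the variable names are illustrative only). It shows that length, substr and scalar reverse all work on the string's elements, not on graphemes:

```perl
use strict;
use warnings;

# "Cafe" followed by U+0301 COMBINING ACUTE ACCENT: five string elements
# that a reader sees as the four-letter word "Café".
my $s = "Cafe\x{301}";

my $len  = length $s;        # counts elements, not graphemes
my $last = substr $s, -1;    # the last element is the combining mark
my $rev  = reverse $s;       # scalar assignment, so reverse acts on the string

printf "length: %d\n", $len;
printf "last element: U+%04X\n", ord($last);
printf "reversed: %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $rev;
```

The reversed string starts with the combining mark, which is exactly the behaviour being debated in this thread.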

when the thing being reversed is Unicode text (and Perl understands it is Unicode text), and it should reverse bytes otherwise.

Perl doesn't have a means of "understanding a string is Unicode text".


Re^3: How to reverse a (Unicode) string
by Jim (Curate) on Jan 31, 2011 at 08:10 UTC
    The documentation of Perl's reverse function states: "In scalar context, [the reverse function] ... returns a string value with all characters in the opposite order."
    No. It doesn't know anything about Unicode and there's no requirement for the string to be Unicode text.

    The documentation is quite clear: In scalar context, the reverse function operates on strings. If the string is Unicode, it reverses Unicode code points. If the string is in some single-byte character encoding such as ISO 8859-1 (Latin 1), then it reverses those characters. It's really very straightforward.

    You're once again trying to make some esoteric point about the distinction between strings and bytes, and what is Unicode and what is not Unicode. But your peculiar, persistent point isn't relevant here.

    Nothing in my post is incorrect or inaccurate, yet the supercilious tone of your response wrongly implies that something is incorrect. The topic of this discussion is Unicode text, so of course I'm talking about Unicode text in it.

    In my experience, "character" is the constituent element of a string, and never a grapheme except by happenstance. Let's just say there is no such consensus.

    What is a character in a language is well-understood and rarely, if ever, subject to debate. In the case of Unicode, "character" is well-defined, too: It's a "grapheme." It's as simple as that. Read the Unicode Standard.

    There are four characters in the word "Café", not five. When you arrange the four characters in the opposite order, you get "éfaC". This fact is the very basis of this tutorial and discussion.
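As a rough illustration of the "four, not five" claim: whether length reports four or five depends on the normalization form, while counting \X matches gives four either way. This sketch assumes the core Unicode::Normalize module; count_graphemes is a hypothetical helper, not something from the thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = NFC("Caf\x{e9}");   # é as the single code point U+00E9
my $decomposed = NFD("Caf\x{e9}");   # e followed by U+0301

# Hypothetical helper: count grapheme clusters by counting \X matches.
sub count_graphemes {
    my ($str) = @_;
    my $n = () = $str =~ /\X/g;
    return $n;
}

printf "composed:   length %d, graphemes %d\n",
    length($composed), count_graphemes($composed);
printf "decomposed: length %d, graphemes %d\n",
    length($decomposed), count_graphemes($decomposed);
```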

    No. reverse provides a vital string operation. It should not assume the string is Unicode text.

    Your insistence that there's a difference between "string" and "text" may have some strange basis in the arcane details of the internals of Perl, but it has no relevance to this discussion. The whole premise of this discussion is that we're trying to reverse Unicode strings (Unicode text). And the salient point about Perl's reverse function is that it fails to reverse properly an infinite number of possible Unicode strings.

    Reversing text is also a useful function, but it is not provided by reverse.

    This is a bizarre and incomprehensible statement.

    What's the difference between a "string" and "text" to someone writing a Perl program?

    Explain why you think this…

    use utf8;
    
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $Moonshine = "Rươòu ðêì";
    my $enihsnooM = reverse $Moonshine;
    
    print "$Moonshine\n";
    print "$enihsnooM\n";
    

    …should produce different output than this…

    use utf8;
    
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $Moonshine = "Rươòu ðêì";
    my $enihsnooM = join '', reverse $Moonshine =~ m/\X/g;
    
    print "$Moonshine\n";
    print "$enihsnooM\n";
    

      The documentation is quite clear: In scalar context, the reverse function operates on strings

      Yes.

      You're once again trying to make some esoteric point about the distinction between strings and bytes,

      No. Quite the opposite, I'm saying there is no distinction. reverse doesn't know or care what the string is, and has no way of knowing. It will reverse the characters (elements?) of the string.

      What is a character in a language is well-understood and rarely, if ever, subject to debate

      You opened the debate! Again, a common CS definition is used here: A string is a sequence of elements named characters. Would you care to suggest an alternative?

      What is a character in a language is well-understood and rarely, if ever, subject to debate. In the case of Unicode, "character" is well-defined, too: It's a "grapheme." It's as simple as that. Read the Unicode Standard.

      Unicode has four definitions for "character", and none correspond exactly to that of "grapheme".

      But it's irrelevant. Again, reverse's function isn't to manipulate text.

      There are four characters in the word "Café", not five.

      Yes, but irrelevant. reverse's function isn't to manipulate words. There are five characters in the string chr(0x43).chr(0x61).chr(0x66).chr(0x65).chr(0x301).

      Your insistence that there's a difference between "string" and "text" may have some strange basis in the arcane details of the internals of Perl,

      Again, quite the opposite. It has nothing to do with Perl internals. It's not even specific to Perl. A string is a data type. Text is one of many things that can be stored in a string.

      my $x = "abcd";                  # String? Yes. Text?
      my $host = inet_ntoa($x);        # No, "packed" IP address

      my $x = "abcd";                  # String? Yes. Text?
      print("The password is $x\n");   # Yes.
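A runnable version of the same contrast, as a sketch only. It assumes the core Socket module for inet_ntoa; the variable names are made up for illustration:

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

my $x = "abcd";

# Treated as a packed IPv4 address: the four elements are the four octets.
my $as_address = inet_ntoa($x);

# Treated as text: the four elements are letters.
my $as_text = "The password is $x";

print "$as_address\n";
print "$as_text\n";
```

The same four-element string is valid for both uses; nothing in the string itself says which one is intended.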

      You're suggesting that Perl should reverse strings as if they're text. For example, you say that reversing

      chr(0x65).chr(0x301)

      should return

      chr(0x65).chr(0x301)

      But that's wholly inappropriate for water level measurements or for anything else the string might be.

      There is need for a function that reverses strings (chr(101).chr(769) ⇒ chr(769).chr(101)). There is also a need for a function that reverses text (chr(101).chr(769) ⇒ chr(101).chr(769)). reverse does the former.
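The two operations can be sketched side by side. The names reverse_string and reverse_text are hypothetical, and the text version simply reuses the \X approach shown earlier in the thread; it is one possible way to do it, not a definitive one:

```perl
use strict;
use warnings;

# Reverses the elements of a string (what the built-in does in scalar context).
sub reverse_string {
    my ($s) = @_;
    return scalar reverse $s;
}

# Reverses extended grapheme clusters, treating the string as Unicode text.
sub reverse_text {
    my ($s) = @_;
    return join '', reverse $s =~ /\X/g;
}

my $e_acute = chr(101) . chr(769);

printf "string: %s\n", join ' ', map { ord($_) } split //, reverse_string($e_acute);
printf "text:   %s\n", join ' ', map { ord($_) } split //, reverse_text($e_acute);
```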

        You're just blowing the same old tired, incomprehensible smoke here as in so many other discussions on PerlMonks about Unicode. Until you can make a compelling case for why this Perl code…

        use utf8;
        binmode STDOUT, ':encoding(UTF-8)';
        print scalar reverse "Réaliste";   # scalar is needed: print alone puts reverse in list context

        …should produce different output than this Perl code…

        use utf8;
        binmode STDOUT, ':encoding(UTF-8)';
        print join '', reverse "Réaliste" =~ m/\X/g;

        …you're just arguing for the sake of argument about esoteric matters that aren't relevant at all to the topic at hand.

      As for the question you added,

      Explain why you think this… …should produce different output than this…

      I agree that there should be a function that does that, but it's not reverse.

      Explain why you think

      my $water_samples = join '', map chr, 113, 101, 769;
      my $last_sample = substr($water_samples, -1, 1);
      print(ord($last_sample), "\n");

      should produce different output than

      my $water_samples = join '', map chr, 113, 101, 769;
      $water_samples = reverse($water_samples);
      my $last_sample = substr($water_samples, 0, 1);
      print(ord($last_sample), "\n");

      Update: Corrected my numbers.
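To make the question concrete, here is a sketch that checks the property both snippets rely on: the first element of the reversed string is the last element of the original. reverse_by_grapheme is a hypothetical stand-in for a grapheme-aware reverse, included only for comparison.

```perl
use strict;
use warnings;

# Hypothetical grapheme-aware reversal, for comparison only.
sub reverse_by_grapheme {
    my ($str) = @_;
    return join '', reverse $str =~ /\X/g;
}

my $samples = join '', map { chr($_) } 113, 101, 769;

my $last            = ord(substr($samples, -1, 1));
my $first_builtin   = ord(substr(scalar reverse($samples), 0, 1));
my $first_grapheme  = ord(substr(reverse_by_grapheme($samples), 0, 1));

print "last sample:                  $last\n";
print "first after built-in reverse: $first_builtin\n";
print "first after grapheme reverse: $first_grapheme\n";
```

With the built-in, the two values agree. With the grapheme-aware version they differ, because 101 and 769 are kept together as one cluster.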

        Your rhetorical volley is irrelevant and off-topic as it has nothing whatever to do with Unicode.

        My simple, clear, plain point is that the Perl reverse function is broken because when it reverses what is simply, clearly, plainly a Unicode string with grapheme clusters or jamo in it, it gets it wrong.

        Perl's built-in string reverse function is not fully Unicode-conformant. That's all.
